Senior Streaming ASR/TTS Engineer
Project Overview:
We are building a real-time voice agent (voice→text→voice) integrated with our proprietary software stack. We need a senior part-time engineer for a few months to select, deploy, optimize, and fine-tune open-source ASR and TTS models for ultra-low-latency streaming and production-grade stability.
Primary languages: English, Ukrainian, Polish, Russian, German, Spanish, French, Portuguese (+ others as needed). A critical requirement is high-quality Slavic stress/pronunciation (especially UA/PL/RU) and robust handling of mixed-language inputs (code-switching).
Must-have:
- Proven production experience with streaming ASR and/or streaming TTS (real demos/links required).
- Strong real-time audio knowledge: VAD, endpointing, chunking, partial/final hypotheses, buffering/jitter, barge-in (interrupt handling).
- Practical approach to text normalization (numbers/dates/currency/abbreviations) and Slavic stress/pronunciation (dictionaries/rules/G2P/SSML).
- Linux + Docker, reproducible deployments, clean documentation.
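To make the endpointing requirement concrete, here is a minimal sketch of a hangover-based endpointer that turns per-frame VAD decisions into utterance boundaries. The frame size, thresholds, and the names `EndpointerConfig` and `endpoint` are illustrative assumptions, not part of our stack; a production endpointer would likely also use ASR partials and model-based cues.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class EndpointerConfig:
    # All values are assumed defaults for illustration only.
    frame_ms: int = 20        # duration of one VAD frame
    min_speech_ms: int = 100  # consecutive speech needed to open an utterance
    hangover_ms: int = 300    # trailing silence that closes an utterance


def endpoint(
    vad_flags: Iterable[bool], cfg: Optional[EndpointerConfig] = None
) -> Iterator[Tuple[int, int]]:
    """Yield (start_ms, end_ms) utterances from per-frame VAD decisions."""
    cfg = cfg or EndpointerConfig()
    speech_run = 0   # consecutive speech frames seen while idle
    silence_run = 0  # consecutive silence frames seen inside an utterance
    start = None     # start time of the open utterance, if any
    n_frames = 0
    for i, is_speech in enumerate(vad_flags):
        n_frames = i + 1
        if start is None:
            speech_run = speech_run + 1 if is_speech else 0
            if speech_run * cfg.frame_ms >= cfg.min_speech_ms:
                # Backdate the start to the first frame of the speech run.
                start = (i - speech_run + 1) * cfg.frame_ms
                silence_run = 0
        else:
            silence_run = 0 if is_speech else silence_run + 1
            if silence_run * cfg.frame_ms >= cfg.hangover_ms:
                # The utterance ends where the trailing silence began.
                yield (start, (i + 1 - silence_run) * cfg.frame_ms)
                start, speech_run = None, 0
    if start is not None:
        # Stream ended mid-utterance: close at the last speech frame.
        yield (start, (n_frames - silence_run) * cfg.frame_ms)
```

The hangover value trades latency for robustness: a shorter hangover lets the agent respond sooner but risks cutting the user off during a pause.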
Responsibilities:
- Select and justify an open-source ASR/TTS stack for streaming and multilingual performance.
- Implement streaming ASR: partial/final hypotheses, VAD + endpointing, stability, metrics.
- Implement streaming TTS: low time-to-first-audio, stable chunked audio streaming (no dropouts), prosody control.
- Build a text normalization + pronunciation/stress layer with tests (focus on UA/PL/RU).
- Create training/fine-tuning pipelines on our private data, with evaluation and regression testing.
- Integrate into our agent pipeline (voice→text→LLM→text→voice), including barge-in and turn-taking.
- Deliver Dockerized services + runbook (deploy/train/evaluate/debug).
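As one possible shape for the pronunciation/stress layer, the sketch below applies a configurable stress lexicon to input text before it reaches the TTS model. It assumes stress is marked with a combining acute accent (U+0301); whether a given TTS model accepts that mark, and the lexicon format itself, are assumptions made for illustration. The two entries are examples only, and `STRESS_LEXICON` and `add_stress` are hypothetical names.

```python
import re
import unicodedata

ACUTE = "\u0301"  # combining acute accent, used here as the stress mark

# Hypothetical lexicon: lowercase word -> 0-based index of the stressed vowel.
STRESS_LEXICON = {
    "молоко": 5,
    "вода": 3,
}

# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def add_stress(text, lexicon=STRESS_LEXICON):
    """Insert stress marks for words found in the lexicon; leave the rest as is."""
    def mark(match):
        word = match.group(0)
        # Skip words that already carry a stress mark right after them.
        if match.string[match.end():match.end() + 1] == ACUTE:
            return word
        idx = lexicon.get(word.lower())
        if idx is None or idx >= len(word):
            return word
        return word[:idx + 1] + ACUTE + word[idx + 1:]

    return WORD_RE.sub(mark, unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    print(add_stress("Вода и молоко."))
```

A plain dictionary lookup cannot resolve homographs whose stress depends on meaning, so a real module would need rules or context on top of the lexicon, plus the tests this role calls for.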
Deliverables:
- Dockerized Streaming ASR service + documented API.
- Dockerized Streaming TTS service + chunked audio streaming.
- Text normalization + pronunciation/stress module (configurable dictionaries/rules, tests).
- Reproducible ASR/TTS training/fine-tuning pipelines.
- Evaluation + regression suite (latency + WER/CER + Slavic stress checklist).
- End-to-end integration demo with our agent + instructions.
- Documentation (architecture notes + runbook).
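For the latency part of the evaluation suite, a measurement helper could look roughly like the sketch below. It times the first non-empty chunk of any chunked stream and summarizes repeated runs with a nearest-rank percentile. The function names and the choice of percentile method are assumptions; the stream is expected to be lazy, so that the request starts when iteration starts.

```python
import math
import time
from typing import Callable, Iterable, Optional, Sequence, Tuple


def time_to_first_chunk(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Consume a chunk stream; return (first_chunk_s, total_s, n_chunks)."""
    t0 = clock()
    first = None
    n = 0
    for chunk in chunks:
        # Empty keep-alive chunks do not count as first audio.
        if first is None and chunk:
            first = clock() - t0
        n += 1
    return first, clock() - t0, n


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile, e.g. p=95 for p95 latency."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Passing a fake `clock` makes the helper deterministic in regression tests, while the default uses a monotonic high-resolution timer.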
Acceptance Criteria (measured):
- Latency: ASR time-to-first-partial; TTS time-to-first-audio; end-to-end “user stops → agent starts speaking”.
- ASR quality: WER/CER on multilingual test sets (EN/UA/PL/RU/DE/ES/FR/PT).
- Pronunciation/stress: pass rate on a curated Slavic stress/pronunciation checklist (UA/PL/RU).
- Stability: reliable barge-in and turn-taking in real-time streaming.
- Reproducibility: Docker deployment + repeatable training runs.
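For reference, the WER/CER criteria can be computed with a standard Levenshtein edit distance, as in the sketch below. It assumes whitespace tokenization for WER and ignores spaces for CER; casing and punctuation normalization are left out and would need to be agreed per language. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("empty reference")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, with spaces removed (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("empty reference")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, so regression thresholds should not assume a 0 to 1 range.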
Required skills experience
| AI/ML | 1 year |
| Speech synthesis (TTS) | 1 year |
| ASR | 1 year |
| Voice AI Agent | 1 year |
| Low-latency | 1 year |
| Audio Signal Processing | 1 year |
Required domain experience
| Education | 6 months |
Required languages
| English | B2 - Upper Intermediate |
Published 30 December