Senior Streaming ASR/TTS Engineer

Project Overview:
We are building a real-time voice agent (voice→text→voice) integrated with our proprietary software stack. We need a senior part-time engineer for a few months to select, deploy, optimize, and fine-tune open-source ASR and TTS models for ultra-low-latency streaming and production-grade stability.

Primary languages: English, Ukrainian, Polish, Russian, German, Spanish, French, Portuguese (+ others as needed). A critical requirement is high-quality Slavic stress/pronunciation (especially UA/PL/RU) and robust handling of mixed-language inputs (code-switching).

Must-have:

  • Proven production experience with streaming ASR and/or streaming TTS (real demos/links required).
  • Strong real-time audio knowledge: VAD, endpointing, chunking, partial/final hypotheses, buffering/jitter, barge-in (interrupt handling).
  • Practical approach to text normalization (numbers/dates/currency/abbreviations) and Slavic stress/pronunciation (dictionaries/rules/G2P/SSML).
  • Linux + Docker, reproducible deployments, clean documentation.

Responsibilities:

  • Select and justify an open-source ASR/TTS stack for streaming and multilingual performance.
  • Implement streaming ASR: partial/final hypotheses, VAD + endpointing, stability, metrics.
  • Implement streaming TTS: low time-to-first-audio, stable chunked audio streaming (no dropouts), prosody control.
  • Build a text normalization + pronunciation/stress layer with tests (focus on UA/PL/RU).
  • Create training/fine-tuning pipelines on our private data + evaluation/regression.
  • Integrate into our agent pipeline (voice→text→LLM→text→voice), including barge-in and turn-taking.
  • Deliver Dockerized services + runbook (deploy/train/evaluate/debug).
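
For candidates who want a concrete picture of the "VAD + endpointing" item above, here is a minimal sketch of an energy-based endpointer with a silence hangover. Everything in it is an illustrative assumption rather than part of our stack: the class name `Endpointer`, the energy threshold, the frame size, and the hangover length are all hypothetical, and a production system would likely use a trained VAD model instead of raw frame energy.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Endpointer:
    """Toy endpointer: declares end-of-utterance after N consecutive quiet frames.

    All defaults are illustrative assumptions, not tuned values.
    """
    energy_threshold: float = 0.01   # mean squared amplitude treated as speech
    hangover_frames: int = 25        # e.g. 25 frames x 20 ms = 500 ms of silence
    in_speech: bool = False
    silence_run: int = 0

    def push(self, frame: Sequence[float]) -> bool:
        """Feed one frame of float samples; return True when the utterance ends."""
        energy = sum(s * s for s in frame) / len(frame) if frame else 0.0

        if energy >= self.energy_threshold:
            # Speech frame: (re)enter speech state and reset the silence counter.
            self.in_speech = True
            self.silence_run = 0
            return False

        if not self.in_speech:
            # Silence before any speech: nothing to endpoint yet.
            return False

        # Quiet frame after speech: count it toward the hangover.
        self.silence_run += 1
        if self.silence_run >= self.hangover_frames:
            self.in_speech = False
            self.silence_run = 0
            return True
        return False


if __name__ == "__main__":
    ep = Endpointer(hangover_frames=3)
    loud, quiet = [0.5] * 160, [0.0] * 160
    for i, frame in enumerate([quiet, loud, loud, quiet, quiet, quiet, quiet]):
        print(i, ep.push(frame))
```

The hangover length trades off directly against the "user stops → agent starts speaking" latency: a shorter hangover responds faster but is more likely to cut the user off mid-pause.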

Deliverables:

  1. Dockerized Streaming ASR service + documented API.
  2. Dockerized Streaming TTS service + chunked audio streaming.
  3. Text normalization + pronunciation/stress module (configurable dictionaries/rules, tests).
  4. Reproducible ASR/TTS training/fine-tuning pipelines.
  5. Evaluation + regression suite (latency + WER/CER + Slavic stress checklist).
  6. End-to-end integration demo with our agent + instructions.
  7. Documentation (architecture notes + runbook).

Acceptance Criteria (measured):

  • Latency: ASR time-to-first-partial; TTS time-to-first-audio; end-to-end “user stops → agent starts speaking”.
  • ASR quality: WER/CER on multilingual test sets (EN/UA/PL/RU/DE/ES/FR/PT).
  • Pronunciation/stress: pass rate on a curated Slavic stress/pronunciation checklist (UA/PL/RU).
  • Stability: reliable barge-in and turn-taking in real-time streaming.
  • Reproducibility: Docker deployment + repeatable training runs.
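
To make the WER/CER criterion above concrete, here is a minimal sketch of how both metrics are commonly computed: edit distance between reference and hypothesis, divided by reference length, over words for WER and over characters for CER. The function names are hypothetical, and the sketch assumes text is already normalized (casing, punctuation) before scoring; the actual evaluation suite may use an established library and different normalization.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(wer("the cat sat", "the cat sit"))
    print(cer("abc", "abd"))
```

Note that both rates can exceed 1.0 when the hypothesis contains many insertions, so regression thresholds should not assume a 0 to 1 range.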

Required skills and experience:

  • AI/ML: 1 year
  • Speech synthesis (TTS): 1 year
  • ASR: 1 year
  • Voice AI agents: 1 year
  • Low-latency systems: 1 year
  • Audio signal processing: 1 year

Required domain experience:

  • Education: 6 months

Required languages:

  • English: B2 (Upper Intermediate)
Published 30 December