Senior Streaming ASR/TTS Engineer

Project Overview:
We are building a real-time voice agent (voice→text→voice) integrated with our proprietary software stack. We need a senior part-time engineer for a few months to select, deploy, optimize, and fine-tune open-source ASR and TTS models for ultra-low-latency streaming and production-grade stability.

Primary languages: English, Ukrainian, Polish, Russian, German, Spanish, French, Portuguese (+ others as needed). A critical requirement is high-quality Slavic stress/pronunciation (especially UA/PL/RU) and robust handling of mixed-language inputs (code-switching).

Must-have:

Proven production experience with streaming ASR and/or streaming TTS (real demos/links required).
Strong real-time audio knowledge: VAD, endpointing, chunking, partial/final hypotheses, buffering/jitter, barge-in (interrupt handling).
Practical approach to text normalization (numbers/dates/currency/abbreviations) and Slavic stress/pronunciation (dictionaries/rules/G2P/SSML).
Linux + Docker, reproducible deployments, clean documentation.

Responsibilities:

Select and justify an open-source ASR/TTS stack for streaming and multilingual performance.
Implement streaming ASR: partial/final hypotheses, VAD + endpointing, stable output, metrics.
Implement streaming TTS: low time-to-first-audio, stable chunked audio streaming (no dropouts), prosody control.
Build a text normalization + pronunciation/stress layer with tests (focus on UA/PL/RU).
Create training/fine-tuning pipelines on our private data, plus evaluation and regression testing.
Integrate into our agent pipeline (voice→text→LLM→text→voice), including barge-in and turn-taking.
Deliver Dockerized services + runbook (deploy/train/evaluate/debug).

Deliverables:

Dockerized Streaming ASR service + documented API.
Dockerized Streaming TTS service + chunked audio streaming.
Text normalization + pronunciation/stress module (configurable dictionaries/rules, tests).
Reproducible ASR/TTS training/fine-tuning pipelines.
Evaluation + regression suite (latency + WER/CER + Slavic stress checklist).
End-to-end integration demo with our agent + instructions.
Documentation (architecture notes + runbook).

Acceptance Criteria (measured):

Latency: ASR time-to-first-partial; TTS time-to-first-audio; end-to-end “user stops → agent starts speaking”.
ASR quality: WER/CER on multilingual test sets (EN/UA/PL/RU/DE/ES/FR/PT).
Pronunciation/stress: pass rate on a curated Slavic stress/pronunciation checklist (UA/PL/RU).
Stability: reliable barge-in and turn-taking in real-time streaming.
Reproducibility: Docker deployment + repeatable training runs.
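
As an illustration of how the ASR quality criterion above could be scored, here is a minimal sketch of WER/CER based on edit distance. The function names (`edit_distance`, `wer`, `cer`) are hypothetical, and the normalization shown (lowercasing, whitespace tokenization, ignoring spaces for CER) is an assumption rather than an agreed scoring protocol; the final evaluation suite may normalize text differently.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref prefix of length i to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.lower().split()  # assumed normalization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace (an assumption of this sketch)."""
    ref_chars = list("".join(reference.lower().split()))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list("".join(hypothesis.lower().split()))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(wer("добрий день як справи", "добрий день як справа"))
```

Because it works on Unicode strings, the same sketch applies to UA/PL/RU text as well as to the Latin-script languages; per-language test sets would simply be scored separately.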

Required skills experience

AI/ML	1 year
speech synthesis (TTS)	1 year
ASR	1 year
Voice AI Agent	1 year
Low-latency	1 year
Audio Signal Processing	1 year

Required domain experience

Education	6 months

Required languages

English	B2 - Upper Intermediate

Published 30 December 2025


Experience: from 1 year
Work format: Full Remote
Countries where we consider candidates: Worldwide

Employment: Part-time
Domain: Education
Product
