Real-Time Speech Translation (Earbuds & WebRTC) Engineering Cheat Sheet

A one-page reference to real-time speech translation. It covers the four-step pipeline (hear, ASR, MT, TTS); earbuds as an audio endpoint in front of that pipeline (on-device vs. phone plus cloud); and the trade-offs between cascade and direct speech-to-speech designs. It explains why live translations rewrite themselves (word reordering, ear-voice span, wait-k, replace-until-final) and lays out the latency budget: about 1 s of machine time plus a deliberate wait of about 2-3 s, set against three reference points (the 150 ms of ITU-T G.114, the sub-800 ms range that feels live, and the 3-5 s AIIC ceiling). It then describes SFU-side fan-out per listener language (captions on an RFC 8831 data channel, ducked audio on an injected RFC 6716 Opus track, VAD-gated), works through the per-language cost math ($21.60 for a 1-hour, 3-language event), and ends with a build-vs-buy checklist.
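To make the wait-k item concrete, here is a minimal sketch of the read/write schedule as wait-k is commonly described in the simultaneous-translation literature: the translator reads k source tokens before emitting its first target token, then alternates one read with one write until the source runs out. The function name `wait_k_schedule` and the fixed token counts are assumptions for illustration only; a live system does not know the target length in advance, and the cheat sheet itself may present the policy differently.

```python
from typing import List, Tuple


def wait_k_schedule(num_source: int, num_target: int, k: int) -> List[Tuple[str, int]]:
    """Return the READ/WRITE action sequence of a wait-k policy.

    Hypothetical helper: token counts are given up front purely so the
    schedule can be printed; a real decoder decides when to stop writing.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    actions: List[Tuple[str, int]] = []
    read = 0
    for written in range(num_target):
        # Before writing target token `written`, wait-k wants k + written
        # source tokens available, capped at the end of the source.
        needed = min(k + written, num_source)
        while read < needed:
            actions.append(("READ", read))
            read += 1
        actions.append(("WRITE", written))
    return actions


def source_seen_per_write(actions: List[Tuple[str, int]]) -> List[int]:
    """Count how many source tokens had been read before each WRITE."""
    seen, out = 0, []
    for kind, _ in actions:
        if kind == "READ":
            seen += 1
        else:
            out.append(seen)
    return out


if __name__ == "__main__":
    schedule = wait_k_schedule(num_source=4, num_target=4, k=2)
    print(schedule)
    print(source_seen_per_write(schedule))
```

A larger k gives the translator more context before it commits, at the cost of a longer ear-voice span; a smaller k lowers delay but makes later corrections more likely.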


Specialist software house for video, real-time and AI products. Founded 2005. 50 in-house engineers.

© Fora Soft, 2005-2026