One-page reference: the four-step pipeline (hear, ASR, MT, TTS); earbuds as an audio endpoint in front of that pipeline (on-device vs phone+cloud); cascade vs direct speech-to-speech trade-offs; why live translations rewrite themselves (word reorder, ear-voice span, wait-k, replace-until-final); the latency budget (~1 s machine + ~2-3 s deliberate wait; ITU-T G.114 150 ms / <800 ms feels live / AIIC 3-5 s ceiling); SFU-side per-listener-language fan-out (captions on RFC 8831 data channel, ducked audio on an injected RFC 6716 Opus track, VAD-gated); the per-language cost math ($21.60 for a 1-hour 3-language event); and a build-vs-buy checklist.