A transcription pipeline is the chain of steps that turns live call audio into text and stored artefacts. It branches the audio off the mix or off the individual speaker streams, segments it into utterances with voice activity detection (VAD), runs automatic speech recognition (ASR), and post-processes the result into captions, searchable transcripts, and summaries, kept alongside the recording itself. Running the pipeline per speaker rather than on a single mix makes it straightforward to attribute who said what, since each stream already belongs to one participant, and speech enhancement upstream improves accuracy at every later stage. The pipeline is where audio leaves the real-time call and becomes durable data, which raises design questions about latency (live captions versus after-the-fact transcripts), privacy and consent, storage, and where the recognition runs.
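
The per-speaker flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the energy threshold is a crude stand-in for a real voice activity detector, `recognize` is a hypothetical callback standing in for whatever speech recognizer is used, and the 20 ms frame length, the threshold, and all names here are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

FRAME_MS = 20  # assumed frame duration; real systems vary

Frame = Sequence[float]  # one frame of PCM samples scaled to [-1.0, 1.0]


@dataclass
class Utterance:
    speaker: str
    start_ms: int
    end_ms: int
    text: str


def frame_energy(frame: Frame) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def segment_utterances(
    frames: Sequence[Frame],
    threshold: float = 0.01,      # assumed energy threshold
    hangover_frames: int = 2,     # silent frames tolerated inside an utterance
) -> List[Tuple[int, int]]:
    """Energy-based stand-in for VAD: return [start, end) frame ranges."""
    segments: List[Tuple[int, int]] = []
    start = None
    silence = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence > hangover_frames:
                # Close the utterance just after its last voiced frame.
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        # Stream ended mid-utterance; drop any trailing silent frames.
        segments.append((start, len(frames) - silence))
    return segments


def transcribe_call(
    streams: Dict[str, Sequence[Frame]],
    recognize: Callable[[Sequence[Frame]], str],  # hypothetical ASR callback
) -> List[Utterance]:
    """Segment each speaker's stream, recognize it, and merge by start time."""
    utterances: List[Utterance] = []
    for speaker, frames in streams.items():
        for start, end in segment_utterances(frames):
            text = recognize(frames[start:end])
            utterances.append(
                Utterance(speaker, start * FRAME_MS, end * FRAME_MS, text)
            )
    # Attribution comes for free: each utterance carries its stream's speaker.
    utterances.sort(key=lambda u: (u.start_ms, u.speaker))
    return utterances
```

Because every stream is segmented and recognized separately, the speaker label never has to be inferred from the audio; merging is just a sort by start time. Working from a single mix would instead need a separate step to decide who is speaking in each segment.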

