ASR (Automatic Speech Recognition), also called speech-to-text, converts the audio of a call or recording into written text for live captions, searchable transcripts, summaries, and voice commands. In a conferencing pipeline it sits downstream of the media path: audio is branched off the mixed stream or the per-speaker streams, cleaned and segmented (usually with VAD, voice activity detection, to find utterance boundaries), then fed to a recognition model that runs in the cloud or on-device. Modern ASR is built on deep neural networks and large speech models, and its accuracy depends heavily on the quality of the audio it receives, which is why noise suppression, echo cancellation, and good capture upstream directly improve the transcript downstream.
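
The segmentation step can be illustrated with a minimal sketch. It assumes a simple energy-threshold VAD over fixed-size frames, which is a stand-in for illustration only; production systems typically use trained VAD models. The names `segment_utterances` and `transcribe_segments`, the `recognize` callback, and the default frame length, threshold, and hangover values are all hypothetical and not taken from any particular library.

```python
from typing import Callable, List, Sequence, Tuple


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def segment_utterances(
    samples: Sequence[float],
    frame_len: int = 160,      # assumed: 10 ms frames at 16 kHz
    threshold: float = 0.01,   # assumed energy threshold for "speech"
    hangover_frames: int = 3,  # silent frames tolerated inside an utterance
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of regions classified as speech."""
    segments: List[Tuple[int, int]] = []
    start = None       # sample index where the current utterance began
    silent_run = 0     # consecutive silent frames since the last speech frame
    n_frames = len(samples) // frame_len  # a trailing partial frame is ignored

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if frame_energy(frame) >= threshold:
            if start is None:
                start = i * frame_len
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > hangover_frames:
                # Close the utterance right after its last speech frame.
                segments.append((start, (i - silent_run + 1) * frame_len))
                start = None
                silent_run = 0

    if start is not None:
        # Audio ended mid-utterance: trim any trailing silent frames.
        segments.append((start, (n_frames - silent_run) * frame_len))
    return segments


def transcribe_segments(
    samples: Sequence[float],
    recognize: Callable[[Sequence[float]], str],  # hypothetical ASR model call
) -> List[str]:
    """Feed each detected utterance to a caller-supplied recognizer."""
    return [recognize(samples[a:b]) for a, b in segment_utterances(samples)]
```

In this sketch the hangover count keeps short pauses inside one utterance, so a brief gap between words does not split the text sent to the recognizer. The `recognize` argument is where a cloud or on-device model would be plugged in.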