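Automatic speech recognition (ASR) converts speech into text the software can use. Quality is usually measured by word error rate (WER): the number of substituted, deleted, and inserted words divided by the number of words in a reference transcript, so lower is better. The hard cases are background noise, accents, overlapping speakers, and domain-specific jargon. ASR is the first step in nearly every audio AI feature in a video product.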
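To make word error rate concrete, here is a minimal sketch of how it can be computed: a word-level edit distance (substitutions, deletions, insertions) divided by the reference length. The function name `word_error_rate` and the normalization (lowercasing and whitespace splitting) are assumptions for illustration; production evaluations typically apply more careful text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative sketch: normalization here is only lowercasing and
    whitespace splitting, which is an assumption, not a standard.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("turn on the camera", "turn on a camera"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, so it is better read as an error ratio than as a percentage bounded at 100.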