Published 2026-05-31 · 19 min read · By Nikolay Sapunov, CEO at Fora Soft
Why This Matters
If your product needs to know when each word was spoken or who spoke it, plain Whisper will quietly let you down, and WhisperX is the most common open-source fix. Subtitles that highlight each word as it is read — the karaoke effect on social clips — need per-word timing. A meeting summary that says "Maria committed to the deadline, not Tom" needs speaker labels. A telemedicine record that separates the doctor's voice from the patient's needs both. A searchable video archive that jumps you to the exact second a phrase was said needs accurate word timestamps, not Whisper's rough guesses that can be several seconds off — and those word-timed transcripts are the backbone of the retrieval index covered in the video RAG over an archive lesson. This article is for the product manager, founder, or engineering lead deciding whether to self-host WhisperX or buy a hosted transcription API. By the end you will understand the three stages WhisperX adds, the specific ways each one fails, the cost and operational trade-offs against a paid API, and the four-question test for picking the right path.
Where WhisperX Sits — And What Came Before It
This lesson builds directly on the previous one. In the streaming ASR in production lesson we compared three ways to turn live audio into text while a person is still talking. WhisperX solves a different, narrower problem: it transcribes a finished recording — a file you already have — and it does so with two qualities that plain Whisper cannot deliver on its own. The first is an accurate timestamp for every word. The second is a speaker label for every word. WhisperX is a batch tool, not a streaming one, and that single fact shapes everything about where it fits. (For the live-caption case, where a transcript is produced and fanned out to viewers in real time, see the live captions SFU fan-out lesson instead.)
Start with the word transcription. It means turning recorded speech into the text of what was said. Automatic speech recognition, abbreviated ASR, is the technology that does it. OpenAI's Whisper, released in 2022 under the permissive MIT license, is the open-source ASR model most teams reach for first: it is free to run on your own servers, handles 99 languages, and holds up well against accents and background noise (Radford et al., 2022). But Whisper was built to answer one question — what words were said? — and it answers two other questions badly.
The first question it answers badly is exactly when was each word said? Whisper produces a timestamp for each chunk of text it emits, but those timestamps are known to drift, sometimes by several seconds, and Whisper does not give you a timestamp for each individual word at all (Bain et al., 2023). The second question it cannot answer at all is who said it? Whisper transcribes a two-person conversation into one undifferentiated wall of text, with no idea that two people were talking.
WhisperX, published by Max Bain and colleagues at Oxford's Visual Geometry Group in 2023, exists to answer all three questions (Bain et al., 2023). It keeps Whisper as the engine that decides what was said, and adds machinery around it to decide when and who.
The Three Stages WhisperX Adds
The cleanest way to understand WhisperX is as Whisper plus three bolt-on stages, run in order. Picture an assembly line. The raw audio enters on the left, passes through four stations, and a fully timed, speaker-labelled transcript comes out on the right.
The first station is a voice activity detector, abbreviated VAD — a small, cheap model whose only job is to mark which slices of the audio contain speech and which are silence, music, or noise. The second station is Whisper itself, transcribing the speech into words. The third station is forced alignment — a separate model that snaps each word onto its true position in the sound. The fourth station is diarization — a model that figures out how many distinct voices are present and which stretches of audio belong to each. We will walk through each one, because each adds a real capability and each fails in a specific way you need to plan around.
Figure 1. WhisperX is Whisper plus three stages: a voice-activity detector finds the speech, forced alignment fixes the timing, and diarization assigns the speakers.
Stage One — Voice Activity Detection And The "Cut & Merge" Trick
Recall the one hard fact about Whisper from the streaming lesson: the model reads audio in fixed blocks of 30 seconds, because that is the length it was trained on. To transcribe a one-hour podcast, plain Whisper slides a 30-second window across the recording. The problem is that the window has to land somewhere, and it often lands in the middle of a word. A word cut in half at a window boundary gets mis-transcribed, and worse, Whisper uses its own shaky timestamps to decide where to slide the window next — so an error in one window pushes the next window to the wrong place, and the mistakes pile up. The technical name for this pile-up is drift (Bain et al., 2023).
WhisperX removes the guesswork by looking at the audio before Whisper does. It runs a voice activity detector first. The VAD produces a map of exactly where speech starts and stops across the whole recording. Now WhisperX can chop the audio at moments of silence, never in the middle of a word.
But there is a subtlety, and it is the clever part of the design. Whisper works best with close to 30 seconds of context, so feeding it lots of tiny one-second speech snippets would waste its ability and hurt accuracy. So WhisperX applies what the paper calls Cut & Merge. It cuts any speech region longer than 30 seconds at the quietest point inside it — the moment of least vocal activity, which is the safest place to split. Then it merges short neighbouring speech regions together until each combined chunk is close to, but not over, 30 seconds (Bain et al., 2023). The result is a set of chunks that are each about 30 seconds long and each begin and end on silence.
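The Cut & Merge idea described above can be sketched in a few lines. This is a minimal illustration, not WhisperX's own code: the function names are ours, and the `midpoint` helper is a stand-in for "the quietest point", which a real VAD would pick from its speech-probability scores.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds) of one speech region


def midpoint(start: float, end: float) -> float:
    """Stand-in for the quietest point; a real VAD would return the time of lowest speech score."""
    return (start + end) / 2


def cut_long(seg: Segment, max_len: float,
             quietest: Callable[[float, float], float]) -> List[Segment]:
    """Cut: split any region longer than max_len, recursively, at its quietest point."""
    start, end = seg
    if end - start <= max_len:
        return [seg]
    split = quietest(start, end)
    if not start < split < end:  # guard against a degenerate split point
        split = (start + end) / 2
    return (cut_long((start, split), max_len, quietest)
            + cut_long((split, end), max_len, quietest))


def merge_short(segments: List[Segment], max_len: float) -> List[Segment]:
    """Merge: grow each chunk with its neighbours while it stays within max_len."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= max_len:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def cut_and_merge(segments: List[Segment], max_len: float = 30.0,
                  quietest: Callable[[float, float], float] = midpoint) -> List[Segment]:
    pieces: List[Segment] = []
    for seg in sorted(segments):
        pieces.extend(cut_long(seg, max_len, quietest))
    return merge_short(pieces, max_len)


# Made-up VAD output: four short regions and one 70-second monologue.
vad_regions = [(0.0, 4.0), (5.0, 12.0), (13.0, 29.0), (31.0, 40.0), (50.0, 120.0)]
chunks = cut_and_merge(vad_regions)
```

In this toy input the first three regions merge into one 29-second chunk and the monologue is cut into four pieces, so no chunk exceeds 30 seconds and every chunk can be transcribed independently of the others.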
This buys two things at once. Because no chunk depends on any other chunk's timestamp, they no longer have to be processed in sequence — Whisper can transcribe all of them in parallel, in a batch, which is much faster. And because every chunk is cut on silence and sized to Whisper's sweet spot, the transcription is also more accurate, with less drift and fewer hallucinated words (Bain et al., 2023). The paper reports this batched approach gave a roughly twelve-fold speed-up over Whisper's sequential sliding window (Bain et al., 2023). The current open-source version pushes that further — its README reports up to 70 times faster than real time on the large model — by also swapping in a faster inference engine called faster-whisper, the same CTranslate2-based runtime we met in the streaming lesson (WhisperX GitHub, 2026).
A Note On The Two Speed Numbers
You will see two different speed claims for WhisperX and they are both true. The paper measured 11.8 times Whisper's own speed on an NVIDIA A40 GPU — close to the "twelve-fold" the authors headline — and, crucially, with no loss in transcription quality; the same VAD Cut & Merge that enables the batching also lowered word error rate and cut hallucination (Bain et al., 2023, Tables 2 and 3). The "70× real time" on the current GitHub page is a later engineering figure that combines that batching with the faster-whisper runtime, measured against a different baseline (WhisperX GitHub, 2026). Neither is wrong; they measure different things. The honest takeaway is that WhisperX turns a recording that plain Whisper transcribes slowly into one that finishes many times faster — the exact multiple depends on your GPU, the model size, and the batch you choose.
Stage Two — Forced Alignment: Pinning Each Word To The Sound
This is the stage that gives WhisperX its name and its main reason to exist. After Whisper has produced the words, WhisperX runs a second, smaller model over the same audio to find the precise start and end time of every single word. The technique is called forced alignment.
The idea is best understood by analogy. Imagine you have the script of a song and the recording, and you want to mark the exact instant each word is sung — the way a karaoke machine highlights lyrics in time. You are not guessing the words; you already know them from the script. You are only solving the timing: matching known text to known sound. That matching problem is forced alignment, and it is much easier and more precise than transcription, because the answer to "what was said" is already fixed.
WhisperX does this with a model from a family called wav2vec2, trained to recognise the individual speech sounds — the phonemes — in audio. For English, the default is a standard wav2vec2 model trained on 960 hours of audiobook speech, the model the paper found performs most consistently (Bain et al., 2023). The model scans the audio and, for each tiny slice, scores how likely each phoneme is. WhisperX then runs dynamic time warping over that score map — an algorithm that finds the single best time-path matching the sequence of phonemes in Whisper's transcript to the sound — and takes the start of a word's first phoneme and the end of its last phoneme as that word's timestamps (Bain et al., 2023).
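To make the "best time-path" idea concrete, here is a simplified sketch of monotonic alignment by dynamic programming. It is not WhisperX's implementation: it omits the blank token that real CTC-style aligners use, the emission scores are invented, and the 20 ms frame length is an assumption about the phoneme model's output rate.

```python
import math
from typing import Dict, List, Tuple

NEG = -math.inf


def force_align(emissions: List[Dict[str, float]], tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) per token on the best in-order path."""
    n_frames, n_tokens = len(emissions), len(tokens)
    if n_tokens == 0 or n_frames < n_tokens:
        raise ValueError("need at least one frame per token")
    # score[t][j]: best log-probability of frames 0..t with frame t assigned to token j
    score = [[NEG] * n_tokens for _ in range(n_frames)]
    advanced = [[False] * n_tokens for _ in range(n_frames)]
    score[0][0] = emissions[0].get(tokens[0], NEG)
    for t in range(1, n_frames):
        for j in range(min(t + 1, n_tokens)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG
            best = max(stay, move)
            if best == NEG:
                continue
            advanced[t][j] = move > stay
            score[t][j] = best + emissions[t].get(tokens[j], NEG)
    if score[-1][-1] == NEG:
        raise ValueError("no valid alignment")
    # Backtrack from the last frame of the last token.
    spans = [[0, 0] for _ in range(n_tokens)]
    j = n_tokens - 1
    spans[j][1] = n_frames - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][j]:
            spans[j][0] = t       # frame t is the first frame of token j
            j -= 1
            spans[j][1] = t - 1   # frame t-1 is the last frame of the previous token
    return [(first, last) for first, last in spans]


def word_times(emissions, words, frame_sec=0.02):
    """Word start = start of its first phoneme; word end = end of its last phoneme."""
    flat = [p for _, phones in words for p in phones]
    spans = force_align(emissions, flat)
    out, i = [], 0
    for text, phones in words:
        first, last = spans[i], spans[i + len(phones) - 1]
        out.append((text, round(first[0] * frame_sec, 3), round((last[1] + 1) * frame_sec, 3)))
        i += len(phones)
    return out


def frame(best, alphabet=("h", "ay", "y", "uw")):
    """Toy emission: one phoneme is clearly the most likely in this frame."""
    return {p: math.log(0.85 if p == best else 0.05) for p in alphabet}


emissions = [frame(p) for p in ["h", "ay", "ay", "y", "uw", "uw"]]
words = [("hi", ["h", "ay"]), ("you", ["y", "uw"])]
timed = word_times(emissions, words)
```

The point of the sketch is the division of labour: the phoneme model only scores frames, the transcript fixes the order of the phonemes, and the dynamic program finds where each one begins and ends.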
The payoff shows up in a measurable way. The paper evaluates timing with a word-segmentation test: a predicted word counts as correct only if its time window overlaps the human-labelled window within a 200-millisecond tolerance (a "collar") and the word text is an exact match. On that test, over telephone speech (the Switchboard corpus), WhisperX reaches 93.2% precision and 65.4% recall, versus 85.4% and 62.8% for using Whisper's own timestamps; on harder meeting audio (the AMI corpus) WhisperX reaches 84.1% precision and 60.3% recall, versus Whisper's 78.9% and 52.1% (Bain et al., 2023, Table 2). In plain terms: Whisper alone misses or mis-places far more word boundaries, and it even falls behind the much smaller wav2vec2 model on AMI precision — which is why WhisperX runs a dedicated alignment stage rather than trusting Whisper's timestamps. For karaoke captions, clip-highlighting, or jumping to the exact second of a search hit, that gap is the whole product.
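If you want to run the same kind of check on your own audio, the scoring rule can be sketched as below. This is our reading of the description above, not the paper's evaluation script: the greedy one-to-one matching and the tuple layout are assumptions, and the sample words are invented.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def overlaps_with_collar(pred: Word, ref: Word, collar: float = 0.2) -> bool:
    """True if the predicted window touches the reference window widened by the collar."""
    _, p_start, p_end = pred
    _, r_start, r_end = ref
    return p_start <= r_end + collar and p_end >= r_start - collar


def precision_recall(predicted: List[Word], reference: List[Word],
                     collar: float = 0.2) -> Tuple[float, float]:
    """A prediction is correct if text matches exactly and timing overlaps within the collar."""
    used = set()
    true_positives = 0
    for pred in predicted:
        for i, ref in enumerate(reference):
            if i in used:
                continue  # each reference word can be matched only once
            if pred[0] == ref[0] and overlaps_with_collar(pred, ref, collar):
                used.add(i)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


reference = [("we", 0.00, 0.20), ("ship", 0.25, 0.60), ("on", 0.65, 0.80), ("friday", 0.85, 1.40)]
predicted = [("we", 0.05, 0.22), ("ship", 1.30, 1.60), ("on", 0.70, 0.85), ("fryday", 0.90, 1.40)]
precision, recall = precision_recall(predicted, reference)
```

Here "ship" fails on timing and "fryday" fails on text, so two of four predictions count, which shows why the metric punishes both drifted timestamps and transcription errors.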
Figure 2. Forced alignment matches known words to known sounds, snapping each word from a drifted segment guess onto its true position — 93.2% precision on telephone speech at a 200 ms collar, versus 85.4% for Whisper alone.
The Forced-Alignment Pitfall You Will Hit
Forced alignment has one failure mode you must design around. The wav2vec2 model only knows the sounds of ordinary spoken words. When Whisper transcribes something that is not a normal dictionary word — a number written as digits like "2014", a currency amount like "£13.60", a symbol, or a word in a language the aligner does not cover — the aligner has no phonemes to match and cannot place that token. WhisperX then borrows the timestamp of the nearest neighbouring phoneme in the transcript for that word (Bain et al., 2023), so a transcript that is otherwise tightly aligned will have a few words pinned only loosely. If your product depends on perfect per-word timing — say, a legal transcript with timestamped figures — test specifically on audio full of numbers and names, because that is where alignment quietly degrades.
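One cheap way to size this risk before you commit is to count how many tokens in a sample transcript look unalignable. The sketch below is a heuristic of our own, not part of WhisperX: it assumes an English aligner whose dictionary covers only plain letters, so it will also flag accented names that a different alignment model might handle.

```python
import re
from typing import List

# Assumption: a token is "speakable" if it is letters only, optionally joined by ' or -.
_SPEAKABLE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")
_EDGE_PUNCTUATION = ".,;:!?\"()"


def risky_tokens(transcript: str) -> List[str]:
    """Return tokens that a letters-only aligner would probably fail to place."""
    risky = []
    for raw in transcript.split():
        token = raw.strip(_EDGE_PUNCTUATION)
        if token and not _SPEAKABLE.match(token):
            risky.append(token)
    return risky


def risk_ratio(transcript: str) -> float:
    """Share of tokens in the transcript that look unalignable."""
    total = len(transcript.split())
    return len(risky_tokens(transcript)) / total if total else 0.0


sample = "In 2014 the fee was £13.60 per seat, not $20."
flagged = risky_tokens(sample)
```

A transcript where a few percent of tokens are flagged is normal; a sample from invoices, court records, or support calls may show far more, which is the signal to test timing on that material specifically.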
A second, related caution: the alignment model, wav2vec2, is less robust to noise than Whisper itself. On clean studio audio the timestamps are excellent; on a noisy phone call they degrade faster than the transcription does (Towards AI, 2026). Clean audio in, clean timing out — which is why teams that run WhisperX on call recordings often clean the audio first with the techniques in the real-time noise suppression lesson.
Stage Three — Diarization: Labelling Who Spoke
The third stage answers who said it? The word is diarization — splitting an audio recording into segments by speaker, and labelling each segment with an anonymous tag like "Speaker A" and "Speaker B". It does not put names to voices; it only separates them. Attaching real names is a later step your application does, usually by asking the user or matching against known voices.
WhisperX hands this job to a separate, specialised library called pyannote.audio, the open-source standard for speaker diarization (WhisperX GitHub, 2026). Pyannote listens to the whole recording, decides how many distinct voices are present, and produces a timeline of which voice is active when. WhisperX then does the final, simple join: for each word it has already timed in stage two, it looks up which speaker's segment that word's time falls inside, and tags the word with that speaker (WhisperX GitHub, 2026). Because the words are already accurately timed, this overlap-matching is reliable — which is a nice illustration of why the stages run in this order. Accurate timing is what makes accurate speaker assignment possible.
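The join itself is simple enough to sketch. The version below assigns each timed word to the speaker turn it overlaps most; the dictionary keys and the speaker labels are illustrative assumptions, not WhisperX's exact data structures.

```python
from typing import Dict, List, Optional


def assign_speakers(words: List[Dict], turns: List[Dict]) -> List[Dict]:
    """Tag each word with the speaker whose turn overlaps it the most."""
    labelled = []
    for word in words:
        best_speaker: Optional[str] = None
        best_overlap = 0.0
        for turn in turns:
            # Length of the intersection of the two time windows (negative if disjoint).
            overlap = min(word["end"], turn["end"]) - max(word["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labelled.append({**word, "speaker": best_speaker})
    return labelled


turns = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 5.0, "speaker": "SPEAKER_01"},
]
words = [
    {"word": "hello", "start": 0.4, "end": 0.8},
    {"word": "there", "start": 1.9, "end": 2.3},  # straddles the turn boundary
    {"word": "hi", "start": 2.5, "end": 2.7},
    {"word": "late", "start": 6.0, "end": 6.2},   # falls outside every turn
]
labelled = assign_speakers(words, turns)
```

Note the two edge cases: a word that straddles a turn boundary goes to whichever side holds more of it, and a word outside every turn gets no speaker at all. Both become much more common when the word timestamps themselves are loose.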
We treat diarization as its own deep topic in the next lesson on Pyannote in production; here the point is only how WhisperX wires it in and what it costs you operationally.
Two Things That Will Trip You Up On Diarization
First, the friction. The pyannote diarization model is gated: to download it you must create a free Hugging Face account, accept the model's terms, and pass an access token to WhisperX alongside the --diarize flag (WhisperX GitHub, 2026). This is a one-time setup, but it surprises teams who expected pip install to be the end of it, and it is a recurring complaint in the project's issue tracker (WhisperX GitHub issues, 2026). Budget ten minutes for it and document the token handling, because a token that expires in production will silently turn speaker labels off.
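A small guard can make that failure loud instead of silent. The sketch below is our own suggestion rather than anything WhisperX ships: the environment variable name, the error class, and the coverage threshold are all assumptions. Checking that the token is present cannot tell you it is still valid, which is why the second function checks the symptom, namely words coming back without speaker labels.

```python
import os
from typing import Dict, List, Mapping, Optional


class MissingTokenError(RuntimeError):
    """Raised at startup when no Hugging Face token is configured."""


def require_hf_token(env: Optional[Mapping[str, str]] = None, var: str = "HF_TOKEN") -> str:
    """Fail fast instead of starting a worker that cannot load the gated model."""
    env = os.environ if env is None else env
    token = (env.get(var) or "").strip()
    if not token:
        raise MissingTokenError(f"{var} is not set; refusing to start without diarization access")
    return token


def speaker_coverage(words: List[Dict]) -> float:
    """Fraction of words that carry a speaker label after a job finishes."""
    if not words:
        return 0.0
    return sum(1 for w in words if w.get("speaker")) / len(words)


def diarization_looks_broken(words: List[Dict], min_coverage: float = 0.5) -> bool:
    """Alert condition: most words came back with no speaker label."""
    return speaker_coverage(words) < min_coverage


job_output = [{"word": "hello", "speaker": "SPEAKER_00"}, {"word": "there"}]
coverage = speaker_coverage(job_output)
```

Call `require_hf_token()` once at worker startup and `diarization_looks_broken()` after each job, and route a positive result to whatever alerting you already use.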
Second, the limits. Diarization is the slowest and most memory-hungry stage of the pipeline, and it is the least reliable. It struggles when two people talk over each other — overlapping speech is genuinely hard — and it can merge two similar voices into one or split one voice into two. The WhisperX authors are candid that diarization is "far from perfect" (WhisperX GitHub, 2026). Plan for a human to be able to correct speaker labels in any product where getting them wrong matters, such as a medical or legal record.
The Whole Pipeline In One View
The four stages run as a chain, and each one feeds the next. The table below is the mental model to keep.
| Stage | What it does | Model used | What it gives you | Where it fails |
|---|---|---|---|---|
| 1. Voice activity detection | Finds where speech is; cuts & merges into ~30 s chunks on silence | pyannote VAD | Faster, more accurate, batchable input | Very quiet speech can be missed |
| 2. Transcription | Turns speech into words, in parallel batches | Whisper (via faster-whisper) | The text of what was said | Rare words, heavy accents, noise |
| 3. Forced alignment | Pins each word to its true start/end time | wav2vec2 phoneme model | Word-level timestamps within a 200 ms collar | Numbers, symbols, unsupported languages |
| 4. Diarization | Splits and labels the audio by speaker | pyannote.audio | "Speaker A / B" tag on every word | Overlapping or similar voices; needs HF token |
Sources: Bain et al. (2023); WhisperX GitHub (2026).
WhisperX Versus A Hosted API
The real decision most teams face is not "WhisperX or plain Whisper" — it is "self-host WhisperX, or pay a hosted API like Deepgram or AssemblyAI that gives word timestamps and diarization as a built-in feature". The streaming lesson covered those APIs in depth; here is the comparison that matters specifically for the timestamp-and-speaker job.
| Criterion | WhisperX (self-hosted) | Hosted API (Deepgram / AssemblyAI) |
|---|---|---|
| Cost model | Your GPU time only; software is free | Per-hour fee (~$0.15–$0.46/hr streaming; batch tiers vary) |
| Word timestamps | Forced-aligned (93.2% precision on telephone speech) | Built in, vendor-tuned |
| Diarization | pyannote, free, needs HF token | Built in, one flag |
| Data location | Stays on your servers | Sent to vendor cloud |
| Streaming | No — batch only | Yes |
| Setup effort | GPU, CUDA, models, token, tuning | An API key |
| Languages | 99 transcribe; ~30 aligned out of the box | Vendor-dependent |
| Best when | On-prem rules; high steady volume; full control | Ship fast; spiky volume; want streaming |
Sources: Bain et al. (2023); WhisperX GitHub (2026); Deepgram and AssemblyAI pricing as cited in the streaming lesson.
The license is part of this decision. WhisperX is released under the permissive BSD-2-Clause license, Whisper under MIT, and pyannote's code under MIT — all of which let you ship them inside a commercial product for free (WhisperX GitHub, 2026). The only catch is the gated pyannote model weights, which are free but require accepting their terms. So "free" is true, with an asterisk: free software, free models, but you pay for the GPU and the engineer who runs it.
Worked Example — What Self-Hosting WhisperX Actually Costs
Numbers in the abstract do not settle a budget, so put real ones through. Suppose you need to process 2,000 hours of recorded audio per month — a podcast network, a course library, or a backlog of recorded consultations — and you want word timestamps and speaker labels on all of it. This is batch work: the files already exist, so latency does not matter, only throughput and cost.
WhisperX on the large model runs at up to 70 times real time on a capable GPU, but use a conservative, real-world figure of 30 times real time once diarization and alignment are included, since those stages are slower than transcription (WhisperX GitHub, 2026). At 30× real time, one GPU processes 30 hours of audio per wall-clock hour:
2,000 hours of audio ÷ 30 hours processed per GPU-hour = 66.7 GPU-hours per month.
Rent a suitable cloud GPU at about $1.00 per hour and the raw compute bill is:
66.7 GPU-hours × $1.00/hour = about $67 per month — in compute.
That is strikingly cheap, and it is why self-hosting WhisperX is attractive for steady batch volume. But the compute number is not the whole cost. Add the one-time engineering to build the pipeline, the ongoing maintenance, the GPU you may keep partly idle to absorb peaks, and the human time to spot-check diarization. For comparison, a hosted batch API at, say, $0.20 per hour would bill 2,000 × $0.20 = $400 per month — six times the WhisperX compute, but with zero pipeline to build or run.
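The arithmetic above is easy to rerun with your own numbers. The helper names below are ours, and the inputs are the same illustrative assumptions used in this example (30 times real time, $1.00 per GPU-hour, $0.20 per audio hour for the hosted API).

```python
def gpu_hours(audio_hours: float, realtime_factor: float) -> float:
    """GPU wall-clock hours needed to process the audio at the given speed multiple."""
    return audio_hours / realtime_factor


def self_host_compute_cost(audio_hours: float, realtime_factor: float,
                           gpu_price_per_hour: float) -> float:
    """Raw compute bill only; excludes engineering, maintenance, and idle capacity."""
    return gpu_hours(audio_hours, realtime_factor) * gpu_price_per_hour


def hosted_cost(audio_hours: float, price_per_audio_hour: float) -> float:
    """Bill from a hosted API that charges per hour of audio."""
    return audio_hours * price_per_audio_hour


def overhead_budget(audio_hours: float, realtime_factor: float,
                    gpu_price_per_hour: float, price_per_audio_hour: float) -> float:
    """Monthly non-compute cost at which self-hosting and the hosted API break even."""
    return (hosted_cost(audio_hours, price_per_audio_hour)
            - self_host_compute_cost(audio_hours, realtime_factor, gpu_price_per_hour))


compute = self_host_compute_cost(2000, 30, 1.00)   # about 66.67
hosted = hosted_cost(2000, 0.20)                   # about 400
budget = overhead_budget(2000, 30, 1.00, 0.20)     # about 333.33
```

With these inputs the gap between the two bills is roughly $333 per month. If the engineering, maintenance, and review time you spend on the pipeline costs more than that each month, the hosted API is cheaper all-in at this volume.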
The lesson is the mirror image of the streaming lesson's. For batch work at steady volume, WhisperX's per-hour compute cost is far below a hosted API, and the break-even tips toward self-hosting much sooner than it does for streaming — because you are not paying for idle GPUs waiting on live calls, you are running a queue flat-out. The hosted API still wins when volume is low or spiky, when you have no ML engineer, or when you need it shipped this week.
Common Mistake — Trusting Word Timestamps On Hard Tokens
The most damaging mistake we see with WhisperX is assuming every word timestamp is equally trustworthy. In a demo on clean speech, the timing looks flawless, so teams build features — a clickable transcript, a captioning engine, a clip-cutter — that treat every word's time as exact. Then real audio arrives full of prices, dates, phone numbers, and proper nouns, and a fraction of those tokens were never aligned at all. They borrow a neighbouring word's timestamp instead, and the clickable transcript jumps to the wrong second, or the auto-cut clip starts mid-word.
The fix is to know which words were truly aligned and which fell back. WhisperX's output marks this — an aligned word carries a tight start and end; an unaligned one inherits a neighbour's bounds. Build your feature to read that distinction: for anything irreversible or user-facing, treat a fallback timestamp as "approximate" and, where it matters, widen the clip boundary or flag the word for review. Test on audio that is deliberately full of numbers and names before you trust the timing in production, not after the first complaint.
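A sketch of that distinction in code is below. The output shape is an assumption: we treat a word entry with no start and end keys as one the aligner could not place, so verify that against the WhisperX version you run. The padding values are arbitrary illustrations.

```python
from typing import Dict, List, Tuple


def resolve_times(words: List[Dict]) -> List[Dict]:
    """Fill missing timestamps from the nearest timed neighbours and flag them as approximate."""
    resolved = []
    for i, word in enumerate(words):
        if "start" in word and "end" in word:
            resolved.append({**word, "approximate": False})
            continue
        prev_end = next((p["end"] for p in reversed(words[:i]) if "end" in p), None)
        next_start = next((n["start"] for n in words[i + 1:] if "start" in n), None)
        start = prev_end if prev_end is not None else next_start
        end = next_start if next_start is not None else prev_end
        resolved.append({**word, "start": start, "end": end, "approximate": True})
    return resolved


def clip_bounds(word: Dict, pad_exact: float = 0.05, pad_approx: float = 0.5) -> Tuple[float, float]:
    """Widen the clip around words whose timing is only approximate."""
    if word["start"] is None or word["end"] is None:
        raise ValueError("no timed neighbour available for this word")
    pad = pad_approx if word["approximate"] else pad_exact
    return max(0.0, word["start"] - pad), word["end"] + pad


def needs_review(words: List[Dict]) -> List[str]:
    """Words to surface to a human before anything irreversible is done with their timing."""
    return [w["word"] for w in words if w["approximate"]]


raw = [
    {"word": "paid", "start": 1.0, "end": 1.3},
    {"word": "£13.60"},                      # assumed shape of an unaligned token
    {"word": "today", "start": 2.1, "end": 2.5},
]
resolved = resolved_words = resolve_times(raw)
exact_clip = clip_bounds(resolved[0])
loose_clip = clip_bounds(resolved[1])
```

The design choice is to carry the `approximate` flag all the way to the feature that consumes the timestamp, so a clip-cutter or clickable transcript can decide for itself how much slack to add.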
How To Choose — Four Questions
The decision comes down to four questions, in order.
First, do you actually need word-level timing or speaker labels? If you only need a plain transcript of what was said, plain Whisper or faster-whisper is simpler and you can skip WhisperX entirely. WhisperX earns its extra stages only when the "when" or the "who" matters.
Second, must the audio stay on your own servers? For medical, legal, or otherwise regulated recordings that cannot be sent to a third-party cloud, self-hosted WhisperX is often the only compliant option, and the operational cost is the price of that compliance.
Third, is the work batch or live? WhisperX is batch-only — it processes finished files. If you need word timing on a live stream, WhisperX is the wrong tool; you want the streaming options from the previous lesson or a custom streaming alignment setup.
Fourth, do you have the team and the volume? WhisperX rewards steady, high batch volume and an engineer who can run a GPU pipeline. At low or spiky volume, or with no ML capacity, a hosted API that gives timestamps and diarization behind one API key will be cheaper all-in and far faster to ship.
Figure 3. Four questions, asked in order, route most timestamp-and-speaker projects to the right tool.
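The four questions can also be written down as a small routing function. This is one reading of them, not a rule: the function and its return strings are ours, and it checks "live" before "on-prem" on the assumption that WhisperX being batch-only rules it out for live audio wherever the data has to stay.

```python
def choose_path(needs_timing_or_speakers: bool, must_stay_on_prem: bool,
                is_live: bool, has_team_and_volume: bool) -> str:
    """Route a project to a tool using the four questions above."""
    if not needs_timing_or_speakers:
        return "plain Whisper or faster-whisper"
    if is_live:
        # WhisperX only processes finished files.
        return "streaming ASR or a custom streaming alignment setup"
    if must_stay_on_prem:
        return "self-hosted WhisperX"
    if has_team_and_volume:
        return "self-hosted WhisperX"
    return "hosted API"


# A regulated telemedicine archive: needs speakers, must stay on-prem, batch work.
telemedicine = choose_path(True, True, False, False)
# A small startup captioning a few clips a week with no ML engineer.
startup = choose_path(True, False, False, False)
```

Real projects rarely answer all four questions with a clean yes or no, so treat the output as a starting point for the discussion rather than the decision itself.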
Where Fora Soft Fits In
We build transcription features into the video products our clients ship, and the choice between WhisperX and a hosted API comes up constantly. We have used WhisperX where word-level timing and speaker labels had to be produced on the client's own infrastructure — searchable transcripts for OTT and surveillance archives, speaker-separated records for telemedicine consultations, and word-timed captions for e-learning libraries. The recurring lesson is that the model is the easy part. The hard part is the integration: extracting clean audio at the right sample rate, handling the alignment fallbacks so timestamps never silently lie, managing the Hugging Face token and gated models without breaking a deployment, and deciding honestly whether the client's volume justifies a GPU pipeline or a hosted API. We pick by the four questions above, not by which tool is fashionable.
What To Read Next
- Streaming ASR in production — Whisper, Deepgram, and AssemblyAI in 2026
- Speaker diarization with Pyannote in production
- Real-time multilingual speech translation in calls
Talk To Us / See Our Work / Download
- Talk to a video engineer about adding timestamped, speaker-labelled transcription to your product → /services/ai-software-development
- See our case studies in telemedicine, e-learning, OTT, and surveillance → /cases
- Download the WhisperX Pipeline Setup Checklist (one page, printable)
References
- Bain, M., Huh, J., Han, T., Zisserman, A. WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. Interspeech 2023 (ISCA Archive, doi:10.21437/Interspeech.2023-78); arXiv:2303.00747.
https://www.isca-archive.org/interspeech_2023/bain23_interspeech.pdf. Primary source for the three-stage architecture (VAD pre-segmentation, Cut & Merge into ~30 s chunks, batched Whisper transcription, wav2vec2 forced alignment via dynamic time warping), the 11.8× / "twelve-fold" batched-inference speed measured on an NVIDIA A40, the drift problem in buffered Whisper, the nearest-neighbour-phoneme fallback for out-of-dictionary tokens, the default configuration (Table 1: large-v2 Whisper, wav2vec2 BASE 960H, pyannote VAD), and the Table 2 word-segmentation results at a 200 ms collar (WhisperX 93.2% precision / 65.4% recall on Switchboard and 84.1% / 60.3% on AMI, versus Whisper's 85.4% / 62.8% and 78.9% / 52.1%), plus the TED-LIUM (WER 9.7) and Kincaid46 (WER 11.8) long-form numbers. All timing-accuracy claims in this article trace to this table, read directly from the ISCA PDF.
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). arXiv:2212.04356, 2022.
https://arxiv.org/abs/2212.04356. Primary source for Whisper's architecture (Transformer encoder-decoder), 680,000-hour training set, 30-second input window, 99-language coverage, MIT license, and the hallucination failure mode that WhisperX's VAD stage mitigates.
- WhisperX GitHub repository (m-bain/whisperX), release v3.8.5, accessed 2026-05-31.
https://github.com/m-bain/whisperX. Source for the current default models (large-v2 showcased, faster-whisper backend), the up-to-70×-real-time figure, the <8 GB GPU memory requirement for large-v2 at beam_size 5, the --diarize flag and Hugging Face token requirement, the wav2vec2 default alignment models and the language coverage of default aligners, the assign-word-to-overlapping-speaker logic, the BSD-2-Clause license, and the documented limits on alignment of non-dictionary tokens and on diarization of overlapping speech.
- Baevski, A., Zhou, H., Mohamed, A., Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv:2006.11477, 2020.
https://arxiv.org/abs/2006.11477. Primary source for the self-supervised speech-representation model that WhisperX's forced-alignment stage uses to score phonemes.
- Bredin, H. et al. pyannote.audio: neural building blocks for speaker diarization. ICASSP 2020.
https://github.com/pyannote/pyannote-audio, accessed 2026-05-31. Primary source for the diarization library WhisperX calls in stage four, its MIT-licensed code, and its gated pretrained model weights that require a Hugging Face token.
- Graves, A., Fernández, S., Gomez, F., Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006.
https://www.cs.toronto.edu/~graves/icml_2006.pdf. The connectionist-temporal-classification (CTC) formulation underlying the phoneme alignment WhisperX performs; standards-equivalent algorithmic source for the dynamic-programming alignment search.
- TorchAudio Forced Alignment Tutorial. PyTorch documentation, accessed 2026-05-31.
https://pytorch.org/audio/stable/tutorials/forced_alignment_tutorial.html. The reference implementation of the trellis-and-backtrack (Viterbi) forced-alignment procedure that WhisperX's alignment stage adapts, and the source of its default wav2vec2 torchaudio bundles.
- Modal. Choosing Whisper Variants for Production Transcription. modal.com/blog, 2026, accessed 2026-05-31.
https://modal.com/blog/choosing-whisper-variants. Secondary source for the practitioner positioning of WhisperX as the go-to open-source option when word-level timestamps and speaker labels matter. No timing-accuracy number in this article relies on it; all such figures come from the paper's Table 2 (reference 1). Vendor-blog tier.
- Towards AI. Whisper Variants Comparison: Features and Implementation. towardsai.net, 2026, accessed 2026-05-31.
https://towardsai.net/p/machine-learning/whisper-variants-comparison-what-are-their-features-and-how-to-implement-them. Secondary source for the practical observation that wav2vec2 alignment is less noise-robust than Whisper transcription and degrades faster on noisy audio.


