Published 2026-06-02 · 20 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

If your product captures speech — a meeting tool, a dictation feature, a language app, a voice-controlled interface, a kiosk — you face one architectural fork before you write any code: does the speech recognition run on your servers, or on the user's device? Running it on the device promises three things product teams want badly: the audio stays private because it never travels, the feature keeps working with no internet, and you pay no per-minute cloud bill no matter how much people talk. This article is for the product manager, founder, or engineering lead who has heard "we can just run Whisper in the browser" and needs to know whether that is true, what it really takes, and where it quietly falls down. It is the companion to our article on the server-side captions pattern; that one argues for transcribing once on the server for group calls, and this one maps the opposite path — when the device is the right place to listen.

First, Untangle The Name

The phrase "faster-whisper-WASM in browser" is doing too much work, and pulling it apart is the fastest way to understand the whole field. Three different things hide inside it, and confusing them is the most common mistake in this topic.

Start with the model. Whisper is a speech-to-text model released by OpenAI in 2022 — software trained to listen to audio and write down the words. It comes in sizes, from a "tiny" version with 39 million internal parameters up to a "large" version with about 1.5 billion, where more parameters mean more accuracy but a bigger, slower model. The original Whisper is a research release written in Python; on its own it is not built for a browser or for speed.

Now the confusing part. "Faster-whisper" is a specific, separate project — a reimplementation of Whisper built on a fast inference engine called CTranslate2 that runs the same model several times quicker while using less memory. But faster-whisper runs on a server or a desktop, in Python, often on a graphics card. It does not run inside a web browser, and there is no official "faster-whisper-WASM." When people say that phrase, they almost always mean one of the genuine browser engines below, and have borrowed the well-known "faster-whisper" name as a stand-in for "a fast Whisper." Getting this straight matters, because if you go looking for a faster-whisper browser build you will not find one, and you will waste a week.

So what actually runs in a browser? Two engines do the real work. The first is whisper.cpp — a rewrite of Whisper in the C and C++ programming languages with no outside dependencies, which can be compiled into WebAssembly (abbreviated Wasm), the format that lets near-native compiled code run inside a browser. The second is Transformers.js — a JavaScript library from Hugging Face that runs Whisper through a browser engine called ONNX Runtime Web and can use the device's graphics card through a new standard called WebGPU. A third, lighter option, Moonshine, is a newer model purpose-built for on-device live transcription. Those three — not "faster-whisper" — are what you reach for when the goal is recognition in the browser.

A landscape comparison on white titled "Four Things People Mean By 'Whisper In The Browser'". Four labelled cards in a row. Card 1 "faster-whisper" tagged in orange "server / desktop only — NOT in a browser", note "CTranslate2 engine, Python, GPU". Card 2 "whisper.cpp + WebAssembly" tagged green "runs in the browser", note "C/C++ compiled to Wasm, CPU, SIMD". Card 3 "Transformers.js + WebGPU" tagged green "runs in the browser", note "ONNX Runtime Web, uses the GPU". Card 4 "Web Speech API" tagged blue "built into the browser", note "but Chrome sends audio to Google's servers". A divider separates Card 1 from the rest. Footer: Fora Soft · www.forasoft.com. Figure 1. The single most useful distinction in this topic: "faster-whisper" is a server-side library, while the things that truly run in a browser are whisper.cpp (WebAssembly), Transformers.js (WebGPU), and the built-in Web Speech API.

Why Not Just Use The Browser's Built-In Speech Recognition?

Before downloading any model, a fair question is whether the browser already does this for free. It nearly does. There is a built-in browser feature called the Web Speech API, and within it an interface named SpeechRecognition that turns microphone speech into text with a few lines of JavaScript. If it solved the problem, this article would be one paragraph.

It does not solve the problem for two reasons. The first is privacy, and it is the big one. In Chrome — the main browser that supports this feature — the default speech recognition is not done on your device at all. The browser records your microphone and sends the audio to a Google web service, which transcribes it and sends text back. So the one feature people reach for to keep audio on the device is, by default, doing the exact opposite: shipping the audio to a third party and requiring an internet connection. That defeats the privacy and offline goals that make client-side recognition attractive in the first place.

The second reason is support. SpeechRecognition is dependable mainly in Chrome and other Chromium-based browsers such as Edge; in Safari and Firefox, support has long been inconsistent or missing. A feature that works in one browser family and silently fails in others is not something you can ship to all your users.

Newer versions of the standard add a switch to force recognition to happen locally on the device, and browsers are slowly adding on-device modes. But you do not control which browser or version your user has, and you cannot promise a customer that audio stays on the device when the underlying engine might send it to a server. When privacy or offline operation is the actual requirement, you need an engine you ship and control — which means downloading a model and running it yourself with whisper.cpp or Transformers.js. The rest of this article is about doing that well.

The In-Browser Pipeline, Step By Step

Putting Whisper in a web page is not one action; it is a short assembly line. Walking it from microphone to caption shows where the real work and the real costs sit.

It begins with capturing audio. The browser asks the user for microphone permission through a standard call named getUserMedia, which hands back a live audio stream. That stream is at whatever sample rate the device uses, commonly 44,100 or 48,000 samples per second, but Whisper expects audio at exactly 16,000 samples per second in a single channel. So the next step resamples and downmixes the audio — a small piece of code, usually running in an audio worklet (a background audio processor that does not freeze the page), turns the microphone feed into the 16-kilohertz mono stream the model wants.
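The resample-and-downmix step can be sketched as two small pure functions. This is a minimal illustration, not the code any particular engine ships: the names `downmixToMono` and `resampleLinear` are hypothetical, and linear interpolation is the simplest possible resampler. A production worklet would usually low-pass filter before downsampling to avoid aliasing.

```typescript
// Sample rate Whisper expects, as stated in the text.
const TARGET_RATE = 16000;

/** Average all input channels into a single mono channel. */
function downmixToMono(channels: Float32Array[]): Float32Array {
  const length = channels.length > 0 ? channels[0].length : 0;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

/**
 * Resample by linear interpolation between neighbouring samples.
 * Hypothetical helper for illustration; it applies no anti-aliasing filter.
 */
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number = TARGET_RATE,
): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples consumed per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}
```

One second of 48,000-sample stereo audio passed through `downmixToMono` and then `resampleLinear(mono, 48000)` comes out as 16,000 mono samples, which is the shape the model wants.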

Next comes the gate, and it is the single most important efficiency trick. Speech recognition is expensive to run; silence is not worth transcribing. A small, cheap detector called Voice Activity Detection — VAD, software that answers the yes/no question "is anyone speaking right now?" — sits in front of the model. The common browser choice is a compact model called Silero VAD, run through the same ONNX Runtime Web engine, which scores each short slice of audio from 0 to 1 for the chance it contains speech. Only audio that clears the bar gets passed on. Without this gate, your engine grinds on every second of background hum and burns battery for nothing.
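The gating logic can be illustrated without the detector itself. The sketch below assumes you already have one speech probability (0 to 1) per audio frame, as the text describes, and turns that sequence into speech segments. The function name, the two thresholds, and the hangover count are all hypothetical choices for illustration, not values taken from Silero VAD or any wrapper library.

```typescript
interface GateOptions {
  startThreshold: number; // probability at or above which speech begins (assumed value)
  endThreshold: number;   // probability below which a frame counts as quiet (assumed value)
  hangoverFrames: number; // quiet frames tolerated before the segment is closed
}

/** Returns [start, end) frame-index ranges judged to contain speech. */
function speechSegments(
  probs: number[],
  opts: GateOptions,
): Array<[number, number]> {
  const segments: Array<[number, number]> = [];
  let start = -1; // -1 means "not currently inside a speech segment"
  let quiet = 0;  // consecutive quiet frames seen inside the current segment
  for (let i = 0; i < probs.length; i++) {
    const p = probs[i];
    if (start < 0) {
      if (p >= opts.startThreshold) {
        start = i;
        quiet = 0;
      }
    } else if (p < opts.endThreshold) {
      quiet++;
      if (quiet > opts.hangoverFrames) {
        // Close the segment at the first quiet frame of this run.
        segments.push([start, i - quiet + 1]);
        start = -1;
        quiet = 0;
      }
    } else {
      quiet = 0; // a short dip was bridged; keep the segment open
    }
  }
  // Flush a segment still open at the end, trimming trailing quiet frames.
  if (start >= 0) segments.push([start, probs.length - quiet]);
  return segments;
}
```

Only the frames inside the returned ranges would be handed to the recognizer. The hangover keeps a brief pause inside a sentence from splitting it in two, and the separate start and end thresholds stop the gate from flickering when the probability hovers around a single cut-off.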

Then the model itself runs. The gated 16-kilohertz audio is fed to Whisper — executing either as WebAssembly through whisper.cpp or on the graphics card through Transformers.js — and the model returns text. That text is displayed, and for a live experience it is shown in two stages: a fast rough guess while the person is still talking, then a settled final line once they pause. We cover that "partial then final" behaviour in depth in the live captions article; it works the same way whether recognition runs on a server or, as here, on the device.

The whole chain — microphone, resample, VAD gate, model, display — lives entirely inside the browser tab. No audio leaves. That is the property you are buying, and the pipeline is the price.

A left-to-right pipeline on white titled "The In-Browser ASR Pipeline". Boxes connected by arrows: "Microphone" → "getUserMedia (live audio stream)" → "Resample to 16 kHz mono (audio worklet)" → a diamond "Silero VAD gate — speaking?" with a 'yes' arrow to "Whisper engine (whisper.cpp Wasm OR Transformers.js WebGPU)" → "Caption text on screen — partial then final". A dashed box around the whole row labelled "all of this runs inside the browser tab — no audio leaves the device". Below the model box a small note: "model file downloaded once, then cached (31–182 MB)". Footer: Fora Soft · www.forasoft.com. Figure 2. Every stage from microphone to caption runs inside the browser tab. The VAD gate before the model is what keeps battery use sane; the model file is a one-time download that is then cached.

The Two Engines: WebAssembly Versus WebGPU

The model has to execute on the device's hardware somehow, and there are two routes. They differ enough that the choice shapes your whole feature, so it is worth understanding both plainly.

The first route is WebAssembly, used by whisper.cpp. WebAssembly is a way to take code written in fast, compiled languages like C and run it inside the browser at close to native speed. It executes on the device's main processor, the CPU. To go faster it leans on two extras: SIMD, which lets one instruction process several numbers at once, and threads, which spread work across several CPU cores at the same time. WebAssembly is widely supported and works without a graphics card, which makes it the dependable floor — but it is bounded by CPU speed, so on a large model it can struggle to keep up with live speech.

The second route is WebGPU, used by Transformers.js. WebGPU is a new web standard, developed at the W3C (the body that standardizes much of the web), that lets a web page use the device's graphics card, the GPU, for general number-crunching, not just drawing graphics. Because a GPU is built to do thousands of small calculations in parallel, it suits the matrix math inside a speech model far better than a CPU. Hugging Face measured WebGPU running their models up to one hundred times faster than the WebAssembly path for the same work. That headline number is workload-dependent and will not hold for every model, but the direction is real: when a GPU is available, it transforms what model size you can run live in a browser.

There is a catch with WebGPU: it is newer, so it is not everywhere. It shipped on by default in Chrome and Edge in 2023, reached Apple's Safari and Mozilla's Firefox through 2025, and by late 2025 ran by default across all the major browsers — but an older browser or a low-end device may not have it. A robust product therefore treats WebGPU as the fast path and WebAssembly as the fallback: use the graphics card when it is there, drop to the CPU when it is not.
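The fast-path-with-fallback rule reduces to a small decision function. The sketch below keeps the browser probing out of the function so the logic stays testable. The `GpuProbe` shape and the `chooseDevice` name are hypothetical. In a real page the two booleans would typically come from checking whether `navigator.gpu` exists and whether `navigator.gpu.requestAdapter()` resolves to a non-null adapter.

```typescript
type Device = "webgpu" | "wasm";

/** Result of probing the browser; field names are assumptions of this sketch. */
interface GpuProbe {
  apiPresent: boolean;   // the page can see a WebGPU entry point at all
  adapterFound: boolean; // the browser actually returned a usable GPU adapter
}

/** Use the GPU when it is genuinely available, otherwise fall back to the CPU. */
function chooseDevice(probe: GpuProbe): Device {
  return probe.apiPresent && probe.adapterFound ? "webgpu" : "wasm";
}
```

Both conditions matter: a browser can expose the API and still fail to return an adapter, for example on a device whose graphics hardware is unsupported. The returned string can then be handed to whichever engine you use as its device setting.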

There is one more rule that trips up teams using the WebAssembly route specifically. To use threads — running on several CPU cores — WebAssembly needs a shared block of memory, and browsers only allow that when the page turns on a security mode called cross-origin isolation. In practice that means your server must send two specific HTTP headers (named, for the record, Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy) with every page. Forget them and the multi-core path silently switches off, your transcription runs on a single core, and it crawls. This is a configuration line, not a code change, and it is the most common reason an in-browser Whisper demo that was fast on the developer's machine is slow in production.
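For concreteness, the two headers and their usual values can be written down as data, together with a small guard that decides how many threads to request. The header names and the `same-origin` / `require-corp` values are the standard ones for cross-origin isolation. The helper names and the cap of four threads are hypothetical. In the browser, the `isolated` flag would come from the global `crossOriginIsolated` property and the core count from `navigator.hardwareConcurrency`.

```typescript
/** The two response headers that turn on cross-origin isolation. */
const ISOLATION_HEADERS: Record<string, string> = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

/** Merge the isolation headers into whatever headers a response already has. */
function withIsolationHeaders(
  headers: Record<string, string>,
): Record<string, string> {
  return { ...headers, ...ISOLATION_HEADERS };
}

/**
 * Decide how many WebAssembly threads to ask for.
 * Without isolation, shared memory is unavailable, so only one thread is possible.
 * The default cap of 4 is an assumed value for this sketch, not a recommendation.
 */
function wasmThreadCount(isolated: boolean, cores: number, cap = 4): number {
  if (!isolated) return 1;
  return Math.max(1, Math.min(cores, cap));
}
```

Logging the result of `wasmThreadCount` at startup is a cheap way to catch the silent single-core fallback in production. Note that `require-corp` also means cross-origin resources the page embeds must explicitly opt in to being loaded, so third-party scripts and images are worth checking when you switch the headers on.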

A two-column comparison on white titled "Two Ways To Run The Model: WebAssembly vs WebGPU". Left column "WebAssembly (whisper.cpp)" in blue: runs on the CPU; speed boosters "SIMD + threads"; "needs cross-origin isolation (COOP + COEP headers) for threads"; "works almost everywhere — the dependable fallback". Right column "WebGPU (Transformers.js)" in purple: runs on the GPU; "up to ~100x faster than Wasm for the same work (Hugging Face)"; "default in all major browsers by late 2025"; "older / low-end devices may lack it". A bar beneath shows a short blue bar 'CPU / Wasm' and a long purple bar 'GPU / WebGPU' with caption "relative speed, illustrative". Footer: Fora Soft · www.forasoft.com. Figure 3. The CPU path (WebAssembly) is the dependable fallback that runs almost everywhere; the GPU path (WebGPU) is far faster but newer. The cross-origin isolation headers are the easy-to-miss requirement for the WebAssembly multi-core path.

What It Actually Costs: Download, Memory, And Battery

Client-side recognition has no per-minute cloud bill, and that is its headline advantage. But "free" is the wrong word, because the cost simply moves onto the user's device in three forms: download, memory, and battery. Naming the numbers keeps the decision honest.

The first cost is the model download. The model is a file the browser must fetch the first time, then it is cached for later visits. In whisper.cpp's format the sizes are concrete: the "tiny" model is about 75 megabytes, "base" about 142, and "small" about 466. Squeezing the model with a technique called quantization — storing its numbers with fewer bits to shrink the file at a small accuracy cost — brings those down to roughly 31, 57, and 182 megabytes. To put that in human terms, a quantized "base" model at 57 megabytes is a download in the range of a short video clip — fine on home wifi, noticeable on a phone with a weak signal, and a real consideration if your user is on a metered mobile plan.
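To turn those file sizes into waiting time, a rough estimate is enough. The sketch below uses the quantized sizes quoted above and treats a megabyte as one million bytes and a megabit as one million bits, which is an approximation. The link speeds in the usage note are hypothetical examples, not measurements.

```typescript
/** Quantized model sizes in megabytes, as quoted in the text. */
const QUANTIZED_SIZE_MB = { tiny: 31, base: 57, small: 182 } as const;

/** Idealized first-visit download time; ignores latency, overhead, and congestion. */
function downloadSeconds(sizeMB: number, linkMbps: number): number {
  return (sizeMB * 8) / linkMbps;
}
```

On an assumed 50 Mbit/s home connection, `downloadSeconds(QUANTIZED_SIZE_MB.base, 50)` comes to roughly 9 seconds; on an assumed 5 Mbit/s mobile link the same file takes about a minute and a half. That spread is why the first-run experience needs a progress indicator, and why the download should be cached rather than repeated.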

The second cost is memory. The model has to sit in the device's working memory while it runs. As a rough guide, the "tiny" model needs a few hundred megabytes of memory at runtime, and the larger sizes need proportionally more. On a modern laptop this is unremarkable; on a budget phone with several other apps open it can be the difference between smooth and stuttering.

The third cost is battery and heat, paid only while the model is actually running. This is exactly why the VAD gate from the pipeline matters so much: by transcribing only speech and skipping silence, you cut the model's running time, and therefore its battery draw, to a fraction. A dictation feature that runs the model only while the user holds a button costs almost nothing; an always-listening transcriber with no gate will warm a phone in the user's hand.

Here is the trade laid out as a single decision: a bigger model is more accurate but downloads slower, uses more memory, and drains more battery. For most live, on-device features the sweet spot is a "tiny" or "base" model, quantized, gated by VAD, with WebGPU used when available. That combination keeps the download under a hundred megabytes, the memory modest, and the battery cost tied to how much people actually speak.

A Numeric Example: Does The Device Keep Up With Live Speech?

The question that decides whether live, on-device transcription is even possible is simple to state: can the model transcribe a second of audio in less than a second? If yes, it keeps up with live speech; if no, it falls behind and the captions lag further with every sentence. Engineers measure this with the real-time factor — the time the model takes to process audio divided by the length of that audio. Below 1.0 means faster than real time; above 1.0 means it cannot keep up.

Work an example. Suppose on a given mid-range laptop the "base" Whisper model, running on the CPU through WebAssembly, takes 1.2 seconds to transcribe each 1 second of audio:

real-time factor = processing time ÷ audio length
real-time factor = 1.2 s ÷ 1.0 s
real-time factor = 1.2   → above 1.0, so it cannot keep up live

At 1.2 it falls behind — fine for transcribing a finished recording, too slow for live captions. Now switch the same model to the GPU through WebGPU and suppose it processes that same 1 second of audio in 0.25 seconds:

real-time factor = 0.25 s ÷ 1.0 s
real-time factor = 0.25   → well below 1.0, comfortably real time

At 0.25 it has headroom to spare. That gap — from too slow to comfortable — is the practical reason WebGPU changed what is possible in the browser. The exact numbers depend on the device, the model size, and the audio, so you must measure on the hardware your users actually have rather than trust a benchmark from a developer's high-end machine. But the shape of the answer is reliable: on the CPU, stick to the smallest models for live work; with the GPU, you can run a larger, more accurate model and still keep up.
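The same arithmetic is worth having as code, because it also shows how quickly a factor above 1.0 turns into visible lag. The sketch below assumes continuous speech and strictly sequential processing, which is a simplification; the function names are hypothetical.

```typescript
/** Real-time factor: processing time divided by audio length. */
function realTimeFactor(processingSeconds: number, audioSeconds: number): number {
  return processingSeconds / audioSeconds;
}

/**
 * How far behind the captions are after a stretch of continuous speech.
 * A factor at or below 1.0 never accumulates lag.
 */
function lagAfter(speechSeconds: number, rtf: number): number {
  return Math.max(0, speechSeconds * (rtf - 1));
}
```

With the example figures, `realTimeFactor(1.2, 1.0)` is 1.2, and `lagAfter(60, 1.2)` is about 12: after one minute of uninterrupted talking the captions trail the speaker by roughly twelve seconds. At a factor of 0.25 the lag stays at zero however long the speaker goes on.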

Whisper Was Not Built For Live — Which Is Why Moonshine Exists

There is a structural reason live transcription with Whisper is awkward, and a newer model that fixes it is worth knowing about. Whisper was designed to process audio in fixed thirty-second chunks. Feed it three seconds of speech and it still pads the input out to thirty seconds internally and does the work of a full window — so for short, live snippets it does a lot of wasted computation. For transcribing a podcast that is fine; for captioning a live conversation in small pieces it is inefficient.
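The waste from padding is easy to quantify. The sketch below is a simplification: it models Whisper as always processing whole thirty-second windows and ignores how long recordings are actually stitched together. The function names are hypothetical.

```typescript
/** Whisper's fixed input window, as described in the text. */
const WHISPER_WINDOW_SECONDS = 30;

/** Seconds of input the model effectively processes for a clip of the given length. */
function processedSeconds(clipSeconds: number): number {
  const windows = Math.max(1, Math.ceil(clipSeconds / WHISPER_WINDOW_SECONDS));
  return windows * WHISPER_WINDOW_SECONDS;
}

/** Share of the processed input that is padding rather than real audio. */
function paddingShare(clipSeconds: number): number {
  const processed = processedSeconds(clipSeconds);
  return (processed - clipSeconds) / processed;
}
```

For a three-second utterance, `processedSeconds(3)` is 30 and `paddingShare(3)` is 0.9: nine tenths of the window is padding. For a full thirty-second clip the share drops to zero, which is why the cost shows up in live, short-snippet use and not in batch transcription.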

A 2024 model called Moonshine, from a company named Useful Sensors, was built specifically to fix this for live use. Instead of always working in thirty-second windows, Moonshine scales its processing to the actual length of the audio it is given, so three seconds of speech costs work in proportion to three seconds, not thirty. Its makers report about a fivefold reduction in compute compared with the Whisper "tiny" model on a ten-second clip, with no increase in word error rate, and it is small enough to run on phones and even a Raspberry Pi. For a product whose whole job is live, on-device transcription of short utterances (voice commands, live captions, dictation), Moonshine is increasingly the better-fitted engine, and it runs through the same browser machinery as Whisper.

The takeaway is not "always use Moonshine." Whisper remains the more battle-tested, more broadly supported choice, with wider language coverage and a larger ecosystem. The point is that the model is a swappable part of the pipeline in Figure 2, and matching the model to the job — Whisper for general transcription, Moonshine for tight live loops — is part of doing this well.

Client Or Server? The Decision That Actually Matters

All of the above serves one product decision: should the speech recognition run on the device, or should you send the audio to a server and transcribe it there? Both are valid; the right answer depends on the job, and the two articles in this pair argue the two sides. Here is the honest comparison.

| Criterion | Client-side ASR (in the browser) | Server-side ASR (send audio up) |
| --- | --- | --- |
| Audio privacy | Strong: audio never leaves the device | Weaker: audio reaches your server or a vendor |
| Works offline | Yes: no connection needed after model download | No: needs a live connection |
| Per-minute cloud cost | None: runs on the user's hardware | Yes: billed per minute of audio |
| Upfront cost to the user | Model download (31–182 MB) + battery | None: device only sends audio |
| Accuracy ceiling | Bounded by what a small model runs live on-device | High: can run the largest models on a GPU server |
| Works on a weak phone | Sometimes: limited by memory and CPU/GPU | Yes: device only records and uploads |
| Consistency across users | Varies by device capability | Identical: one engine for everyone |
| Best fit | 1:1 calls, dictation, offline, privacy-locked, kiosks | Group calls, recording, search, highest accuracy |

Read it by your product. If you are building a privacy-first dictation app, a field tool that must work with no signal, a kiosk, or a one-to-one call where audio should never touch a server, the client side is the right home for recognition, and the costs — model download, battery, a smaller model's lower accuracy — are worth paying. If you are building group meetings you also record, search, and translate, or you need the accuracy of the largest models, the server is the right home, and our server-side fan-out article is the pattern to follow. Many mature products do both: run a small model on the client for instant, private feedback, and send audio to a server when the user opts into higher-accuracy features. The pipeline in this article is the client half of that design.

A two-panel decision diagram on white titled "Where Should Recognition Run?". Left panel green, "Run on the CLIENT when:" with rows: "audio must stay private", "must work offline", "1:1 call or single user", "no per-minute budget", "kiosk / field device". Right panel blue, "Run on the SERVER when:" with rows: "group call (many speakers)", "you record / search / translate", "you need top accuracy", "weak or varied user devices", "one consistent engine for all". A center note: "many products do both — small model on the client, big model on the server." Footer: Fora Soft · www.forasoft.com. Figure 4. The fork this whole article serves. Client-side recognition wins on privacy, offline, and zero per-minute cost; server-side wins on accuracy, consistency, and weak devices. Mature products often run both.

A Common Mistake: Shipping The Biggest Model You Can Find

The most frequent and most damaging error in client-side ASR is choosing the model by accuracy alone and reaching for a large one, because it scores best in a quiet test. It works beautifully on the developer's fast laptop with a finished audio file, and then it ships.

On a real user's mid-range phone the large model is a disaster in three ways at once. It is a huge download (hundreds of megabytes to more than a gigabyte) that many users will abandon before it finishes. It demands more memory than a budget device can spare, so it stutters or crashes. And it runs slower than real time on a CPU, so the live captions it was supposed to power fall further behind with every sentence. The feature that demoed perfectly is unusable for the people you built it for.

The fix is to size the model to the worst device you intend to support, not the best, and to recover accuracy through the pipeline rather than raw model size. Use a small, quantized model; gate it with VAD so it only runs on speech; prefer WebGPU so the GPU does the work; clean the incoming audio with noise suppression so the small model hears words clearly; and offer a server-side path for users who explicitly want maximum accuracy. The discipline is the same one we apply across small on-device models: the right model is the largest one that comfortably runs on your floor device, not the most accurate one in a lab.
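The sizing rule can be expressed as a filter over measured candidates. The sketch below is one possible way to encode it: every type, field, and function name is hypothetical, download size stands in as a rough proxy for accuracy, and the memory and real-time-factor figures in the usage note are placeholders you would replace with measurements from your own floor device.

```typescript
interface Candidate {
  name: string;
  downloadMB: number;      // file size the user must fetch
  runtimeMemoryMB: number; // measured on the floor device
  measuredRtf: number;     // real-time factor measured on the floor device
}

interface FloorDevice {
  maxDownloadMB: number; // largest download you are willing to impose
  freeMemoryMB: number;  // memory you can count on being available
  maxRtf: number;        // highest real-time factor that still feels live
}

/** Largest candidate that fits every limit of the floor device, if any. */
function pickModel(
  candidates: Candidate[],
  floor: FloorDevice,
): Candidate | undefined {
  return candidates
    .filter(
      (c) =>
        c.downloadMB <= floor.maxDownloadMB &&
        c.runtimeMemoryMB <= floor.freeMemoryMB &&
        c.measuredRtf <= floor.maxRtf,
    )
    .sort((a, b) => b.downloadMB - a.downloadMB)[0];
}

// Download sizes are the quantized figures from this article;
// memory and RTF values are made-up placeholders, not benchmarks.
const exampleCandidates: Candidate[] = [
  { name: "tiny", downloadMB: 31, runtimeMemoryMB: 300, measuredRtf: 0.3 },
  { name: "base", downloadMB: 57, runtimeMemoryMB: 450, measuredRtf: 0.6 },
  { name: "small", downloadMB: 182, runtimeMemoryMB: 900, measuredRtf: 1.4 },
];
```

With a floor of `{ maxDownloadMB: 100, freeMemoryMB: 600, maxRtf: 0.8 }`, `pickModel(exampleCandidates, floor)` selects "base": "small" is excluded on all three limits, and "base" is the larger of the two that remain. If nothing fits, the function returns `undefined`, which is the signal to route that user to the server-side path.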

Where Fora Soft Fits In

We build the live-video and voice products where this client-versus-server choice comes up constantly — video conferencing, telemedicine consultations, e-learning, and field and kiosk applications — and we make the choice per feature rather than once for the whole product. When a customer needs audio to stay on the device for privacy or to keep working without a connection, we run recognition in the browser with the pipeline this article describes: capture at 16 kilohertz, gate with voice-activity detection, run a small quantized model on WebGPU with a WebAssembly fallback, and serve the cross-origin isolation headers so the multi-core path actually engages. When the job is a recorded group meeting that also needs search and translation, we transcribe on the server instead. Because we work in healthcare and education, where privacy and offline operation are often hard requirements rather than nice extras, we treat client-side recognition as a first-class option and size the model to the real devices our customers' users carry.

Talk To Us / See Our Work / Download

  • Talk to a video engineer about on-device or hybrid speech recognition in your product → /services/webrtc-development
  • See our case studies in conferencing, telemedicine, and e-learning → /cases
  • Download the Client-Side ASR Engineering Cheat Sheet (one page, printable) → Download the cheat sheet

References

  1. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision (Whisper), OpenAI, arXiv:2212.04356, December 2022, accessed 2026-06-02. https://arxiv.org/abs/2212.04356. Primary algorithmic source for the Whisper model: encoder-decoder Transformer trained on 680,000 hours of weakly supervised audio, the tiny→large parameter ladder (39M to 1.5B), and the fixed 30-second processing window that motivates Moonshine. The paper is itself the primary source for Whisper's design; no vendor blog overrides it.
  2. W3C GPU for the Web Working Group. WebGPU, W3C Candidate Recommendation, accessed 2026-06-02. https://www.w3.org/TR/webgpu/. Primary standards source for the GPU-compute API used by Transformers.js to accelerate the model. Defines WebGPU as the successor to WebGL with general-purpose compute. Browser ship status (Chrome/Edge 113, Safari/Firefox through 2025) cited from the W3C implementation status wiki and web.dev, below.
  3. W3C WebAssembly Working Group. WebAssembly Core Specification, Level 2.0, W3C Recommendation, accessed 2026-06-02. https://www.w3.org/TR/wasm-core-2/. Primary standards source for WebAssembly — the compiled-code format that whisper.cpp targets — including the SIMD vector instructions and the shared-memory model that threads depend on.
  4. WICG / Web Speech API Community Group. Web Speech API, W3C Community Group Draft Report, accessed 2026-06-02. https://webaudio.github.io/web-speech-api/. Primary standards source for the built-in SpeechRecognition interface, including the local-processing controls added to the draft. Used for the "why not use the built-in API" section; paired with MDN below for the Chrome-sends-to-Google default behaviour.
  5. MDN Web Docs (Mozilla). Using the Web Speech API and SpeechRecognition, accessed 2026-06-02. https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API/Using_the_Web_Speech_API. Reference source documenting that Chrome's default Web Speech recognition is server-based (audio sent to a Google web service, requires connectivity) and that SpeechRecognition support is limited to Chromium-based browsers. Browser-behaviour reference, not a codec claim.
  6. web.dev (Google / Chrome). Making your website "cross-origin isolated" using COOP and COEP and WebGPU is now supported in major browsers, accessed 2026-06-02. https://web.dev/articles/coop-coep. First-party engineering source for the cross-origin isolation requirement (Cross-Origin-Opener-Policy + Cross-Origin-Embedder-Policy enabling SharedArrayBuffer and WebAssembly threads) and for the late-2025 cross-browser WebGPU ship status. Derived from the HTML and Fetch living standards.
  7. ggml-org. whisper.cpp — Port of OpenAI's Whisper model in C/C++ (repository, models/README.md, and WASM examples), accessed 2026-06-02. https://github.com/ggml-org/whisper.cpp. Reference-implementation source for the WebAssembly browser engine: the ggml model format and sizes (tiny 75 MB, base 142 MB, small 466 MB unquantized; 31/57/182 MB at Q5_1), the WASM-SIMD requirement, and the Firefox 256 MB file-load limit.
  8. Hugging Face (Joshua Lochner / Xenova). Transformers.js v3: WebGPU Support, New Models & Tasks, and More…, 22 October 2024, accessed 2026-06-02. https://huggingface.co/blog/transformersjs-v3. First-party source for the WebGPU browser engine: the "up to 100x faster than WASM" measurement, the device: 'webgpu' Whisper example, per-module quantization (the Whisper encoder's sensitivity to quantization), and ONNX Runtime Web as the execution backend. The v4 release (9 February 2026) is noted as the current line.
  9. Jeffries, N., et al. (Useful Sensors). Moonshine: Speech Recognition for Live Transcription and Voice Commands, arXiv:2410.15608, October 2024, accessed 2026-06-02. https://arxiv.org/abs/2410.15608. Primary source for the Moonshine model: variable-length processing (no fixed 30-second padding), ~5x compute reduction versus Whisper tiny.en on a 10-second segment with no WER increase, parameter counts (tiny 27.1M, base 61.5M), and the live-transcription / voice-command design target.
  10. ricky0123. Voice Activity Detection for the browser (@ricky0123/vad-web) and snakers4. Silero VAD, accessed 2026-06-02. https://github.com/ricky0123/vad. First-party source for the in-browser VAD gate: Silero VAD run through ONNX Runtime Web, producing a 0–1 speech probability per audio frame with speech-start/speech-end callbacks — the gate that keeps the model from transcribing silence.
  11. SYSTRAN. faster-whisper (repository), accessed 2026-06-02. https://github.com/SYSTRAN/faster-whisper. First-party source establishing that faster-whisper is a CTranslate2-based reimplementation of Whisper for server/desktop CPU and GPU inference — not a browser/WebAssembly build — the core clarification of the "untangle the name" section.
  12. MDN Web Docs (Mozilla). MediaDevices.getUserMedia() and AudioWorklet, accessed 2026-06-02. https://developer.mozilla.org/en-US/docs/Web/API/MediaDevices/getUserMedia. Reference source for the audio-capture front of the pipeline: requesting the microphone stream with getUserMedia and resampling/downmixing to 16 kHz mono in an audio worklet without blocking the main thread.