
Key takeaways
• Voice command in meetings is now agentic, not push-to-talk. The 2026 stack is wake-word on-device, streaming STT in the cloud, an LLM-style intent layer, and instant action dispatch — not a single “Hey Zoom” button.
• End-to-end latency under 500 ms is the bar. Anything slower and users abandon the feature within a week. Most failures happen in the STT stage, not the model layer.
• Twelve features cover 90% of real demand. Self mute / unmute, raise hand, mute all, recording start/stop, summary on demand, translate, schedule follow-up, set timer, action items, screen share, noise suppression toggle, “catch me up.”
• Use a hybrid SDK stack — not a single vendor. Picovoice or Vosk for wake word on-device, Deepgram or AssemblyAI for STT, OpenAI Realtime or LiveKit Agents for the agent loop, your meeting platform for actions.
• Privacy is the silent feature. On-device wake words, opt-in recording with explicit disclosure to all participants, GDPR/HIPAA-grade data residency — or you don’t close the enterprise deal.
Why Fora Soft wrote this guide
Fora Soft has shipped video conferencing products and AI integrations since the WebRTC era began. Real-time voice and video are our default surface area — with platforms like TransLinguist (live multilingual interpretation), Meetric (AI sales-call assistant), VOLO (real-time translation infrastructure) and Nucleus (on-premise comms for regulated industries).
This guide is for the founder, head of product or platform engineer adding a voice-command layer to a meeting product — or evaluating which third-party assistant to embed. We’ll skip vendor marketing and focus on what wins or kills the feature in production: latency, accuracy on real accents and noise, and the privacy posture that gets you through procurement.
Adding voice command to a meeting product?
30 minutes with a Fora Soft architect — we’ll size the latency budget, pick the SDK stack and outline the privacy story for your customer.
The 2026 voice-command stack, in five shifts
1. Always-listening replaced push-to-talk. A small on-device model watches for a wake word; only after detection does audio leave the device. Press-and-hold buttons feel old.
2. Streaming STT got fast enough. Deepgram, AssemblyAI Real-Time and OpenAI Realtime all clear the <500 ms partial-transcript bar in 2026, which is the threshold for usable voice control.
3. The intent layer became an LLM agent. Instead of a regex parser mapping “mute” to a function, an LLM with function-calling and meeting context handles ambiguity (“mute everyone except Carla”) reliably. A tool-schema sketch follows this list.
4. Voice agents are conversational, not one-shot. “Catch me up on the last three minutes — what did engineering decide?” works because the agent holds the running transcript and meeting state.
5. Privacy got non-negotiable. SOC 2, GDPR data residency, recording disclosure to all participants, and on-device wake-word detection are now must-haves. Vendors without them lose enterprise.
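To make shift 3 concrete, here is a minimal sketch of the function-calling contract an intent layer can expose. The tool names and schema shape are illustrative assumptions, not any vendor's exact API, though OpenAI-style function calling accepts schemas of this general shape.

```typescript
// Illustrative tool schema for the LLM intent layer (names are hypothetical).
// The agent receives the utterance plus meeting context and returns a tool call.
const meetingTools = [
  {
    name: "mute_participants",
    description: "Mute one or more participants by name, or everyone.",
    parameters: {
      type: "object",
      properties: {
        targets: {
          type: "array",
          items: { type: "string" },
          description: 'Participant names, or ["*"] for everyone.',
        },
        except: {
          type: "array",
          items: { type: "string" },
          description: "Participants to exclude (“mute everyone except Carla”).",
        },
      },
      required: ["targets"],
    },
  },
  {
    name: "summarize_window",
    description: "Summarise the last N minutes of the running transcript.",
    parameters: {
      type: "object",
      properties: { minutes: { type: "number" } },
      required: ["minutes"],
    },
  },
];
```

Given this schema, “mute everyone except Carla” resolves to `mute_participants({ targets: ["*"], except: ["Carla"] })`, the kind of ambiguity a regex parser cannot express.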
The twelve voice features that actually get used
Most products ship 30 commands and 5 get used. These are the ones that survive in our customer telemetry.
| # | Feature | Why it matters | Latency target |
|---|---|---|---|
| 1 | Mute / unmute self | Most-used; accessibility default | <300 ms |
| 2 | Mute all / unmute all | Host-only; large meetings | <400 ms |
| 3 | Raise / lower hand | Replaces fumbling for UI | <400 ms |
| 4 | Start / stop recording | Compliance-critical; consent banner triggers | <500 ms |
| 5 | Share / stop sharing screen | Hands-on demos, fewer misclicks | <700 ms |
| 6 | Summarise the last N minutes | Catch up after distraction | 2–4 s acceptable |
| 7 | Generate action items | Post-meeting workflow trigger | 2–5 s acceptable |
| 8 | Translate captions | Multilingual meetings | <1.2 s |
| 9 | Set timer / time-box | Standups, agile rituals | <500 ms |
| 10 | Schedule follow-up | Calendar integration; closes the loop | 1–2 s acceptable |
| 11 | Toggle noise suppression | On-the-fly audio cleanup | <500 ms |
| 12 | Switch camera / device | Multi-device hosts | <700 ms |
Reach for voice features 1–5 in your MVP: they cover 80% of usage, share the same wake-word + STT plumbing, and don’t need an LLM agent on the critical path.
The reference architecture — four layers, where each runs
A voice-command pipeline has four well-defined layers. Where each one runs determines both the privacy posture and the latency budget.
Layer 1 — Wake-word detector. On-device. Picovoice Porcupine, Vosk, or a custom small model. Watches the mic for a short trigger word (“Hey Meet”) and only when it fires does anything else activate. Critical for privacy: no audio leaves the device until intent.
Layer 2 — Streaming STT. Cloud (or on-prem if regulated). Deepgram or AssemblyAI for English-heavy markets, Google or Azure for broadest language coverage. Stream partial transcripts back as audio arrives; aim for <200 ms first-partial latency. More on STT for live streaming.
Layer 3 — Intent / command parser. Hybrid. Simple commands (“mute me”) hit a deterministic regex/keyword router on the client — near-zero latency. Complex utterances (“summarise the last three minutes”) escalate to an LLM agent (OpenAI Realtime, Claude, LiveKit Agents). A router sketch follows the layer list.
Layer 4 — Action dispatch. Local or your meeting backend. Mute, raise hand, recording, share screen are local UI calls; summarise, schedule, action items hit your meeting backend, agent or third-party APIs (Calendar, Slack).
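A minimal sketch of the Layer 3 split. The deterministic fast path is plain TypeScript and worth copying as-is; the `kind: "agent"` branch hands off to whichever agent framework you pick (the handoff itself is assumed, not shown).

```typescript
type Command =
  | { kind: "local"; action: "muteSelf" | "unmuteSelf" | "raiseHand" | "lowerHand" }
  | { kind: "agent"; utterance: string };

// Deterministic fast path: match the top commands with near-zero added latency.
const LOCAL_ROUTES: Array<[RegExp, Command]> = [
  [/\bmute\s+(me|myself)\b/i, { kind: "local", action: "muteSelf" }],
  [/\bunmute\s+(me|myself)\b/i, { kind: "local", action: "unmuteSelf" }],
  [/\braise\s+(my\s+)?hand\b/i, { kind: "local", action: "raiseHand" }],
  [/\blower\s+(my\s+)?hand\b/i, { kind: "local", action: "lowerHand" }],
];

function routeUtterance(utterance: string): Command {
  for (const [pattern, command] of LOCAL_ROUTES) {
    if (pattern.test(utterance)) return command; // stays on the client, ~0 ms
  }
  // Everything else escalates to the LLM agent (300–800 ms budget).
  return { kind: "agent", utterance };
}
```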
For a deeper dive, see how AI agents plug into WebRTC pipelines.
The latency budget that keeps the feature alive
Users abandon voice commands above ~500 ms end-to-end. The table below breaks down the budget; an instrumentation sketch follows it.
| Stage | Budget (p95) | What burns it |
|---|---|---|
| Wake-word detection | <50 ms | Mic buffering, weak on-device CPU |
| Audio capture & buffering | <50 ms | High frame size, OS scheduling |
| Streaming STT first partial | <200 ms | Cloud region, batch size, codec |
| Intent decision | <100 ms (regex) / 300–800 ms (LLM) | Model size, prompt length, function-calling round trips |
| Action dispatch | <100 ms | Server round-trip, slow UI updates |
| Total (simple command) | <500 ms | All of the above stacked |
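To keep that budget honest in production, timestamp each stage against a monotonic clock and ship the deltas to your metrics pipeline. A minimal sketch: stage names mirror the table, and the `emit` sink is a placeholder for your own telemetry.

```typescript
type Stage = "wake" | "capture" | "sttFirstPartial" | "intent" | "dispatch";

const STAGE_ORDER: Stage[] = ["wake", "capture", "sttFirstPartial", "intent", "dispatch"];

class CommandTrace {
  private marks: Partial<Record<Stage, number>> = {};
  private readonly start = performance.now();

  mark(stage: Stage): void {
    this.marks[stage] = performance.now() - this.start;
  }

  // Emit per-stage durations plus the end-to-end total; alert on p95 > 500 ms.
  finish(emit: (metric: string, ms: number) => void): void {
    let previous = 0;
    for (const stage of STAGE_ORDER) {
      const at = this.marks[stage];
      if (at === undefined) continue;
      emit(`voice.${stage}.ms`, at - previous);
      previous = at;
    }
    emit("voice.total.ms", previous);
  }
}
```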
The eight SDKs we actually build with in 2026
Pricing is sketched at 2026 list rates and shifts often — verify before procurement. Latency tiers are stable.
Deepgram — the latency king for English-heavy STT
Real-time streaming STT with first-partial latency consistently under 200 ms. Excellent punctuation, capitalization and custom vocabulary — useful for meeting jargon. Pay-as-you-go on streaming minutes; SDK / on-prem available for regulated workloads. Best default for the STT layer in a meeting product.
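A minimal streaming sketch against the Deepgram JavaScript SDK (v3-era API; model names and options shift, so verify against the current docs before building on it).

```typescript
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

// Open a live connection; interim_results streams partials as audio arrives.
const connection = deepgram.listen.live({
  model: "nova-2",       // model naming changes over time; check current docs
  interim_results: true, // partial transcripts are what keep latency low
  punctuate: true,
});

connection.on(LiveTranscriptionEvents.Transcript, (event) => {
  const text = event.channel.alternatives[0]?.transcript ?? "";
  if (text) handlePartial(text, event.is_final); // feed the intent router
});

// Pipe raw audio frames from your capture pipeline into the socket.
function onAudioFrame(frame: Buffer): void {
  connection.send(frame);
}

// Placeholder for your intent-routing layer.
declare function handlePartial(text: string, isFinal: boolean): void;
```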
AssemblyAI — STT plus speaker, sentiment, entities
Strong on speaker diarisation, sentiment, entity extraction and 100+ languages. Realtime tier hits sub-500 ms. Use when you want a single API for transcription plus downstream meeting analytics, not just commands.
OpenAI Realtime API — voice agent in a single socket
Bidirectional voice with sub-1 s round-trip and built-in function-calling. Reach for it when the agent is conversational (“catch me up,” “summarise then schedule a follow-up”). Pricier per minute; the right tool for premium tiers, not for “mute me.”
Google Cloud Speech-to-Text — broadest language coverage
120+ languages and dialects, mature data-residency options, deep Google Workspace integration. The pragmatic choice if your audience is global and English-only isn’t enough. Latency varies by region.
Azure AI Speech — the enterprise & healthcare default
HIPAA-eligible, FedRAMP options, native Teams ecosystem hooks. If your buyer is a US health system or a Microsoft-365 enterprise, Azure Speech is the path of least resistance. Latency is competitive on Azure-region traffic.
Picovoice (Porcupine + Cobra) — on-device wake word done right
The privacy story you can show procurement. Wake-word detection runs entirely on-device; no audio reaches the cloud until the trigger fires. Custom wake-word training with a few audio samples. SDK pricing scales with users; plan it into unit economics.
Reach for Picovoice when: your buyer scrutinises data flow on every meeting (legal, healthcare, finance), or your product runs in low-bandwidth environments where you can’t afford to stream every utterance.
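A minimal wake-word sketch with the Porcupine Node SDK, with signatures as we last used them, so verify against current docs; the capture code feeding `onPcmFrame` is up to your audio stack.

```typescript
import { Porcupine, BuiltinKeyword } from "@picovoice/porcupine-node";

// Built-in keyword for the sketch; in production you would train a custom
// "Hey Meet" .ppn model and pass its file path instead.
const porcupine = new Porcupine(
  process.env.PICOVOICE_ACCESS_KEY!,
  [BuiltinKeyword.PORCUPINE],
  [0.6], // sensitivity 0–1: higher catches more, but false-triggers more
);

// Feed 16 kHz, 16-bit PCM frames of exactly porcupine.frameLength samples.
function onPcmFrame(frame: Int16Array): void {
  if (porcupine.process(frame) >= 0) {
    // Trigger fired: only now does audio start streaming to cloud STT.
    startSttStream();
  }
}

// Placeholder for opening the cloud STT connection.
declare function startSttStream(): void;
```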
LiveKit Agents — the open-source voice agent framework
Plugs voice agents into a WebRTC SFU you may already be running. Open source, hostable on your own infra, integrates with any STT/LLM/TTS stack you choose. Reach for it when you need control over latency, model choice, and privacy. See our LiveKit multimodal-agent build guide.
Reach for LiveKit Agents when: you already run a custom WebRTC stack (or are moving off Agora to a self-managed SFU), and total control of model choice and data path is worth the operational burden.
Krisp — noise suppression and lightweight voice control
Best-in-class background-noise removal, plus increasingly capable on-device meeting transcription and basic voice features. Often paired with one of the big STT vendors rather than replacing them.
Five ready-made AI meeting assistants worth integrating
If your goal is “don’t reinvent transcription and summary,” these are the assistants we see most often. None of them controls your meeting UI directly; they generate transcripts, summaries and action items asynchronously.
Otter.ai. The mainstream choice for sales and operations teams. Strong web app, decent integrations, free tier exists. Weakness: speaker accuracy on noisy calls.
Fireflies.ai. CRM-flavoured. Excellent integrations with HubSpot, Salesforce, Slack. Pulls action items into the right places automatically.
Read.ai. Adds engagement metrics — sentiment, talk time, attentiveness. Useful for sales coaching and meeting hygiene.
Fathom. Free tier is genuinely usable for individuals, simple UX, fast adoption.
Avoma. The enterprise pick when you want playbooks, scorecards and revenue-team workflows on top of transcripts.
The comparison matrix — latency, languages, fit
| Tool | Layer | Latency tier | Languages | Privacy posture | Best fit |
|---|---|---|---|---|---|
| Deepgram | STT | <200 ms | ~40 | Cloud + on-prem | English-heavy realtime |
| AssemblyAI | STT + analytics | <500 ms realtime | 100+ | Cloud (GDPR DPAs) | Multilingual transcripts |
| OpenAI Realtime | Voice agent | ~1 s round-trip | Multi | Cloud (zero-retention option) | Conversational meeting agent |
| Google STT | STT | ~500 ms | 120+ | Cloud, regional | Global, multilingual |
| Azure AI Speech | STT + TTS | ~500 ms | 100+ | HIPAA, FedRAMP | Healthcare, M365 stacks |
| Picovoice | Wake word | <50 ms on-device | Multi (custom) | 100% on-device | Privacy-first products |
| LiveKit Agents | Agent framework | Depends on stack | Multi | Self-host capable | Custom WebRTC builds |
| Krisp | Noise + light voice | On-device | Multi | On-device, GDPR | Audio cleanup add-on |
Need a voice-command stack picked for your product?
We’ll match latency budget, language coverage and privacy posture to your buyer in a 30-min architecture review.
Privacy and compliance — the silent feature
Voice features fail procurement long before they fail user testing. These five rules cover the highest-risk ground.
1. Wake word stays on-device. No audio leaves the participant’s machine until trigger detection. State this in the privacy policy explicitly.
2. Recording requires unambiguous opt-in. A consent banner that names the action (“This meeting is being recorded”), shows in every participant’s client, and logs the consent. GDPR Article 6, HIPAA, and most US state two-party consent laws require this. A consent-gate sketch follows this list.
3. Data residency matters. EU data through EU regions, KSA data through KSA, healthcare through HIPAA-eligible regions. Pick STT and LLM providers that document residency.
4. Zero-retention options on day one. Major STT/LLM vendors (OpenAI, Anthropic, Deepgram) all offer zero-retention modes; turn them on. Don’t train on user data unless you have explicit consent.
5. SOC 2 + GDPR DPA + HIPAA BAA where relevant. If you sell to enterprises, plan for SOC 2 Type II within 12 months. More on the NFR side here.
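A minimal sketch of rule 2's consent gate, assuming your client already shows the banner and reports acceptance; the class and storage here are illustrative, not any vendor's API.

```typescript
interface Participant {
  id: string;
  name: string;
}

// Gate recording (and any cloud audio egress) on logged consent from
// every participant in the meeting.
class ConsentGate {
  private consents = new Map<string, Date>();

  constructor(private participants: Participant[]) {}

  recordConsent(participantId: string): void {
    this.consents.set(participantId, new Date()); // persist this log durably
  }

  get allConsented(): boolean {
    return this.participants.every((p) => this.consents.has(p.id));
  }
}

async function startRecording(gate: ConsentGate): Promise<void> {
  if (!gate.allConsented) {
    throw new Error("Recording blocked: not all participants have consented");
  }
  await beginCapture(); // every client showed the banner and logged acceptance
}

// Placeholder for your platform's recording call.
declare function beginCapture(): Promise<void>;
```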
Build vs buy — what shipping a credible voice layer actually takes
If your meeting platform already has WebRTC and a backend, a credible voice-command MVP — wake word, streaming STT, deterministic command parser for the top 5–7 commands, opt-in recording — ships in 6–10 weeks with a small senior team using our Agent-Engineering workflow.
Adding the LLM agent layer — conversational summaries, “catch me up,” multi-turn requests — adds another 4–8 weeks. Add 2–3 weeks per regulated framework (HIPAA, PCI, FedRAMP).
Buy when your differentiator is the meeting itself, not the assistant. Embedding an AI call assistant via API can take less than a sprint. Build when voice is the product (a voice-first sales tool, a multilingual interpretation platform like TransLinguist) or when regulated data prevents third-party SaaS.
Five pitfalls that kill voice-command features
1. False triggers. The wake word fires during normal speech, the command parser dispatches an action, the user is muted at the wrong moment. Mitigation: a two-stage trigger (wake word + intent confidence), short audio cooldown, and visible confirmation before destructive actions. A dispatch-gate sketch follows this list.
2. Accent and language drift. Models trained on US English fail on Indian, Nigerian or Singaporean accents. Test with the cohort that will actually use the product, not the team who built it.
3. Multi-participant collisions. Two people say “mute me” at the same time on different streams. Mitigation: per-stream wake-word, server-side de-duplication, and clear per-user feedback (“You were muted”).
4. Privacy gotchas. Cloud STT receives raw audio before consent is obtained. Recording starts before all participants’ banners settle. Each is a regulator-letter waiting to happen.
5. Latency creep. A demo at 300 ms degrades to 1.2 s at scale. Set up production SLOs from day one and alert on p95 breaches.
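A minimal dispatch gate covering pitfalls 1 and 3 together: a per-user cooldown de-duplicates colliding commands, and destructive actions wait on explicit confirmation. The `confirm` and `execute` hooks are assumptions standing in for your UI and meeting backend.

```typescript
type Action = { name: string; userId: string; destructive: boolean };

const COOLDOWN_MS = 2_000;
const lastDispatch = new Map<string, number>(); // keyed by userId + action

async function dispatch(
  action: Action,
  confirm: (a: Action) => Promise<boolean>,
): Promise<void> {
  const key = `${action.userId}:${action.name}`;
  const now = Date.now();

  // De-duplicate: drop repeats of the same command inside the cooldown.
  if (now - (lastDispatch.get(key) ?? 0) < COOLDOWN_MS) return;
  lastDispatch.set(key, now);

  // Destructive actions (mute all, stop recording) need visible confirmation.
  if (action.destructive && !(await confirm(action))) return;

  await execute(action); // then surface per-user feedback: "You were muted"
}

// Placeholder for the platform call that performs the action.
declare function execute(action: Action): Promise<void>;
```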
KPIs that prove a voice-command feature is working
Quality KPIs. Command recognition accuracy across the top 7 commands (target >95% on representative data); false-trigger rate (target <1 per hour of meeting); accent equity gap (US vs non-US accent accuracy delta <5 points).
Latency KPIs. p95 wake-to-action latency (target <500 ms simple, <2 s complex); STT first-partial p95; agent function-call round-trip p95.
Adoption KPIs. Weekly active users of voice features as % of meeting users; commands per active user per session; retention of voice-feature users at 30 days; share of meetings with at least one voice command.
Mini case — voice command for a vertical meeting platform
A client running a vertical-SaaS meeting platform for clinical consults asked us to add voice control: hands-free mute, recording start with HIPAA-grade consent, and an in-meeting “summarise the last five minutes” that doctors could trigger between patient blocks.
We shipped in eight weeks: Picovoice for on-device wake word, Azure AI Speech (HIPAA-eligible) for streaming STT in their Azure tenant, a regex command parser for the four core actions, and a Claude agent for the summary path with PHI redaction before egress. Latency landed at 380 ms p95 for simple commands and 2.4 s for the summary — both inside the budget.
Outcome four months in: 41% of clinicians use voice commands at least weekly, false-trigger rate sits at ~0.6/hour, zero recording-consent complaints, and the SOC 2 Type II audit closed without findings on the voice path. The lesson: the wake word and the privacy story are the product. The STT and LLM are just plumbing.
A five-question decision framework
Q1. Is voice the differentiator, or a feature? Differentiator → build. Feature → integrate.
Q2. What languages and accents must work on day one? English-only → Deepgram. Multilingual → AssemblyAI, Google or Azure.
Q3. What’s your regulated data exposure? HIPAA, PCI or FedRAMP → Azure or self-hosted; not OpenAI Realtime by default.
Q4. Are commands one-shot or conversational? One-shot → deterministic parser. Conversational → LLM agent (OpenAI Realtime, LiveKit Agents, Claude).
Q5. Where will the wake word run? Always on-device unless you can prove a regulator-grade reason it cannot. The privacy story is too valuable to surrender.
Reach for OpenAI Realtime when: the agent must hold a multi-turn conversation, a <1 s round-trip latency budget is acceptable, and your data is not subject to HIPAA/FedRAMP. Otherwise reach for a hybrid Deepgram + LiveKit Agents stack.
When voice command is the wrong feature to ship
If your meeting product is for noisy environments (warehouses, contact centres without headsets), accent diversity is high, and false triggers carry a real cost (mistakenly muting a customer mid-call), voice command is a bad first feature. Ship richer keyboard shortcuts and gesture controls first.
Likewise, if your buyer is a regulated industry with no HIPAA/FedRAMP-eligible STT in the languages you need, the work to make it compliant may exceed the value — review with your customer before committing.
Ready to ship voice commands without surprises?
We run focused 6–10 week engagements that ship a privacy-first voice layer with the latency, accuracy and compliance story you can hand procurement.
FAQ
What are voice command tools for virtual meetings?
Software that lets users control a meeting (mute, raise hand, record, summarise, translate) and trigger actions hands-free using spoken commands. In 2026 the stack is wake-word on-device, streaming STT in the cloud, an LLM-style intent parser, and instant action dispatch.
Can voice commands work without sending audio to the cloud?
Wake-word detection can and should run entirely on-device (Picovoice, Vosk). Streaming STT for full commands typically still needs the cloud or an on-prem deployment for accuracy — though small on-device models are catching up. The sweet spot is on-device wake word + cloud STT with zero-retention.
What’s the latency budget for usable voice control?
Under 500 ms p95 end-to-end for simple commands (mute, raise hand). Under 2 s for complex requests (“summarise the last five minutes”). Above those thresholds users abandon the feature within days.
Is wake-word detection privacy-safe?
Yes if implemented correctly. The wake-word model runs on-device against a tiny rolling audio buffer; nothing leaves the device until the trigger fires. Audit the implementation and document it in your privacy policy.
Are voice command tools HIPAA-compliant?
Some are, with the right configuration. Azure AI Speech and AWS Transcribe Medical are HIPAA-eligible; Deepgram offers HIPAA-compliant deployments under enterprise contracts. Your end-to-end pipeline (wake word, STT, LLM, storage) must all be on a HIPAA-eligible footprint with signed BAAs.
How do I handle multiple participants triggering commands at once?
Run the wake-word detector per-participant, never on the mixed audio stream. Tag each command with the speaker’s identity, dispatch only on that user’s context, and surface unambiguous confirmation (“Carla, you were muted”) so collisions are obvious.
Should I use OpenAI Realtime API or build a hybrid stack?
OpenAI Realtime is excellent for conversational, premium-tier features. For high-volume, latency-sensitive simple commands, a hybrid Deepgram-plus-deterministic-parser stack costs less and is more predictable. Many products run both: cheap path for “mute me,” expensive path for “catch me up.”
How do I evaluate accuracy across accents and languages?
Build a labelled test set drawn from your real user base (or one that mirrors it — UK English, Indian English, Spanish, Mandarin, etc.). Track command-level accuracy per cohort, fail the build when any cohort drops more than 5 points below the median. More on noisy-environment accuracy here.
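A minimal sketch of that build gate; the cohort data shape is illustrative, so feed it whatever your evaluation harness produces.

```typescript
// Fail the build when any accent cohort's command accuracy drops more than
// 5 points below the median cohort.
function checkAccentEquity(accuracyByCohort: Record<string, number>): void {
  const scores = Object.values(accuracyByCohort).sort((a, b) => a - b);
  const mid = Math.floor(scores.length / 2);
  const median =
    scores.length % 2 ? scores[mid] : (scores[mid - 1] + scores[mid]) / 2;

  for (const [cohort, score] of Object.entries(accuracyByCohort)) {
    if (median - score > 5) {
      throw new Error(
        `Accent equity gap: ${cohort} at ${score}% vs median ${median}%`,
      );
    }
  }
}

// Example: passes at a 3.5-point worst gap, throws if any cohort falls further.
checkAccentEquity({ "en-US": 97, "en-IN": 94, "en-NG": 91, "en-SG": 95 });
```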
What to Read Next
Voice agents
Build & Deploy LiveKit AI Voice Agents
A step-by-step build guide for the agent layer in this stack.
AI assistants
AI Call Assistants — A Practical Guide to Third-Party APIs
When buying beats building — APIs to embed today.
STT
5 Tips for Effective Speech-to-Text in Live Streaming
Pricing, latency and integration patterns for the STT layer.
Conferencing
12 AI Video Conferencing Features Worth Shipping
The wider feature set voice command sits inside.
Translation
7 Tools for Real-Time Multilingual Translation in Video Calls
When “translate captions” is the killer voice feature.
Ready to ship voice commands users actually keep using?
A useful voice-command layer in 2026 isn’t an LLM party trick. It’s a four-layer stack — on-device wake word, streaming STT, hybrid intent parser, instant action dispatch — engineered to a <500 ms latency budget with a privacy story you can hand procurement on the way out the door.
Pick the twelve commands that matter, not thirty that don’t. Build only the path no vendor will ship for you. Measure command accuracy, false-trigger rate and accent equity from week one. The product wins are quiet: hands-free participants, fewer misclicks, faster meetings, and one less reason to leave for a competitor.
Let’s map your voice-command roadmap together
30 minutes with a Fora Soft architect — bring your meeting product, leave with a stack picked, latency budget set and a 6–10 week ship plan.

