
A translator works with written text. An interpreter works with spoken language. AI real-time translation does both at once, with caveats. The decision between the three depends on the medium, the stakes, the legal requirements, and the budget, in that order. This guide walks the decision in five questions and supplies the cost ranges, comparison tables, and four worked examples a buyer needs to make the call without a vendor in the room.
Key takeaways
• Translators work with written text. Interpreters work with spoken language. The split is that simple and that consequential.
• The four modes of interpretation are simultaneous, consecutive, whispered (chuchotage), and sight translation. Each fits a specific setting.
• Simultaneous interpretation costs $150 to $750 per hour with a two-interpreter rotation. Consecutive runs $35 to $120 per hour and doubles the meeting length.
• AI real-time translation is appropriate for low-stakes corporate webinars, internal training, and customer-support chat in 2026. It is not appropriate for legal proceedings, medical-of-record interactions, or high-stakes public events where certified human interpretation is the standard.
• Below $0.10 per minute, AI real-time is the only viable option. Above $8 per minute, full human simultaneous interpretation remains the standard. Hybrid sits between.
Translator vs interpreter: the core distinction in 30 seconds
The vocabulary trips people up. In casual English, “translator” gets used for both jobs. In the industry, the two words name two different professions.
Translators work with text. Contracts, websites, books, manuals, marketing copy, subtitles. The work is written. It is reviewed. A second linguist often proofreads it. Mistakes can be caught, traced, and corrected before the document ships.
Interpreters work with speech. Meetings, depositions, conferences, doctor visits, hearings, broadcast. The work happens live. There is no draft. A mistake is in the room the moment it is made.
Both professions translate meaning, not words. A literal one-to-one mapping rarely survives a language boundary intact. A good translator and a good interpreter both reach for the equivalent idea, idiom, register, and emotional weight, then deliver it in the target language. The skill is the same. The medium is different.
The other thing that differs is time. Translation is asynchronous. The translator gets the source, takes hours or days, and returns a finished file. Interpretation is synchronous. The interpreter hears the speaker, processes, and delivers within seconds. Simultaneous interpreters compress that loop into a lag of one to three seconds.
If the source is on paper, on screen, or in a file, you want a translator. If the source is a voice in a room or on a call, you want an interpreter. That is the rule. Everything else is detail.
The four modes of interpretation, and which one you actually need
Interpretation comes in four forms. The right one depends on the setting.
| Mode | Setting | Latency | Equipment | Typical cost |
|---|---|---|---|---|
| Simultaneous | UN sessions, conferences, broadcast | 1–3 seconds | Booth, headsets, two interpreters | $150–$750/hour |
| Consecutive | Depositions, medical visits, meetings | Pause-and-translate (2x meeting length) | None required | $35–$120/hour |
| Whispered (chuchotage) | 1–2 listeners inside a larger meeting | 1–3 seconds | None | Similar to simultaneous, slightly lower |
| Sight translation | Courts, sworn statements read aloud | Real-time read-out | None | Rolled into consecutive hourly rate |
Simultaneous interpretation has the interpreter speaking at the same time as the speaker, with a delay of one to three seconds. Two interpreters are not a luxury; they are a practical requirement, because the cognitive load is famously brutal. AIIC, the International Association of Conference Interpreters, codifies the two-interpreter rotation as standard for sessions over 30 minutes.
Consecutive interpretation has the speaker talk for a minute or two and pause. The interpreter then delivers the translated version. No booth required. One interpreter is usually sufficient. The meeting doubles in length, because every utterance is delivered twice.
Whispered interpretation (chuchotage) has the interpreter sit next to one or two listeners and speak the translation directly into their ear, in real time. Used when a small minority of attendees needs translation inside a larger meeting. It taxes the interpreter harder than booth-based simultaneous, because there is no soundproofing and no rotation partner. Sessions are limited to 30 to 60 minutes before a break.
Sight translation has the interpreter read a written document and deliver it out loud in the target language, on the spot. Courts use this for evidence, exhibits, sworn statements presented in court that were never formally translated in advance. It is its own discipline. Most professional interpreters are trained for it.
The third option: AI real-time translation in 2026
The reason buyers are searching for this comparison in 2026 is that there is now a third option, and the existing top-ranked articles on the question barely mention it. The New York Times ran a February 2026 feature on AI language translation in conference settings. ChatGPT, Perplexity, and Google AI Overviews are all citing real-time AI translation as a viable path. The question has shifted from “which kind of human do I hire” to “do I need a human at all.”
What AI real-time translation actually is
A chain of three things, running live: speech-to-text on the source, machine translation in the middle, text-to-speech (or captions) on the output. Some systems compress all three into one model (Meta SeamlessM4T, OpenAI GPT-4o Realtime). The result, for the listener, is captions or synthesized voice in another language, with a delay of 800 milliseconds to 4 seconds.
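The three-stage chain described above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `translate_chunk`, `fake_stt`, `fake_mt`, `fake_tts`, and the tiny `GLOSSARY` are hypothetical stand-ins, and a real system would stream audio and call actual speech and translation models at each stage.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    source_text: str   # what the speech-to-text stage heard
    caption: str       # translated text, usable directly as a caption
    audio: bytes       # synthesized speech in the target language


def translate_chunk(
    audio: bytes,
    stt: Callable[[bytes], str],
    mt: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Result:
    """Run one audio chunk through speech-to-text, translation, then text-to-speech."""
    source_text = stt(audio)
    caption = mt(source_text)
    return Result(source_text, caption, tts(caption))


# Hypothetical stand-ins so the sketch runs without any external service.
GLOSSARY = {"hello": "hola", "team": "equipo"}


def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_mt(text: str) -> str:
    return " ".join(GLOSSARY.get(word, word) for word in text.split())


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = translate_chunk(b"hello team", fake_stt, fake_mt, fake_tts)
print(result.caption)  # hola equipo
```

Each stage adds its own delay and its own errors, so a cascaded pipeline's end-to-end latency is roughly the sum of the three. That is one reason single-model systems are attractive.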
Where AI real-time translation works in 2026
- Corporate webinars and town halls.
- Internal training and onboarding sessions.
- Customer-support chat.
- Live class translation in e-learning.
- Internal Slack and Teams threads.
- Pre-recorded video dubbed for global distribution.
- Multilingual customer feedback at scale.
In all these, the room is forgiving. Listeners are participating, not litigating. The cost of an occasional wrong word is annoyance, not liability.
Where AI real-time translation does not work yet
- Legal proceedings of record. Sworn testimony, depositions, immigration hearings, child custody hearings. The transcript has to stand up in court.
- Medical interactions where the record matters: informed consent, surgical pre-op, psychiatric assessment.
- High-stakes public events where the audience is paying to hear a specific speaker in a specific way. Keynotes by heads of state. Awards ceremonies.
- Diplomatic settings. Negotiation depends on subtext, hesitation, and the ability of the interpreter to read the room.
Accuracy and latency expectations
Word error rate on a major-language pair on clean conference audio sits between 8% and 15% for the leading systems in 2026. On accented English, technical jargon, or noisy audio, it climbs to 20% to 30%. The Open ASR Leaderboard at Hugging Face tracks these numbers in near-real-time for major models like Whisper-large-v3. Our vendor benchmark synthesis covers what each of the four leading platforms publishes.
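For readers who want to check a vendor's number on their own audio, word error rate is conventionally computed as substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. The sketch below assumes simple whitespace tokenization and lowercasing; the function name `word_error_rate` is illustrative, and production scoring usually normalizes punctuation and numerals first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "please sign the consent form today",
    "please sign the content form",
)
print(f"{wer:.1%}")  # 33.3%
```

In this example one substitution ("consent" heard as "content") and one dropped word against a six-word reference give a WER of two in six. Note that WER scores the transcription stage only; a perfect transcript can still be translated badly.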
The best 2026 systems deliver translated speech 1.5 to 4 seconds after the speaker. Below 800 milliseconds feels live. Above 2 seconds, listeners start to talk over the interpretation, and the experience collapses.
The decision tree: five questions
Walk it once and you have your answer.
Question 1: Is the content text, speech, or both?
- Text only. You want a translator. Skip ahead.
- Speech only. Continue.
- Both. Continue. You may end up with two workflows, one for each.
Question 2: Is real-time required, or can it be asynchronous?
- Async is fine. A translator handles written text. Recorded interpretation exists but is rare and expensive.
- Real-time required. Continue.
Question 3: What is the legal or regulatory accountability?
- Certified accuracy required. Court, immigration, sworn statements, contract-binding medical, regulated financial advice. You need a certified human interpreter. AI may be used by the legal or medical team for preparation. It does not go on the record.
- Best-effort or internal use. Continue.
Question 4: What is the audience size, and how high are the stakes?
- Fewer than 20 internal listeners, low stakes. AI real-time with light human spot-checking is usually enough.
- 20 to 500 listeners, medium stakes. Either human RSI or AI with a human QA layer on the captions.
- More than 500 listeners, high stakes. Human simultaneous interpretation is the default. AI may run as a backup or as a captioning channel for accessibility.
Question 5: What does the budget actually support?
- Under $0.10 per minute. AI real-time only. Nothing else fits.
- $0.10 to $2 per minute. AI real-time with human review on captions or a hybrid pipeline.
- $2 to $8 per minute. Hybrid: AI captions plus a human interpreter on the dominant pair, or a rotating human team on shorter events.
- Above $8 per minute. Full human RSI or in-person simultaneous interpretation.
Three of the five questions are inputs from the room. Two are inputs from the legal and finance functions. Get both groups in the conversation early and the decision falls out cleanly.
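The five questions can be written down as a short function, which is a handy way to check that a team is answering them in the same order. This is a sketch under stated assumptions: the names `Request`, `walk`, `budget_band`, and `AUDIENCE` are illustrative, Question 4 is reduced to a single stakes label that the buyer judges alongside audience size, and the guide does not say how to reconcile Questions 4 and 5 when they disagree, so the function reports both.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    medium: str               # Question 1: "text", "speech", or "both"
    real_time: bool           # Question 2
    certified: bool           # Question 3: certified accuracy is required
    stakes: str               # Question 4: "low", "medium", or "high"
    budget_per_minute: float  # Question 5, in USD


# Question 4 tiers; the guide pairs them with under 20, 20 to 500,
# and more than 500 listeners respectively.
AUDIENCE = {
    "low": "AI real-time with light human spot-checking",
    "medium": "human RSI, or AI with a human QA layer on captions",
    "high": "human simultaneous interpretation",
}


def budget_band(usd_per_minute: float) -> str:
    """Question 5. Boundary values are assigned to the lower band (an assumption)."""
    if usd_per_minute < 0.10:
        return "AI real-time only"
    if usd_per_minute <= 2:
        return "AI real-time with human review, or hybrid"
    if usd_per_minute <= 8:
        return "hybrid"
    return "full human RSI or in-person simultaneous"


def walk(req: Request) -> List[str]:
    """Walk the questions in order and return the recommendations reached."""
    if req.medium == "text" or not req.real_time:
        return ["translator"]
    advice = []
    if req.medium == "both":
        advice.append("translator for the written material")
    if req.certified:
        return advice + ["certified human interpreter"]
    return advice + [AUDIENCE[req.stakes], budget_band(req.budget_per_minute)]


print(walk(Request("speech", True, False, "low", 0.05)))
```

When the two entries in the result disagree, for example a high-stakes audience with a budget under $0.10 per minute, that mismatch is the thing to raise with legal and finance.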
Cost per minute: the reference everyone bookmarks
The numbers below are list-price bands from public agency pricing in early 2026. Negotiated rates run lower. Long event commitments and multi-day bookings run lower again.
| Option | Cost per minute | Notes |
|---|---|---|
| Human simultaneous interpretation | $4.50–$25 | $150–$750/hr per language pair, two-interpreter team |
| Human consecutive interpretation | $0.60–$2 | Single interpreter; meeting takes 2x as long |
| Hybrid (AI captions + human on dominant pair) | $1.50–$5 | Splits cost and risk |
| AI real-time translation (cloud) | $0.04–$0.20 | Cloud APIs at moderate volume |
| AI real-time translation (self-hosted) | $0.001–$0.05 | Amortized on saturated GPUs |
Human translator (text): $0.10 to $0.30 per word. For a 5,000-word website, that is $500 to $1,500 per language, or $2,500 to $7,500 across five languages. Machine-translation post-editing brings this down to $0.05 to $0.15 per word at the cost of some nuance.
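The per-word arithmetic is easy to check with a short sketch. The function name `cost_range` is illustrative, and holding rates in whole cents per word is a choice made here only to keep the sums exact.

```python
def cost_range(words: int, low_cents: int, high_cents: int, languages: int = 1):
    """Return (low, high) cost in USD for translating `words` into `languages` languages.

    Rates are whole cents per word so the totals stay exact.
    """
    total_words = words * languages
    return (total_words * low_cents / 100, total_words * high_cents / 100)


# Human translation at $0.10 to $0.30 per word, 5,000-word site.
per_language = cost_range(5_000, 10, 30)
all_five = cost_range(5_000, 10, 30, languages=5)

# Machine-translation post-editing at $0.05 to $0.15 per word, five languages.
mtpe_all_five = cost_range(5_000, 5, 15, languages=5)

print(per_language)   # (500.0, 1500.0)
print(all_five)       # (2500.0, 7500.0)
print(mtpe_all_five)  # (1250.0, 3750.0)
```

These figures cover per-word fees only. Project management, glossary setup, and review rounds are billed separately by many agencies and are not modeled here.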
For the vendor-by-vendor breakdown of AI real-time pricing, see our vendor benchmark synthesis.
Four worked examples
Pick the closest one to your situation. The recommendation falls out.
Example 1: Quarterly all-hands, 4 languages, 1,200 employees
Internal session. The CEO presents in English, with simultaneous Spanish, Portuguese, and Mandarin for distributed offices. Q&A at the end. No external attendees.
What we’d do. AI real-time captioning, primary engine Wordly or DeepL Voice. Run captions in all three target languages. Have a bilingual employee in each region spot-check the dominant language pair and flag major misses on a side channel. Record the session for replay, then run a clean pass of MT post-editing on the transcript before posting.
Cost. $400 to $800 per session in AI fees. One internal reviewer per region, no fee. Full human RSI for the same session would run $4,000 to $8,000.
Example 2: Deposition with a Spanish-speaking witness
Civil litigation. The witness will testify under oath. The transcript becomes a legal record. Opposing counsel will scrutinize every word.
What we’d do. A court-certified Spanish consecutive interpreter, in person or by video link, with credentials from the relevant state or federal certification body. No AI in the record. The legal team may use AI privately, in deposition prep or for searching prior testimony, but the official deposition is human-only.
Cost. $250 to $500 per hour, four-hour minimum, with travel where applicable. Non-negotiable. The risk of a successful objection or a struck deposition transcript dwarfs the cost difference.
Example 3: Marketing website, expand to 5 languages
Product marketing site, 8,000 source words, growing 15% per quarter. Five target languages: Spanish, French, German, Japanese, Brazilian Portuguese. Brand voice matters. SEO matters.
What we’d do. Human translators with localization specialists for each language. Use machine translation as a first pass on long-tail pages, then human post-editing. Reserve native human translation for the homepage, pricing page, top 20 landing pages, and any legal copy. Use a translation memory tool so updates do not require retranslation of unchanged paragraphs.
Cost. $0.05 to $0.15 per word on MTPE pages, $0.18 to $0.30 per word on native human translation. Total project: $8,000 to $20,000 for the initial rollout, $1,500 to $4,000 per quarter for updates.
Example 4: Multilingual customer support chat at 50,000 monthly active users
Mid-market SaaS, English-language product, customers writing in nine languages. Median ticket length 80 words. Support team speaks English plus Spanish.
What we’d do. AI real-time translation on both sides of the chat. Custom glossary for product terminology. Human reviewer flagging anything tagged as billing, security, or churn risk. Auto-escalate to a Spanish-speaking agent when the customer writes in Spanish and the issue is likely to have a high impact on CSAT.
Cost. Under $0.001 per message at scale. The flag-and-escalate workflow is the most expensive part, and it earns its keep on retention. Full human translation of every support ticket would run $0.30 to $0.80 per message, or $200,000 to $500,000 per year at this volume.
When to ask for an interpreter, when to hire a translator
A vocabulary cheat sheet for the procurement conversation.
Say “translator” when the work is written. Documents, web copy, marketing, books, subtitles, certified translations of legal documents.
Say “interpreter” when the work is spoken. Meetings, conferences, depositions, medical appointments, phone calls.
Ask about certifications. ATA for translators in the US. AIIC for conference interpreters globally. NAATI for Australia. NRPSI for the UK. State-level court interpreter certifications for legal work. CCHI or NBCMI for medical interpreters in the US.
Ask about the mode for interpretation: simultaneous, consecutive, whispered, sight. The right answer is venue-dependent and the agency will guide you. Asking the question signals you know what you are buying.
Ask about working language directions. Most interpreters work in two languages, one of which is their A (native) language. They generally interpret into A, not out of A. If you need both directions, you typically need two interpreters, or a rare “bidirectional” specialist.
For AI: ask about the language coverage, the WER on your domain, the latency from your region, the data residency, and whether the vendor offers a custom glossary endpoint. Insist on running a pilot on your own audio before signing the annual contract.
How Fora Soft fits in
We build custom real-time translation and interpretation systems. Since 2005, we have delivered 250+ video projects, shipping production-grade real-time video and AI infrastructure. The translation work includes Translinguist (multilingual event platform), Volo (real-time translation for healthcare and education), and Rafiky (remote simultaneous interpretation for international conferences).
If you have decided that AI real-time translation is the right shape for your product or your event, our vendor benchmark synthesis covers what DeepL Voice, KUDO, Interprefy, and Meta SeamlessM4T actually publish about accuracy, latency, and cost. For the architectural deep-dive on how speech translation fits into a video product, see our real-time speech translation guide. For multilingual video-call patterns specifically, see our multilingual translation in video calls reference.
If you want to talk through where you sit on the decision tree, book a 30-minute call with a senior engineer. No deck, no pitch, the architecture conversation only. Or reach the language-interpretation team directly.
FAQ
What is the difference between an interpreter and a translator?
Translators work with written text. Interpreters work with spoken language. Translators have time, revision, and review. Interpreters work in real time, with no second draft. Both translate meaning, not words.
Can AI real-time translation replace a human interpreter?
For low-stakes, internal, real-time speech in 2026, often yes. For legal proceedings, medical-of-record interactions, sworn testimony, and high-stakes public events where certified human interpretation is the standard, no. The acceptable error rate is what changes between those two worlds. Pick on the basis of the consequences of a mistake, not the cost per minute.
Is DeepL Voice as good as a human translator?
DeepL Voice is a speech translation product. It competes with human interpreters, not human translators. For written documents, DeepL Translate and human translators are the comparison. DeepL is good enough for first-draft work and internal documents in major language pairs. For legal, medical, or brand-sensitive text, native human translation remains the standard.
How much does a simultaneous interpreter cost?
$150 to $750 per hour per language pair, paid as a two-interpreter team. Half-day and full-day minimums typically apply. Equipment is separate where venues do not supply it.
What is remote simultaneous interpretation (RSI)?
Simultaneous interpretation delivered over a video platform, with interpreters working from a remote booth or home studio. Listeners select their language channel in the platform UI. Major vendors include KUDO, Interprefy, Boostlingo, and Wordly. RSI runs 20% to 40% cheaper than in-person simultaneous because there is no travel or on-site equipment.
What are the four types of interpretation?
Simultaneous (real-time, booth, conference). Consecutive (pause-and-translate, deposition). Whispered or chuchotage (real-time, no booth, one or two listeners). Sight translation (reading written text aloud in another language).
When is consecutive interpretation better than simultaneous?
When the meeting is small, the stakes are high enough that nuance matters more than speed, and there is no booth available. Depositions, medical visits, business negotiations, and many diplomatic settings use consecutive. The meeting takes twice as long, but every utterance gets two passes.
How do I choose between hiring an interpreter and using AI translation?
Walk the five-question decision tree above. If the content is speech, real-time is required, certified accuracy is not legally mandated, the audience is internal or moderate-stakes, and the budget supports it, AI is usually the right answer. Move toward a human interpreter as any of those conditions tightens.
Conclusion
Three options. Five questions. The cost ranges sit one to four orders of magnitude apart. The right one is decided by audience, stakes, and law, not by feature comparison.
If you have not yet decided whether AI fits your specific room, the vendor benchmark synthesis covers what the four leading systems actually publish. If you want to talk through the architecture, book a 30-minute call.
Published May 2026. Updated annually. For corrections or comments, contact the editorial team at eager2develop@forasoft.com.

