Published 2026-05-31 · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

If your product talks back — a meeting assistant that reads a summary aloud, a language tutor, a customer-support voice agent, a video dub, or an accessibility narrator — you will choose a text-to-speech engine, and that one choice sets your latency, your bill, and your privacy posture all at once. This article is for the product manager, founder, or engineering lead deciding between hosting an open-source voice model and paying for a hosted API. By the end you will understand why "first audio in under 300 milliseconds" is the metric that matters, how the four leading 2026 options compare on speed and cost, the trap of measuring the wrong kind of latency, and a clear rule for when free-but-self-hosted beats paid-but-managed. The goal is that you can walk into a vendor conversation and ask the right questions instead of being sold on a single headline number.

What "Streaming" Actually Means For Speech

Start with the problem streaming solves. A text-to-speech engine — the software that turns written words into a spoken audio recording, abbreviated TTS — can work in two ways. The simple way is to take your whole sentence, generate the entire audio clip, and hand it back when it is done. The streaming way is to start sending out sound as soon as the first few words are ready, while the rest is still being made.

Think of the difference like a kettle versus a tap. The non-streaming engine is a kettle: you wait for the whole thing to boil, then you pour. The streaming engine is a tap: water starts flowing the instant you turn it, and keeps flowing. For a pre-recorded audiobook, the kettle is fine — nobody is waiting live. For a back-and-forth conversation, the kettle's wait feels like the line went dead, and you need the tap.

Why does this matter so much? Because human conversation has a rhythm. When one person stops talking, the other normally starts within about two-tenths of a second (Gradium, 2026). That gap — roughly 200 milliseconds, where a millisecond is one-thousandth of a second — is so consistent across languages that our ears treat anything much longer as a hesitation. If a voice agent takes a full second to start replying, the conversation stops feeling like a conversation and starts feeling like a phone menu.

Streaming buys back that time. Instead of waiting for the engine to finish the whole reply, the listener hears the opening words while the engine is still working on the end of the sentence. The audio plays at its own natural pace — about 150 words a minute — and as long as the engine can generate faster than the ear consumes, the listener never hears a gap.
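The condition in the last sentence, that the engine must generate faster than the ear consumes, can be checked against a recorded stream. The sketch below is illustrative only: the function name and the shape of the input (arrival time and audio duration per chunk) are assumptions, not part of any vendor's API.

```python
def playback_stays_ahead(chunks: list[tuple[float, float]]) -> bool:
    """Return True if no chunk arrives after the listener has run out of audio.

    Each chunk is (arrival_time_s, audio_duration_s), in arrival order.
    Playback is assumed to start the moment the first chunk arrives.
    """
    if not chunks:
        return True
    buffered_until = chunks[0][0]  # time at which already-received audio runs out
    for arrival, duration in chunks:
        if arrival > buffered_until:
            return False  # the listener heard silence before this chunk arrived
        buffered_until += duration
    return True


# Hypothetical stream: first audio at 0.2 s, each chunk holds 0.5 s of speech.
steady = [(0.2, 0.5), (0.5, 0.5), (1.0, 0.5)]
stalled = [(0.2, 0.5), (0.9, 0.5)]
print(playback_stays_ahead(steady), playback_stays_ahead(stalled))
```

A stream passes as long as each chunk lands before the audio already delivered has finished playing; a single late chunk is an audible gap.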

[Figure: two stacked timelines comparing batch and streaming text-to-speech for the same sentence. The batch lane generates the whole clip before any audio plays, while the streaming lane produces small chunks and delivers first audio far sooner. The gap between the two first-audio markers is labelled as the latency the listener feels, and a dashed line marks the 300-millisecond conversational threshold.]
Figure 1. Batch waits for the whole clip; streaming delivers first audio while the rest is still being made, which is the delay the listener actually feels.

Time-To-First-Audio Is The Number That Counts

There is one metric that decides whether a streaming voice feels responsive, and it is easy to get wrong. The right metric is time-to-first-audio, abbreviated TTFA: the time between sending your text and hearing the first playable piece of sound (Gradium, 2026). It is the part of the delay the listener actually feels, because once the first chunk arrives, playback begins and the engine races to stay ahead of the ear.

The metric people often quote instead is time-to-first-byte, abbreviated TTFB: the time until the first piece of data comes back. The trap is that the first bytes are frequently not sound at all. Audio files start with a header — a small block of bookkeeping data that says "this is a WAV file, here is its sample rate" — and that header carries no speech. A service can return its header in a few milliseconds and look lightning-fast on a TTFB chart, while the first actual audible sample arrives much later (Gradium, 2026). Measuring TTFB rewards the wrong thing. Always insist on time-to-first-audio, measured after the header is stripped.
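To make the difference concrete, here is a minimal sketch of measuring both numbers from the same stream. It assumes the service returns uncompressed WAV (RIFF) data and that you have recorded each chunk's arrival time yourself; the function names are hypothetical. For compressed formats such as MP3 or Opus you would need a decoder to know when the first playable frame is complete, and even here the first sample may still be digital silence.

```python
import struct
from typing import Iterable, Optional, Tuple


def wav_data_offset(buf: bytes) -> Optional[int]:
    """Return the byte offset where audio samples start in a RIFF/WAVE stream,
    or None if the 'data' chunk header has not been fully received yet."""
    if len(buf) < 12:
        return None
    if buf[:4] != b"RIFF" or buf[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    pos = 12
    while pos + 8 <= len(buf):
        chunk_id = buf[pos:pos + 4]
        (chunk_size,) = struct.unpack("<I", buf[pos + 4:pos + 8])
        if chunk_id == b"data":
            return pos + 8  # samples begin right after the 8-byte chunk header
        pos += 8 + chunk_size + (chunk_size & 1)  # RIFF chunks are word-aligned
    return None


def first_byte_and_first_audio(
    chunks: Iterable[Tuple[float, bytes]],
) -> Tuple[Optional[float], Optional[float]]:
    """Given (seconds_since_request, payload) pairs in arrival order, return
    (TTFB, TTFA): the time of the first byte and of the first audio sample."""
    ttfb = None
    buf = b""
    for elapsed, payload in chunks:
        if not payload:
            continue
        if ttfb is None:
            ttfb = elapsed
        buf += payload
        offset = wav_data_offset(buf)
        if offset is not None and len(buf) > offset:
            return ttfb, elapsed
    return ttfb, None
```

Fed a stream whose first chunk is only the header, this reports a small TTFB and a much later TTFA, which is exactly the gap a TTFB chart hides.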

[Figure: a single time axis starting at "send text". An early orange tick marks the first byte, labelled "header, no sound", and a later green tick marks first audio (TTFA), labelled "the delay the listener feels". The shaded gap between them is annotated "TTFB looks fast but carries no speech".]
Figure 2. The first byte is often just a file header. Measure time-to-first-audio, not time-to-first-byte.

How fast is fast enough? The independent 2026 benchmarks converge on a simple rule. Under 300 milliseconds of TTFA feels conversational; under 200 milliseconds is excellent; above 400 milliseconds starts to feel like waiting (Gradium, 2026). And remember that TTS sits at the end of a chain: in a voice agent, the system first listens (speech-to-text), then thinks (a language model), then speaks (TTS). Every millisecond the voice engine adds lands directly on top of everything before it, so a tight TTS budget leaves room for the rest.

Why Consistency Matters As Much As Speed

A median speed hides a second problem: how steady that speed is. Engineers measure the typical case with the median, written P50: the value that half of all requests beat. They also watch the spread, often summarized as the interquartile range, abbreviated IQR: the gap between the 25th and 75th percentiles, which is the width of the band holding the middle half of requests. A low median with a wide spread means most calls feel snappy but a meaningful fraction feel broken, and at the scale of thousands of concurrent calls, "a fraction" is a lot of unhappy users.

The 2026 data makes the point. ElevenLabs Turbo v2.5 posts a 264-millisecond median with a tight 28-millisecond spread, so almost every request feels the same (Gradium / Coval, 2026). Cartesia Sonic-3 is faster at the median, 188 milliseconds, but its spread is 100 milliseconds — over three times wider — so a real slice of requests cross the 300-millisecond line into "slow" territory (Gradium / Coval, 2026). Speed and steadiness are two different questions. Ask both.
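Both questions can be answered from your own measurements with a few lines of standard-library Python. This is a sketch under stated assumptions: the function name, the dictionary keys, and the 300-millisecond default threshold are choices made for illustration, and the quartile method ("inclusive") is one of several reasonable conventions.

```python
import statistics


def latency_summary(samples_ms: list[float], threshold_ms: float = 300.0) -> dict:
    """Summarize TTFA samples: median, interquartile range, and the share of
    requests that were slower than the conversational threshold."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    q1, q2, q3 = statistics.quantiles(samples_ms, n=4, method="inclusive")
    over = sum(1 for s in samples_ms if s > threshold_ms)
    return {
        "p50_ms": q2,
        "iqr_ms": q3 - q1,
        "share_over_threshold": over / len(samples_ms),
    }
```

The third value is the one that turns a spread into a user-facing number: the fraction of calls that actually crossed the line into "slow".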

[Figure: horizontal bar chart of median time-to-first-audio. Cartesia Sonic-3 at 188 milliseconds, ElevenLabs Turbo v2.5 at 264 milliseconds, and ElevenLabs Flash v2.5 at 288 milliseconds each carry a whisker showing their latency spread; OpenAI TTS-1-HD is shown as a truncated bar labelled "2,295 milliseconds, batch only". A dashed vertical line marks the 300-millisecond conversational threshold.]
Figure 3. Median time-to-first-audio for the hosted engines (Coval, May 2026). The whiskers show the spread; OpenAI's HD model is a batch tool, not a real-time one.

The Two Ways To Get A Voice: Host It Or Rent It

Every streaming-TTS decision comes down to one fork. You can host an open-weights model — download the model file, run it on your own hardware, pay only for that hardware — or you can rent a hosted API, where the vendor runs everything and bills you per character of text. The four options in this article line up neatly on that fork: Kokoro is the host-it option; ElevenLabs, Cartesia, and OpenAI are the rent-it options.

"Open-weights" means the trained model file is published for anyone to download and run. Kokoro goes further: its weights carry the Apache 2.0 licence, a permissive open-source licence that allows commercial use with no per-use fee and no requirement to share your own code (hexgrad, 2026). That is the heart of its appeal. Once the model runs on your server, generating a million characters of speech costs only the electricity and the rented GPU time, not a metered fee to a vendor.

Renting flips the trade. You write no infrastructure, you get the vendor's newest model automatically, and you start in an afternoon. In return you pay per character, your users' text leaves your network and travels to the vendor, and you live with whatever latency the vendor's servers and your distance from them produce. For many teams that is the right trade — until the volume gets large enough that the per-character bill dwarfs the cost of a GPU.

[Figure: a two-column comparison card titled "host it versus rent it". The host-it column, for Kokoro, lists GPU-and-electricity cost, you run the servers, audio stays in your network, 8 languages, best at very high volume. The rent-it column, for ElevenLabs, Cartesia and OpenAI, lists per-character cost, start in an afternoon, audio leaves to the vendor, 32 to 40-plus languages, best at low-to-mid volume. The winning cell of each row is lightly tinted.]
Figure 4. The host-vs-rent fork, row by row. Each tinted cell is the option that usually wins that criterion.

Meet The Four Engines

Kokoro — the free model you host yourself

Kokoro is an open-weights TTS model with 82 million internal parameters — the adjustable numbers a model learns during training (hexgrad, 2026). Eighty-two million sounds large, but in model terms it is tiny: many rival voice models are ten to a hundred times bigger. That smallness is the whole point. A small model runs fast on cheap hardware, and Kokoro reaches roughly 210 times real-time on a single consumer RTX 4090 graphics card — meaning one second of compute produces about 210 seconds of speech (hexgrad, 2026).
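A quick way to read a real-time factor is to turn it into compute time for a typical reply. The sketch below assumes the article's 150 words-a-minute speaking rate and the reported 210× figure; the function names are hypothetical, and the result covers generation of the whole reply only, not model loading, batching, or network time.

```python
def words_to_audio_seconds(words: int, words_per_minute: float = 150.0) -> float:
    """Approximate spoken duration of a reply of the given length."""
    return words / words_per_minute * 60


def synthesis_seconds(audio_seconds: float, real_time_factor: float = 210.0) -> float:
    """Compute time needed to generate `audio_seconds` of speech at a given
    real-time factor (seconds of audio produced per second of compute)."""
    if real_time_factor <= 0:
        raise ValueError("real_time_factor must be positive")
    return audio_seconds / real_time_factor


reply_audio = words_to_audio_seconds(25)  # a 25-word reply
print(f"{synthesis_seconds(reply_audio) * 1000:.0f} ms of compute")
```

On slower hardware the factor drops, so rerun the estimate with a factor measured on the machine you actually plan to deploy.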

Under the hood, Kokoro is built on two published research designs: StyleTTS 2 for shaping the voice and ISTFTNet for turning the model's internal representation into a sound wave (Li et al., 2023; Kaneko et al., 2022). Crucially, at speaking time it runs as a single forward pass with no slow iterative "diffusion" step, which is what keeps it quick (hexgrad, 2026). Version 1.0, released in January 2025, ships 54 voices across 8 languages and outputs 24-kilohertz audio — broadcast-adjacent quality from a model that fits on a laptop (hexgrad, 2026).

The economics are the headline. Served through a cloud API, Kokoro runs at under one dollar per million characters of text, or under six cents per hour of audio produced (hexgrad, 2026). Host it yourself at volume and the marginal cost falls further still. The catch is everything a managed vendor would otherwise handle: you run the servers, you scale them, you keep them up, and you accept that an 82-million-parameter model, while excellent, will not match the very top of the expressive-quality range on emotional nuance.

ElevenLabs Turbo and Flash — the fast, broad, paid pair

ElevenLabs is the most widely used commercial voice vendor, and for streaming it offers two fast models. Flash v2.5 is the speed specialist: about 75 milliseconds of model inference time, across 32 languages, at five cents per thousand characters (ElevenLabs, 2026). Turbo v2.5 trades a little speed for a little more polish and is built for interactive use. Note the gap between the marketing number and the measured one: that 75 milliseconds is inference time only — the model's own thinking — and real end-to-end TTFA, after network travel and server queuing, lands closer to 250–290 milliseconds in independent tests (ElevenLabs, 2026; Gradium / Coval, 2026).

What you are buying with ElevenLabs is breadth and steadiness. Both fast models cover 32 languages, the latency spread is tight (a 28-millisecond IQR), and the API is mature: it offers a regular endpoint, a streaming endpoint built on the web's standard server-sent-events mechanism, and a bidirectional WebSocket endpoint designed for feeding in text as a language model produces it (ElevenLabs, 2026). The ElevenLabs voice isolator is a sibling tool on the same platform: it strips background noise out of a recording to leave clean speech, billed at 1,000 characters per minute of audio and available from the Starter plan up (ElevenLabs, 2026). It is not a TTS engine, but teams building voice features often need both.

Cartesia Sonic — the latency record-holder

Cartesia's Sonic line is engineered around one goal: the lowest possible first-audio latency. It uses a state-space model architecture, a newer alternative to the transformer designs most AI models use, tuned for streaming a continuous signal smoothly (Cartesia, 2026). On the independent Coval benchmark, Sonic-3 posts the fastest median of the engines compared in this article, 188 milliseconds (Gradium / Coval, 2026). Pricing starts free with 20,000 monthly credits, then climbs through a $4 Pro tier, a $39 Startup tier, and a $239 Scale tier, with instant voice cloning from Pro and professional cloning from Startup (Cartesia, 2026).

The honest caveat, visible only in the independent data, is consistency. Sonic-3's latency spread is 100 milliseconds — wider than ElevenLabs — so while its typical request is very fast, a meaningful fraction drift toward the conversational threshold (Gradium / Coval, 2026). For a product where the worst case matters as much as the average, that spread is the number to weigh.

OpenAI TTS — convenient if you already live there

OpenAI's text-to-speech is the easy default for teams already building on OpenAI. The cost-optimized model, gpt-4o-mini-tts, runs at roughly 1.5 cents per minute of audio; the older tts-1 is $15 per million characters and tts-1-hd is $30 (OpenAI, 2026). It supports streaming and a useful range of output formats, including the low-latency Opus codec (OpenAI, 2026).

But the independent benchmarks are blunt about its place: the high-quality tts-1-hd model posts a median TTFA above two seconds, which rules it out for live conversation and marks it as a batch tool — excellent for pre-rendering narration or dubbing where nobody is waiting, wrong for a real-time agent (Gradium / Coval, 2026). OpenAI also offers no WebSocket TTS endpoint, only HTTP-based delivery, which adds a small per-turn cost in multi-turn conversations (Gradium, 2026). If you need conversational speed, look to the other three; if you need convenient batch narration inside an existing OpenAI stack, it fits.

The 2026 Comparison, Side By Side

The table below collects the numbers that decide most choices. Latency figures are independent time-to-first-audio medians from the May 2026 Coval benchmark, not vendor marketing claims; prices and language counts are read from each vendor's own pages in May 2026. The "winner" of each column is not a single product — it depends on which column your product cares about most.

| Engine | Hosted or self-host | Median TTFA (P50) | Latency spread (IQR) | Price | Languages | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| Kokoro-82M | Self-host (Apache 2.0) | <300 ms* | n/a (your hardware) | ~$1 / M chars served | 8 | High volume, privacy, cost floor |
| ElevenLabs Turbo v2.5 | Hosted API | 264 ms | 28 ms | $0.10 / 1k chars (v2/v3) | 32 | Broad languages + steady latency |
| ElevenLabs Flash v2.5 | Hosted API | 288 ms | 28 ms | $0.05 / 1k chars | 32 | Lowest-cost low-latency, 32 langs |
| Cartesia Sonic-3 | Hosted API | 188 ms | 100 ms | from $0 (20k credits free) | 40+ | Fastest median, latency-first |
| OpenAI gpt-4o-mini-tts | Hosted API | n/a | (streaming) | ~$0.015 / min | multi | Batch narration in an OpenAI stack |
| OpenAI tts-1-hd | Hosted API | 2,295 ms | 1,062 ms | $30 / M chars | multi | Batch only, not real-time |

*Kokoro's first-audio latency depends entirely on your own hardware; on a single RTX 4090 it generates ~210× faster than real time, comfortably inside the conversational budget for short replies (hexgrad, 2026).

Read the table as a set of trade-offs, not a leaderboard. Cartesia wins raw median speed but spreads wider. ElevenLabs wins steadiness and language breadth. Kokoro wins long-run cost and privacy but hands you the operations bill. OpenAI's fast model is a convenience play, and its HD model is a batch tool wearing a real-time label.

A Worked Example — What A Voice Agent Costs Per Month

Numbers in a pricing table feel abstract until you run them against real usage, so let us cost a concrete product: a customer-support voice agent that speaks 200 hours of generated audio a month.

First, convert hours of speech to characters of text, because that is what hosted APIs bill. Speech runs at about 150 words a minute, and English words average about 6 characters including the trailing space. So one minute of audio is roughly:

150 words × 6 characters = 900 characters per minute

Two hundred hours is 12,000 minutes, so the monthly text volume is:

12,000 minutes × 900 characters = 10,800,000 characters ≈ 10.8 million characters

Now price that 10.8 million characters against each option:

ElevenLabs Flash v2.5  : 10.8M × $0.05 / 1,000  = $540 / month
ElevenLabs Turbo v2.5  : 10.8M × $0.10 / 1,000  = $1,080 / month
Cartesia Sonic         : 10.8M × ~$0.04 / 1,000 = ~$432 / month
Kokoro (self-hosted)   : ~$1 per million served = ~$11 / month of compute + your ops time

The spread is the lesson. At 200 hours a month, the hosted options run from roughly $430 to $1,080, while a self-hosted Kokoro instance costs around ten dollars of GPU time plus the salary cost of the engineer who keeps it running. Below a few dozen hours a month, the hosted bill is trivial and self-hosting is not worth the operational burden. Somewhere as volume climbs, the lines cross, and the open model's near-zero marginal cost wins decisively. The break-even point is not a fixed number — it depends on your engineering cost and uptime needs — but every team at scale should know roughly where their line sits.
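The same arithmetic is easy to rerun for your own volume. The sketch below reuses the article's conversion (150 words a minute, 6 characters a word) and the per-thousand-character rates from the worked example; the function names and the fixed monthly operations cost passed to the break-even helper are assumptions you should replace with your own figures.

```python
WORDS_PER_MINUTE = 150
CHARS_PER_WORD = 6  # includes the trailing space


def monthly_characters(hours_of_audio: float) -> int:
    """Convert hours of generated speech into billable characters."""
    minutes = hours_of_audio * 60
    return round(minutes * WORDS_PER_MINUTE * CHARS_PER_WORD)


def monthly_cost_usd(characters: int, usd_per_1k_chars: float) -> float:
    """Per-character bill for one month."""
    return characters / 1_000 * usd_per_1k_chars


def break_even_hours(hosted_usd_per_1k: float, self_host_usd_per_1k: float,
                     fixed_ops_usd_per_month: float) -> float:
    """Hours of audio per month at which self-hosting (compute plus a fixed
    operations cost) matches the hosted per-character bill."""
    saving_per_1k = hosted_usd_per_1k - self_host_usd_per_1k
    if saving_per_1k <= 0:
        raise ValueError("self-hosting never breaks even at these rates")
    chars = fixed_ops_usd_per_month / saving_per_1k * 1_000
    return chars / (WORDS_PER_MINUTE * CHARS_PER_WORD) / 60


# Rates taken from the worked example above, in USD per 1,000 characters.
RATES_USD_PER_1K = {
    "ElevenLabs Flash v2.5": 0.05,
    "ElevenLabs Turbo v2.5": 0.10,
    "Cartesia Sonic": 0.04,
    "Kokoro (served)": 0.001,
}

chars = monthly_characters(200)
for engine, rate in RATES_USD_PER_1K.items():
    print(f"{engine:24s} ${monthly_cost_usd(chars, rate):,.2f} / month")
```

Calling `break_even_hours` with a hypothetical operations cost shows where the lines cross for your team; the answer moves a great deal with that one input, which is why the break-even is not a fixed number.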

How Streaming Is Wired — Chunks, Sentences, And Sockets

To get the low latency the benchmarks promise, the text has to be fed to the engine the right way, and the audio has to be carried back over the right kind of connection. Both choices are within your control, and getting them wrong can erase a fast model's advantage.

On the input side, the engine cannot start until it has enough text to say something natural. Read a single word and the engine has no idea how to inflect it; wait for a whole paragraph and you have reintroduced the kettle delay. The fix is chunking: feeding text in small batches sized so the engine can begin without stalling. ElevenLabs exposes this directly through a setting called the chunk-length schedule, a list of character counts at which it commits to generating — for example, generate after 120 characters, then 160, then 250 (ElevenLabs, 2026). It also offers an auto mode that picks those boundaries for you by reading the incoming text, and a flush command to force out whatever is buffered when you reach the end (ElevenLabs, 2026). When the text comes from a language model that emits words one at a time, this matters a great deal: chunk too eagerly and prosody suffers; chunk too late and latency suffers.
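The idea behind a chunk-length schedule can be shown with a small buffer. This is an illustration of the concept only, not ElevenLabs' implementation or API: the function name is hypothetical, it releases text purely on character counts, and a production version would also prefer to cut at sentence or clause boundaries.

```python
from typing import Iterable, Iterator, Sequence


def chunk_by_schedule(tokens: Iterable[str],
                      schedule: Sequence[int] = (120, 160, 250)) -> Iterator[str]:
    """Buffer incoming text and release it once the buffer reaches the current
    threshold in `schedule`; the last threshold repeats for the rest of the
    text. Whatever remains when the input ends is flushed."""
    if not schedule or any(n <= 0 for n in schedule):
        raise ValueError("schedule must contain positive character counts")
    buffer = ""
    step = 0
    for token in tokens:
        buffer += token
        threshold = schedule[min(step, len(schedule) - 1)]
        if len(buffer) >= threshold:
            yield buffer
            buffer = ""
            step += 1
    if buffer:
        yield buffer  # the equivalent of a final flush


# Simulated language-model output arriving one word at a time.
for piece in chunk_by_schedule(["word "] * 100):
    print(len(piece))
```

A small first threshold gets audio started early, and larger later thresholds give the engine more context for prosody once playback is already under way.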

On the transport side, there are three ways to carry the audio back, and they suit different situations. A regular request returns the whole clip at once — simplest, slowest to first sound. A server-sent-events stream, defined in the web's HTML standard, pushes audio chunks down a one-way HTTP channel as they are made, which lowers time-to-first-audio and is ideal when you already have the full text up front (WHATWG, 2026; ElevenLabs, 2026). A WebSocket, the two-way persistent connection standardized as RFC 6455, lets you keep feeding text in while audio streams back out, which is exactly the shape of a live agent reading a language model's output as it is generated (IETF, 2011; ElevenLabs, 2026). The WebSocket pays a small one-time setup cost to open the connection, but reusing one open connection across many turns saves 50 to 100 milliseconds per turn versus reopening an HTTP request each time (Gradium, 2026). For a chat-style agent, keep one socket open.

[Figure: a decision tree for choosing the transport. A central diamond asks whether you have the full text up front. The yes branch leads to a box reading "use SSE streaming" for a one-way HTTP push; the no branch, for text streaming token-by-token from a language model, leads to a box reading "use a WebSocket and reuse one connection per session". A side note records that a regular request is simplest but slowest to first audio.]
Figure 5. Choose the transport by how the text arrives: SSE when you have it all up front, a reused WebSocket when it streams from a language model.
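The rule in Figure 5 is simple enough to write down as code. The sketch below is one reading of it: the names are hypothetical, and sending batch jobs to a regular request is an assumption drawn from the note that it is the simplest option when nobody is waiting. The saving estimate uses the 50 to 100 milliseconds per turn quoted above and ignores the one-time cost of opening the socket.

```python
from enum import Enum


class Transport(Enum):
    REGULAR = "regular request"
    SSE = "server-sent events"
    WEBSOCKET = "websocket"


def choose_transport(full_text_up_front: bool,
                     listener_is_waiting: bool = True) -> Transport:
    """Pick a transport from how the text arrives."""
    if not full_text_up_front:
        return Transport.WEBSOCKET  # text streams in token by token
    if listener_is_waiting:
        return Transport.SSE  # full text known, first audio still matters
    return Transport.REGULAR  # batch job: the simplest option is fine


def reuse_saving_ms(turns: int,
                    per_turn_saving_ms: tuple[int, int] = (50, 100)) -> tuple[int, int]:
    """Range of total milliseconds saved by reusing one socket across `turns`."""
    low, high = per_turn_saving_ms
    return turns * low, turns * high


print(choose_transport(full_text_up_front=False).value)
print(reuse_saving_ms(turns=10))
```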

One more knob worth knowing: geography. ElevenLabs serves its fast models from regional data centers, and its own regional table puts time-to-first-byte at 100–150 milliseconds from North America or Europe and 150–200 milliseconds from parts of Asia, simply because the bytes travel farther (ElevenLabs, 2026). Because that is a first-byte figure, treat it as a lower bound on first audio. If your users cluster in one region, pick a vendor with a server near them, or self-host close to your audience.

A Common Mistake — Optimizing The Model And Ignoring The Pipeline

The single most frequent error teams make is to pick the model with the lowest advertised inference number and then wonder why the live product still feels sluggish. The advertised "75 milliseconds" is the model's thinking time in isolation. The number your user feels is the sum of four things: the network trip to the server, any queue wait when the service is busy, the model's generation time, and the network trip back (ElevenLabs, 2026). A 75-millisecond model reached over a slow connection, behind a busy queue, with a fresh HTTP handshake every turn, can easily feel like half a second.
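Writing the sum out makes the point hard to miss. The numbers below are hypothetical, chosen only to show how a 75-millisecond model can end up near half a second; the class and field names are illustrative, and the cost of a fresh HTTP handshake is folded into the outbound network figure.

```python
from dataclasses import dataclass


@dataclass
class TurnLatency:
    """The four components of felt TTS latency, in milliseconds."""
    network_out_ms: float
    queue_wait_ms: float
    generation_ms: float
    network_back_ms: float

    def felt_ms(self) -> float:
        return (self.network_out_ms + self.queue_wait_ms
                + self.generation_ms + self.network_back_ms)


# Hypothetical numbers for a 75 ms model reached under poor conditions.
careless = TurnLatency(network_out_ms=180, queue_wait_ms=120,
                       generation_ms=75, network_back_ms=130)
print(careless.felt_ms())  # 505
```

In this example the model accounts for well under a fifth of the delay, so shaving it further would barely change what the user feels.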

The fix is to measure the whole pipeline, end to end, from your own users' locations, using time-to-first-audio and not time-to-first-byte. Open one persistent WebSocket and reuse it. Choose a server region near your users. Tune your chunk schedule against real language-model output, not a static test sentence. A mid-pack model wired correctly beats a benchmark champion wired carelessly, every time.

Where Fora Soft Fits In

We build the products that consume streaming TTS — video conferencing tools with spoken meeting summaries, telemedicine platforms with voice-driven intake, e-learning systems with narrated lessons, and OTT services with automated dubbing. In that work the engine choice is rarely about the single fastest model; it is about matching the engine to the product's real constraints. A privacy-sensitive telemedicine deployment often points toward a self-hosted open model so audio never leaves the customer's environment, while a multilingual conferencing feature usually favours a hosted API for its language breadth and zero operations burden. The pattern we apply is to fix the latency budget first, then the privacy and language requirements, and only then compare prices — because a cheaper engine that misses the latency budget is not cheaper, it is unusable.

How To Choose — Five Questions

Work through these in order; each one narrows the field.

  1. Is this real-time or batch? If nobody is waiting live (pre-rendered narration, overnight dubbing), latency is irrelevant — pick on quality and price, and OpenAI's HD model or ElevenLabs Multilingual v2 are fine. If it is conversational, latency rules everything below.
  2. What is your monthly volume in hours? Convert to characters (≈900 per minute). Under a few dozen hours, rent. Above a few hundred, model the self-host break-even seriously.
  3. How many languages, and which? Need broad coverage? ElevenLabs (32) or Cartesia (40+). English-heavy and cost-sensitive? Kokoro's 8 languages may be enough.
  4. Can the audio leave your network? If privacy or regulation says no, self-host Kokoro. If yes, the hosted APIs are open to you.
  5. Does the worst case matter as much as the average? If a slow tail call breaks the experience at scale, weigh the latency spread, not just the median — favour the tighter spread (ElevenLabs) over the faster-but-wider one (Cartesia).

References

  1. hexgrad. Kokoro-82M (model card), Hugging Face, accessed 2026-05-31. https://huggingface.co/hexgrad/Kokoro-82M. First-party source for the model's facts: 82 million parameters, Apache 2.0 licence (commercial use permitted), decoder-only StyleTTS 2 + ISTFTNet architecture with no diffusion or encoder at inference, v1.0 released 27 January 2025 with 54 voices across 8 languages and 24 kHz output, ~210× real-time on a single RTX 4090, ~$1,000 total training cost on ~1,000 A100 GPU-hours over a few hundred hours of permissively licensed audio, and the served market rate of under $1 per million characters (≈$0.06 per hour of audio, ≈1,000 characters per minute).
  2. Li, Y., Han, C., Raghavan, V., Mischler, G., Mesgarani, N. StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models, arXiv:2306.07691, 2023, accessed 2026-05-31. https://arxiv.org/abs/2306.07691. Primary academic source for the voice-modelling architecture Kokoro builds on — style diffusion as a latent variable and adversarial training with a speech language model. Original algorithm paper, the source of truth for the design rather than any vendor summary.
  3. Kaneko, T., Tanaka, K., Kameoka, H., Seki, S. iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform, arXiv:2203.02395, 2022, accessed 2026-05-31. https://arxiv.org/abs/2203.02395. Primary academic source for the vocoder Kokoro uses to convert the model's internal representation into a sound wave quickly using an inverse short-time Fourier transform rather than a slow neural upsampler.
  4. ElevenLabs. Models and Latency optimization, ElevenLabs Documentation, accessed 2026-05-31. https://elevenlabs.io/docs/overview/models and https://elevenlabs.io/docs/eleven-api/guides/how-to/best-practices/latency-optimization. First-party source for Flash v2.5 ~75 ms inference time (inference only, not end-to-end), 32-language coverage, the three endpoint types (regular / server-sent-events streaming / bidirectional WebSocket), the chunk-length schedule, auto-mode and flush controls, the voice-type latency order (default / IVC faster than PVC), and the regional TTFB table (100–150 ms in North America / Europe, 150–200 ms in parts of Asia).
  5. ElevenLabs. ElevenAPI Pricing, accessed 2026-05-31. https://elevenlabs.io/pricing/api. First-party source for the per-character API rates used in the worked example: TTS Flash/Turbo $0.05 per 1,000 characters and Multilingual v2/v3 $0.10 per 1,000 characters.
  6. ElevenLabs. Voice isolator (capability and product guide), accessed 2026-05-31. https://elevenlabs.io/docs/overview/capabilities/voice-isolator. First-party source for the voice-isolator secondary topic: a background-noise-removal tool billed at 1,000 characters per minute of audio, available from the Starter plan up, accepting WAV/MP3/FLAC/OGG/AAC inputs up to 500 MB and one hour, exposed via REST API and SDKs — distinct from text-to-speech.
  7. Cartesia. Pricing and Sonic, accessed 2026-05-31. https://www.cartesia.ai/pricing and https://www.cartesia.ai/sonic. First-party source for Cartesia's 2026 self-serve plans (Free 20,000 credits; Pro $4; Startup $39 with professional voice cloning; Scale $239), the per-second voice-changer and one-time voice-localization credit costs, instant voice cloning from Pro, and the state-space-model architecture positioning Sonic for low-latency real-time streaming.
  8. Gradium. TTS Latency Benchmark 2026: TTFA Compared Across Gradium, ElevenLabs, Cartesia and Deepgram, 5 May 2026, accessed 2026-05-31. https://gradium.ai/content/tts-latency-benchmark-2026. Independent (vendor-published but citing the third-party Coval benchmark) source for the May 2026 time-to-first-audio medians and spreads used throughout: ElevenLabs Turbo v2.5 264 ms / 28 ms IQR / 5.2% WER, Flash v2.5 288 ms / 28 ms IQR, Cartesia Sonic-3 188 ms / 100 ms IQR, OpenAI TTS-1-HD 2,295 ms / 1,062 ms IQR; the TTFA-vs-TTFB distinction (container headers carry no audio); the ~200 ms human turn-taking gap and ~300 ms conversational threshold; and the 50–100 ms per-turn saving from WebSocket reuse. Where this benchmark and a vendor's marketing latency claim diverge (e.g. ElevenLabs' 75 ms vs the ~288 ms measured end-to-end), the article reports both and labels the marketing figure as inference-only.
  9. Coval. TTS Benchmark, accessed 2026-05-31. https://benchmarks.coval.ai/tts. Independent voice-AI evaluation platform whose continuously refreshed production-endpoint measurements underlie the latency figures cited via reference 8; not affiliated with any TTS provider.
  10. OpenAI. Text-to-speech models and API pricing (gpt-4o-mini-tts, tts-1, tts-1-hd), accessed 2026-05-31. https://platform.openai.com/docs/guides/text-to-speech and https://openai.com/api/pricing/. First-party source for OpenAI's TTS pricing (tts-1 $15 / M chars, tts-1-hd $30 / M chars, gpt-4o-mini-tts ~$0.015 per minute of audio), the voice line-ups (9 voices on tts-1/-hd, 13 on gpt-4o-mini-tts), streaming support, the output formats including low-latency Opus, and the absence of a WebSocket TTS endpoint.
  11. IETF. RFC 6455 — The WebSocket Protocol, December 2011, accessed 2026-05-31. https://www.rfc-editor.org/rfc/rfc6455. Primary standards source for the bidirectional persistent-connection protocol used for real-time text-in / audio-out TTS streaming.
  12. WHATWG. HTML Living Standard — Server-sent events, accessed 2026-05-31. https://html.spec.whatwg.org/multipage/server-sent-events.html. Primary standards source for the one-way server-push mechanism (server-sent events / EventSource) that the HTTP streaming TTS endpoints are built on.