Published 2026-06-02 · 23 min read · By Nikolay Sapunov, CEO at Fora Soft
Why this matters
If you are a product manager, founder, or engineering lead weighing whether to add an AI notetaker, a meeting copilot, or a voice agent to your product, LiveKit is now the name you will hear first — it powers OpenAI's ChatGPT voice mode and a quarter of US 911 calls, and it raised money at a billion-dollar valuation in early 2026. That popularity makes it easy to adopt and easy to misjudge. The architecture decides what your assistant can do, and the pricing decides whether the feature pays for itself at scale. This article gives you both, in language that does not assume you have ever written WebRTC code, so you can make the build-versus-buy call before you commit, not after.
First, what LiveKit actually is
LiveKit is three things that share a name, and confusing them is the first mistake people make. Keeping them separate makes everything else clear.
The first thing is an open-source media server. A media server is the program that sits in the middle of a group call and forwards each person's audio and video to everyone else. LiveKit's server is free, published under the permissive Apache 2.0 license, written in the Go programming language, and you can run it yourself on a single machine, in Docker, or across a cluster. Technically it is a Selective Forwarding Unit, usually shortened to SFU — a router that receives each participant's media stream once and forwards copies to the others, without re-mixing them. We explain why an SFU is the standard shape for group calls in the video SDK and WebRTC AI comparison; here the point is that this server is the free, open core.
The second thing is the Agents framework. This is a separate open-source toolkit, also free, that lets you write a program — an "agent" — that joins a LiveKit room and behaves like a participant, except it is software. The framework handles the hard real-time plumbing so your code can focus on what the agent should hear, think, and say. It is the piece that turns a plain video call into one with an AI in it.
The third thing is LiveKit Cloud. This is the paid, hosted version of the server, run by the company so you do not have to operate the infrastructure yourself. You write the same code, but LiveKit runs the global network of media servers behind it and bills you for usage. Cloud is optional: the server and the Agents framework are fully open source and run on your own infrastructure at no license cost.
All of this is built on WebRTC. WebRTC, short for Web Real-Time Communication, is the open standard that lets browsers and apps send audio, video, and data to each other with very low delay. The World Wide Web Consortium, the body that standardizes the web, first published WebRTC as a full Recommendation in January 2021, and the current edition of that Recommendation is dated 13 March 2025. Every product in this article speaks that same underlying language; what differs is how much of it each one hands to you.
LiveKit in 2026: the company and the news
Because "livekit news" is one of the most common things people search before adopting it, here is the current picture, dated so you can judge how fresh it is.
LiveKit was founded in 2021 by Russ d'Sa, an early Twitter engineer, and David Zhao, a former director of engineering at Motorola. They started it as an open-source project during the pandemic-era surge in video calling, and the company is based in San Jose, California. The free server reached its version 1.0 in May 2022; the business grew on top of it.
The headline news is funding and adoption. On 22 January 2026, LiveKit announced a Series C round of $100 million at a $1 billion valuation, led by Index Ventures with participation from earlier backers Altimeter, Hanabi Capital, and Redpoint Ventures. That came about nine months after a $45 million Series B in April 2025. The reason investors care is the customer list: LiveKit powers OpenAI's ChatGPT voice mode, and its other users include xAI, Salesforce, Tesla, Spotify, Meta, Microsoft, and Character AI. By the company's own accounting it is the real-time backbone for roughly a quarter of US 911 emergency calls. On the open-source side, the core server repository carries about 19,000 GitHub stars and the Agents framework about 10,800, with the company citing well over 100,000 developers building on the platform.
The practical takeaway for a buyer is that LiveKit is no longer a bet on a small project. It is the default infrastructure layer a large share of the voice-AI industry already runs on, which lowers the risk of building on it and raises the value of understanding how it actually works.
What an "AI meeting assistant" actually is
An AI meeting assistant is a piece of software that takes part in a live call to do useful work around it: it can write down everything that was said, identify who said what, summarize the discussion, list the action items, answer questions when asked, and in some designs speak back out loud. People also call it an AI notetaker when its job is mainly transcription and summary, or a meeting copilot when it can act during the call.
The mental model that matters is simple. The assistant is not a feature bolted onto one person's screen; it is an extra attendee. When you schedule a call and "the notetaker joins," what happens technically is that a program connects to the same room as the humans, listens to the shared audio, and produces its output as text, a recording, or speech. Whether that program is one you wrote or one you rented is the build-versus-buy question this article ends on. Either way, the shape is the same: an attendee made of code.
That framing explains both the power and the limits. Because the assistant is a real participant, it can hear everyone with the same fidelity a human attendee would. But because it is a participant, it consumes the same connection resources a human does, and — as the pricing section shows — you pay for its time in the room just as you pay for everyone else's.
How an AI agent joins a meeting: the dispatch architecture
The first real architecture question is how a program comes to be sitting in a specific call. In LiveKit this is called dispatch — the process of assigning an agent to a room. Walking through it once removes most of the mystery.
Start with the pieces. A room is a single call — a named space that participants connect to. A participant is anyone connected to that room, human or agent. The agent itself runs inside an agent server, the long-running process LiveKit calls the AgentServer, which sits idle waiting for work and launches a fresh agent for each call that needs one. Crucially, each call's agent runs in its own separate operating-system process, so if one agent crashes it cannot take down the others sharing the same machine.
When a call needs an assistant, the agent server starts a job. The starting point of that job is a function in your code called the entrypoint — think of it as the front door the framework opens to hand control to you, much like a web server handing a request to your handler. Inside the entrypoint, your code connects to the room, subscribes to the participants' audio tracks (a track is one stream of media, such as one person's microphone), and begins doing its work. The job keeps running until everyone leaves the room or you shut it down.
There are three ways to trigger dispatch, and the difference matters for a meeting product. Explicit dispatch, the recommended default, means your backend asks LiveKit to send an agent to a specific room and can attach job-specific information — for example, which meeting this is or who the host is — as metadata. Token-based dispatch piggybacks on the access token a user already presents to join: you list the agent in that token, and when the first person creates the room, the agent is dispatched automatically. The catch worth remembering is that token dispatch only fires when the room is first created; if the room already exists, the instruction is ignored. Automatic dispatch sends an agent to every new room without exception, which LiveKit discourages for most products because it wastes resources and cannot carry metadata.
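The difference between explicit and token-based dispatch can be sketched with a toy model. This is not the LiveKit API: `ToyRoomService` and its methods are hypothetical names invented for illustration, and the sketch assumes that an explicit dispatch may target a room whether or not it already exists.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class ToyRoomService:
    """Hypothetical stand-in for a room service; not the LiveKit API."""

    rooms: Set[str] = field(default_factory=set)
    # Each dispatch is recorded as (room, agent name, metadata).
    dispatched: List[Tuple[str, str, str]] = field(default_factory=list)

    def join_with_token(self, room: str, agent_in_token: Optional[str]) -> None:
        """A user joins; the token may name an agent to dispatch."""
        created = room not in self.rooms
        self.rooms.add(room)
        # Token-based dispatch fires only when this join creates the room.
        if created and agent_in_token:
            self.dispatched.append((room, agent_in_token, ""))

    def explicit_dispatch(self, room: str, agent: str, metadata: str = "") -> None:
        """The backend asks for an agent in a specific room, with metadata."""
        self.rooms.add(room)
        self.dispatched.append((room, agent, metadata))


svc = ToyRoomService()
svc.join_with_token("standup", None)         # creates the room, no agent listed
svc.join_with_token("standup", "notetaker")  # room exists: instruction ignored
svc.explicit_dispatch("standup", "notetaker", '{"host": "alice"}')
```

In this toy run the second join dispatches nothing because the room already exists, while the explicit request succeeds and carries the host as metadata. That asymmetry is the reason a meeting product would usually lean on explicit dispatch.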
LiveKit reports that this dispatch path is built for scale: hundreds of thousands of new connections per second, with the time to place an agent into a room held under about 150 milliseconds. For a meeting product that means the assistant is in the room effectively as the call begins, not noticeably late.
Figure 1. An AI meeting assistant is a program dispatched into the call as an extra participant. It subscribes to everyone's audio, runs a speech pipeline, and publishes its output back into the room.
Inside the agent: the voice pipeline
Once the agent is in the room and hearing audio, the second architecture question is what happens to that audio. LiveKit's Agents framework offers three internal designs, and choosing among them is the single most consequential decision for how your assistant feels and what it costs.
The default design is called the STT-LLM-TTS pipeline, and it is a relay of three stages. Audio flows through them in order. First, speech-to-text — abbreviated STT, also called automatic speech recognition or ASR — turns the spoken words into written text; we cover the production options in the streaming ASR deep-dive. Second, a large language model, or LLM — the kind of AI that reads text and writes a reply — generates the response. Third, text-to-speech, or TTS, turns that reply back into spoken audio, the subject of our streaming TTS comparison. Each stage has a clean boundary, so you can swap one model for another without touching the rest. LiveKit's own guidance is that for most production agents this three-stage pipeline is the right default.
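The "clean boundary" between stages can be pictured as three plain functions passed into one relay. The sketch below is a conceptual model only and does not use the LiveKit Agents API; `run_turn` and the `fake_*` stages are hypothetical stand-ins, and a real pipeline streams audio rather than handling one buffer at a time.

```python
from typing import Callable, Optional, Tuple

# Each stage is just a function with a fixed input and output type.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


def run_turn(
    audio: bytes,
    stt: SpeechToText,
    llm: LanguageModel,
    tts: Optional[TextToSpeech] = None,
) -> Tuple[str, str, Optional[bytes]]:
    """Relay one utterance through the three stages in order."""
    transcript = stt(audio)   # stage 1: speech to text
    reply = llm(transcript)   # stage 2: text to reply
    # Stage 3 is optional here: passing no TTS yields text only.
    speech = tts(reply) if tts is not None else None
    return transcript, reply, speech


# Hypothetical stub stages so the sketch runs without any provider account.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"Noted: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


transcript, reply, speech = run_turn(b"ship on Friday", fake_stt, fake_llm, fake_tts)
```

Because `run_turn` only depends on the stage signatures, replacing `fake_llm` with another function changes the reply without touching the other two stages, which is the property the pipeline design is chosen for.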
The second design is a realtime model, sometimes called speech-to-speech. Here a single model — such as OpenAI's realtime model or Google's Gemini Live — takes audio in and produces audio out directly, with no separate text stages in the middle. It feels the most natural and responsive because nothing waits for a handoff between stages, and we explore the category in the speech-to-speech models article. It has one trap for a meeting product, covered in the pitfall below.
The third design is a half-cascade: a realtime model understands the incoming speech and produces a text reply, and a separate text-to-speech stage speaks it. It blends the strong listening of the realtime approach with the control of a pipeline on the output side.
Two smaller components sit around all three designs. Voice activity detection, or VAD, is a lightweight model that simply decides whether someone is speaking right now, and it stops the agent from wasting effort on silence; LiveKit ships a plugin for one such model, Silero. Turn detection, a transformer model that judges when a speaker has actually finished their thought rather than just paused, keeps the assistant from interrupting. Both exist to make the interaction feel human. LiveKit's stated target is to keep end-to-end response latency under one second, which is the threshold below which a spoken exchange feels natural; the broader budget is broken down in our sub-100ms latency article.
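To make the job of VAD concrete, here is a deliberately naive stand-in: an energy threshold over short audio frames. It is not Silero, which is a trained neural model, and the threshold value and frame size are assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in the range [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_flags(frames: Sequence[Sequence[float]], threshold: float = 0.02) -> List[bool]:
    """Flag each frame as speech (True) or silence (False) by loudness alone."""
    return [rms(frame) >= threshold for frame in frames]


silence = [0.0] * 160      # one frame of digital silence
voice = [0.5, -0.5] * 80   # one loud frame
flags = speech_flags([silence, voice, silence])
```

Only the frames flagged `True` would be worth sending on to speech-to-text. A loudness threshold is fooled by background noise, which is why production systems use a trained model for this step.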
Figure 2. Three ways to build the agent's brain. The three-stage pipeline gives the most control and live transcripts; the realtime model feels the most natural but skips interim text; the half-cascade splits the difference.
Common pitfall: teams building a notetaker reach for a realtime speech-to-speech model because it sounds the most advanced, then discover it does not emit interim transcripts — the running, word-by-word text a captioning or note-taking feature depends on. For an assistant whose whole job is the transcript, that is the wrong default. Either use the STT-LLM-TTS pipeline, which produces text at the first stage, or add a separate STT plugin alongside the realtime model so you still get the words. Picking the architecture by how impressive it sounds rather than what your feature needs is the costliest early mistake here.
Building the meeting assistant specifically
The general pipeline is for agents that converse. A meeting assistant has a narrower, more useful job, and LiveKit's building blocks map onto it directly.
The core is transcription of everyone, not just one speaker. The framework includes a multi-user transcriber pattern, an agent that produces transcripts from all participants in the room at once, which is exactly the spine of a notetaker. Because each participant's audio arrives as a separate track, the assistant transcribes them independently and can attribute lines to the right person; telling apart several people who share one microphone, such as a conference room on a single track, is the separate job of speaker diarization. The job's own logs even capture the user transcript and turn-detection data as structured JSON, so you have a record without extra wiring.
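Since each participant arrives on a separate track, attribution reduces to merging per-track transcripts by time. The sketch below assumes each track has already been transcribed into `(start_seconds, text)` pairs; the `Segment` type, `merge_tracks`, and `render` are hypothetical helpers, not part of the framework.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    start: float   # seconds from the start of the meeting
    speaker: str   # participant whose track produced the text
    text: str


def merge_tracks(tracks: Dict[str, List[Tuple[float, str]]]) -> List[Segment]:
    """Combine per-participant transcripts into one time-ordered list."""
    segments = [
        Segment(start, speaker, text)
        for speaker, items in tracks.items()
        for start, text in items
    ]
    return sorted(segments, key=lambda s: s.start)


def render(segments: List[Segment]) -> str:
    """Format segments as '[mm:ss] Speaker: text' lines."""
    lines = []
    for s in segments:
        minutes, seconds = divmod(int(s.start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {s.speaker}: {s.text}")
    return "\n".join(lines)


tracks = {
    "Ana": [(0.0, "Hi"), (65.0, "Ship Friday")],
    "Ben": [(3.5, "Hello")],
}
notes = render(merge_tracks(tracks))
```

The speaker label comes from the track itself, so no model has to guess who was talking.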
Most meeting assistants should listen, not talk. The framework supports a listen-only configuration and a text-only mode, where the same agent code runs without ever speaking — it simply consumes audio and emits text. That keeps the assistant unobtrusive and, as the pricing shows, cheaper, because it skips the text-to-speech stage entirely. When you do want the assistant to answer aloud — a copilot you can ask "what did we decide about the budget?" mid-call — you add the TTS stage back, or use a realtime model paired with a separate transcriber.
Recording is a separate concern with its own tool. LiveKit Egress is the component that records or re-streams a room and can export each participant's track on its own. A notetaker that needs the audio or video saved for later — for compliance, or to re-process the meeting afterward — uses Egress rather than trying to capture media inside the agent. Keeping recording and live transcription as distinct jobs is cleaner and matches how the pricing meters them.
Two more pieces round out a real assistant. Tool calling lets the language model take actions — create a calendar entry, file a task, look something up — when the conversation calls for it, including through the Model Context Protocol that standardizes how models reach external tools. And if your product needs to flag inappropriate content as the meeting runs, that belongs in a real-time moderation pass rather than bolted into the notetaker.
LiveKit pricing in 2026: how the model works
Now the second half of the decision. LiveKit Cloud's pricing looks busy at first because it meters several things separately, but it rests on a few ideas. Understanding the meters is more useful than memorizing the numbers, since the numbers move; everything below was taken from LiveKit's pricing page on 2 June 2026 and should be re-checked before you commit.
There are four published plans. Build is free at $0 a month and needs no credit card. Ship starts at $50 a month. Scale starts at $500 a month. Enterprise is custom-priced. Each plan bundles an allowance of every metered resource and then charges overage rates once you pass the allowance.
Three meters matter for a meeting assistant. The first is WebRTC minutes: the connection time of every participant, human or agent, measured per participant per minute. Build includes 5,000 such minutes; Ship includes 150,000 and then charges $0.0005 per minute; Scale includes 1.5 million and then charges $0.0004 per minute. The second meter is agent session minutes: the time your AI agent spends running in calls, billed on top of the connection time. Build includes 1,000 agent minutes; Ship includes 5,000 then $0.01 per minute; Scale includes 50,000 then $0.01 per minute. The third is downstream data transfer: the bandwidth LiveKit sends out to participants, measured in gigabytes. Build includes 50 GB; Ship includes 250 GB then $0.12 per GB; Scale includes 3 TB then $0.10 per GB.
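Each meter follows the same rule: usage up to the allowance is covered by the plan, and only the excess is billed at the overage rate. A minimal sketch of that rule follows, using the Ship and Scale figures quoted above. Build is left out because no overage rate is quoted for it, and treating 3 TB as 3,000 GB is an assumption.

```python
from decimal import Decimal

# meter -> (included units, overage rate in USD per unit), per plan.
# Figures are the ones quoted in the text; re-check them before relying on them.
METERS = {
    "ship": {
        "webrtc_min": (150_000, Decimal("0.0005")),
        "agent_min": (5_000, Decimal("0.01")),
        "downstream_gb": (250, Decimal("0.12")),
    },
    "scale": {
        "webrtc_min": (1_500_000, Decimal("0.0004")),
        "agent_min": (50_000, Decimal("0.01")),
        "downstream_gb": (3_000, Decimal("0.10")),  # assumes 3 TB = 3,000 GB
    },
}


def overage(plan: str, meter: str, used: int) -> Decimal:
    """USD billed for one meter beyond the plan's included allowance."""
    included, rate = METERS[plan][meter]
    return max(0, used - included) * rate


ship_webrtc = overage("ship", "webrtc_min", 200_000)   # 50,000 minutes over
scale_agent = overage("scale", "agent_min", 90_000)    # 40,000 minutes over
```

`Decimal` is used instead of floats so that small per-minute rates multiply out to exact dollar amounts.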
A few limits shape which plan you actually need. Concurrent agent sessions — how many calls can have an assistant at the same moment — are capped at 5 on Build, 20 on Ship, and up to 600 on Scale. Concurrent connections cap at 100, 1,000, and 5,000 across the tiers. So the plan choice is often driven by how many simultaneous meetings you must support, not by total minutes.
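Choosing a tier by concurrency can be written as a short lookup. The sketch uses the caps quoted above and makes two assumptions: each meeting has exactly one agent, and the agent counts toward the concurrent-connection cap the same way it counts toward WebRTC minutes.

```python
from typing import List, Tuple

# (plan, max concurrent agent sessions, max concurrent connections)
CAPS: List[Tuple[str, int, int]] = [
    ("Build", 5, 100),
    ("Ship", 20, 1_000),
    ("Scale", 600, 5_000),
]


def smallest_plan(peak_meetings: int, humans_per_meeting: int) -> str:
    """Return the first tier whose caps cover the peak simultaneous load."""
    agents = peak_meetings  # assumption: one agent per meeting
    connections = peak_meetings * (humans_per_meeting + 1)  # humans plus the agent
    for name, max_agents, max_connections in CAPS:
        if agents <= max_agents and connections <= max_connections:
            return name
    return "Enterprise"  # beyond the published caps


plan_for_pilot = smallest_plan(peak_meetings=4, humans_per_meeting=5)
plan_for_launch = smallest_plan(peak_meetings=100, humans_per_meeting=5)
```

The input that matters is the peak number of simultaneous meetings, not the monthly total, which is why a product with modest volume but a sharp morning rush can need a higher tier than its minutes suggest.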
Then there is the cost of the AI models themselves. The speech-to-text, language-model, and text-to-speech stages are not free — they are run by providers like Deepgram, OpenAI, or Cartesia. LiveKit can broker these through a feature called Inference and bundles a small credit allowance into each plan (about $2.50 of credits on Build, $5 on Ship, $50 on Scale), after which you pay the per-minute model rates. You can also bring your own provider accounts. Either way, model cost is a real line item separate from the connection and agent meters above.
Figure 3. The 2026 LiveKit Cloud tiers side by side. Pick the tier by your concurrency ceiling first, then check that the included minutes cover your volume before overage starts.
A worked example: what an AI notetaker really costs
Numbers make the model concrete. Suppose your product runs an AI notetaker that joins meetings for a mid-sized customer base. Plan for 2,000 meetings a month, each lasting 45 minutes, each with five human participants plus the one agent.
Start with the connection minutes, because every attendee — including the agent — is billed for time in the room. Count the attendees per meeting first:
5 humans + 1 agent = 6 participants per meeting
Now the total connection time across the month:
2,000 meetings
× 45 minutes
× 6 participants
= 540,000 WebRTC participant-minutes / month
Next the agent's own session time, billed separately and only for the agent:
2,000 meetings
× 45 minutes
× 1 agent
= 90,000 agent session-minutes / month
On the Scale plan, the connection minutes fit inside the included allowance but the agent minutes do not: 540,000 connection minutes is under the 1.5 million included, while 90,000 agent minutes exceeds the 50,000 included, so 40,000 minutes bill at $0.01:
40,000 overage agent-minutes
× $0.01
= $400
+ $500 monthly base
= $900 / month before model and bandwidth costs
The model cost is the part people forget, and it usually dominates. A listen-only notetaker needs speech-to-text but no text-to-speech, so take a representative streaming STT rate near $0.006 per minute of audio. The agent transcribes the meeting's spoken audio for 45 minutes per call:
2,000 meetings
× 45 minutes
× $0.006
= $540 / month in speech-to-text
Add it up and the notetaker lands near $1,440 a month plus bandwidth, and the lesson is in the proportions, not the total. The connection minutes were effectively free inside the Scale allowance; the agent overage and the per-minute model cost are what move the bill. This is why a listen-only assistant that skips text-to-speech is so much cheaper than a talking copilot: billing all 90,000 meeting minutes at a representative TTS rate of $0.03 per minute would add roughly $2,700 a month on its own, nearly tripling the cost. We work through these unit economics across features in the real cost of AI in video products.
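The whole worked example fits in one small function, which makes it easy to rerun with your own volumes. This is a sketch under the article's assumptions: the Scale plan figures quoted earlier, one agent per meeting, model cost billed for every meeting minute, and bandwidth left out. The function and constant names are hypothetical.

```python
from decimal import Decimal
from typing import Dict

# Scale plan figures as quoted in the text; re-check before relying on them.
SCALE_BASE = Decimal("500")
SCALE_WEBRTC_INCLUDED = 1_500_000
SCALE_WEBRTC_RATE = Decimal("0.0004")
SCALE_AGENT_INCLUDED = 50_000
SCALE_AGENT_RATE = Decimal("0.01")


def notetaker_monthly_cost(
    meetings: int,
    minutes: int,
    humans: int,
    stt_rate: Decimal,
    tts_rate: Decimal = Decimal("0"),
) -> Dict[str, Decimal]:
    """Monthly USD estimate on Scale, excluding bandwidth."""
    webrtc_minutes = meetings * minutes * (humans + 1)  # humans plus the agent
    agent_minutes = meetings * minutes                  # the agent only
    platform = (
        SCALE_BASE
        + max(0, webrtc_minutes - SCALE_WEBRTC_INCLUDED) * SCALE_WEBRTC_RATE
        + max(0, agent_minutes - SCALE_AGENT_INCLUDED) * SCALE_AGENT_RATE
    )
    stt = agent_minutes * stt_rate
    tts = agent_minutes * tts_rate
    return {
        "webrtc_minutes": Decimal(webrtc_minutes),
        "agent_minutes": Decimal(agent_minutes),
        "platform": platform,
        "stt": stt,
        "tts": tts,
        "total": platform + stt + tts,
    }


listen_only = notetaker_monthly_cost(2_000, 45, 5, stt_rate=Decimal("0.006"))
talking = notetaker_monthly_cost(
    2_000, 45, 5, stt_rate=Decimal("0.006"), tts_rate=Decimal("0.03")
)
```

With the example's inputs, `listen_only["total"]` comes to 1,440 and `talking["total"]` to 4,140, matching the figures in the text. The TTS line is an upper bound, since a real copilot speaks for only part of a meeting.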
LiveKit publishes its own per-minute illustration for a different shape of agent — a phone-call voice agent — that makes the layering visible: about $0.01 for the agent session, $0.01 for telephony, $0.0077 for the language model, $0.0058 for speech-to-text, $0.03 for text-to-speech, and $0.01 for observability, totalling roughly $0.0735 per minute. Your numbers will differ with your models and whether the agent speaks, but the structure — a stack of separately metered layers — is always the same.
Figure 4. Where the money goes per minute. Text-to-speech is the heaviest layer; a listen-only notetaker that drops it costs far less than a talking copilot.
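The layered per-minute figure is a plain sum, so it is easy to see what dropping a layer does. The sketch below uses the rates from LiveKit's phone-agent illustration quoted above; removing the text-to-speech layer to approximate a listen-only agent is this example's own simplification, not a figure LiveKit publishes.

```python
from decimal import Decimal
from typing import Dict, Iterable

# USD per minute, from the phone-call voice-agent illustration quoted above.
LAYERS: Dict[str, Decimal] = {
    "agent_session": Decimal("0.01"),
    "telephony": Decimal("0.01"),
    "llm": Decimal("0.0077"),
    "stt": Decimal("0.0058"),
    "tts": Decimal("0.03"),
    "observability": Decimal("0.01"),
}


def per_minute(layers: Dict[str, Decimal], skip: Iterable[str] = ()) -> Decimal:
    """Sum the per-minute rates, leaving out any layers named in `skip`."""
    skipped = set(skip)
    return sum(
        (rate for name, rate in layers.items() if name not in skipped),
        Decimal("0"),
    )


full_stack = per_minute(LAYERS)
without_tts = per_minute(LAYERS, skip=["tts"])
heaviest = max(LAYERS, key=LAYERS.get)
```

Here `full_stack` is 0.0735 and `without_tts` is 0.0435, so the speech layer alone accounts for about two fifths of the per-minute total in this illustration.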
Build on LiveKit, or buy a finished service?
LiveKit is a framework, not a finished product. That is its strength and the reason to pause before choosing it. The honest decision has three options, not two.
The first option is to build on LiveKit Cloud. You write the agent, LiveKit runs the infrastructure, and you get full control of the assistant's behavior — what it transcribes, how it summarizes, which models it uses, what actions it can take — while paying the metered rates above. This fits teams whose product is the meeting intelligence, or who need the assistant woven into an app they already control.
The second option is to self-host the open-source server and Agents framework. The software is free under Apache 2.0, so you pay only for your own servers and bandwidth, and your media never leaves infrastructure you operate — which matters for healthcare, defense, or strict data-residency rules. The cost is operational: you run, scale, and secure the media servers yourself. This fits teams with infrastructure capacity and hard privacy requirements.
The third option is to buy a finished meeting-bot service and skip building entirely. Services such as Recall.ai sell a ready-made bot that joins Zoom, Google Meet, or Teams and returns transcripts and recordings through an API, priced around $0.50 per hour of recording — no media engineering on your side. Frameworks like Pipecat from Daily, or hosted voice-agent platforms like Vapi and Retell AI, occupy the middle ground between LiveKit's build-it-yourself control and a turnkey bot. The trade is the usual one: the more finished the product you buy, the less it costs to start and the less you can change.
The decision rule is the same one that governs every build-versus-buy call in real-time video. If the assistant is a commodity feature — "let users get a transcript" — buying a finished bot is almost always faster and cheaper to launch. If the assistant is the product, or must run inside media you control, building on LiveKit (Cloud or self-hosted) is what gives you the ceiling to differentiate. Decide which of those you are before you compare prices, because the cheapest path that cannot do what your product needs is the most expensive choice you can make.
| Option | What you get | 2026 cost shape | Best when |
|---|---|---|---|
| LiveKit Cloud | Full control, hosted infra | Metered: WebRTC + agent min + bandwidth + model rates | The assistant is your product |
| LiveKit self-hosted | Full control, your servers | Free software (Apache 2.0) + your own infra | Strict privacy / data residency |
| Recall.ai (bot SaaS) | Turnkey bot joins Zoom/Meet/Teams | ~$0.50 / recording-hour | Transcripts are a commodity feature |
| Pipecat / Vapi / Retell | Voice-agent frameworks & platforms | Platform fees + model rates | Voice agents, middle-ground control |
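For a rough sense of scale, the recording-hour price in the table can be applied to the same 2,000 meetings of 45 minutes used in the worked example. This is a back-of-the-envelope sketch: it assumes the roughly $0.50 per hour figure covers everything the bot service bills, which may not hold if transcription or other features are charged separately, and it reuses the $1,440 estimate from the worked example as a fixed number.

```python
from decimal import Decimal

LIVEKIT_ESTIMATE = Decimal("1440")  # listen-only total from the worked example


def bot_service_monthly(
    meetings: int,
    minutes_per_meeting: int,
    rate_per_hour: Decimal = Decimal("0.50"),
) -> Decimal:
    """Monthly USD cost at a flat price per recording-hour."""
    hours = Decimal(meetings * minutes_per_meeting) / 60
    return hours * rate_per_hour


bought = bot_service_monthly(2_000, 45)
difference = LIVEKIT_ESTIMATE - bought
```

At this volume the flat rate works out to 1,500 recording-hours, or $750 a month, against $1,440 for the built notetaker before bandwidth. The gap is modest in either direction, which supports the decision rule above: let what the assistant must do drive the choice, then compare prices.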
Where Fora Soft fits in
We have built real-time video products since 2005 — video conferencing, e-learning, telemedicine, and live collaboration tools — and the meeting-assistant question now lands on almost every conferencing and telemedicine project we scope. We have shipped on managed platforms when speed mattered and on open frameworks like LiveKit when the AI inside the call was the product's value, including agents that join as participants to transcribe and summarize. In telemedicine the same pattern becomes an AI scribe that drafts a visit note; in e-learning it becomes a session recap; in conferencing it becomes the notetaker users now expect. The work we care about is matching the architecture and the pricing tier to the assistant you actually need, so a listen-only notetaker does not get built — and billed — like a talking copilot.
What to read next
- WebRTC + AI: Insertable Streams, Encoded Transform, And The Video SDK Comparison
- Sub-100ms Real-Time Latency Budget Engineered
- Live Captions — SFU-Side ASR Fan-Out Pattern
Talk to us / See our work / Download
- Talk to a LiveKit engineer — bring us your meeting-assistant or voice-agent idea and we will help you choose Cloud, self-host, or buy, and the right pipeline and tier: /livekit-ai-agent-development-experts
- See our work — real-time video and conferencing products we have shipped since 2005: /portfolio
- Download the LiveKit AI Meeting Assistant Architecture & Pricing Cheat Sheet: a one-page planner with the dispatch flow, the three pipeline types, the 2026 Cloud tiers, the per-minute cost formula, and the build-vs-buy checklist.
References
- W3C, "WebRTC: Real-Time Communication in Browsers" (Recommendation, 13 March 2025) — the finished core WebRTC platform standard LiveKit builds on. https://www.w3.org/TR/webrtc/
- IETF RFC 8825 (January 2021), "Overview: Real-Time Protocols for Browser-Based Applications" — the applicability statement defining the WebRTC protocol suite beneath the browser API. https://www.rfc-editor.org/rfc/rfc8825
- IETF RFC 6716 (September 2012), "Definition of the Opus Audio Codec" — the default WebRTC audio codec carried by LiveKit's media path. https://www.rfc-editor.org/rfc/rfc6716
- LiveKit — Agents framework (GitHub, livekit/agents) — "A framework for building realtime voice AI agents"; Agent / AgentSession / AgentServer / entrypoint concepts, plugin ecosystem, semantic turn detection, run modes; ~10.8K stars. https://github.com/livekit/agents
- LiveKit — Media server (GitHub, livekit/livekit) — open-source, Apache 2.0, Go, Pion-based SFU; single-binary / Docker / Kubernetes deployment; ~19K stars. https://github.com/livekit/livekit
- LiveKit Docs — Voice pipeline types — STT-LLM-TTS pipeline vs realtime model vs half-cascade; "for most production agents, an STT-LLM-TTS pipeline is the right default"; realtime models do not produce interim transcripts; sub-one-second latency target. https://docs.livekit.io/agents/models/pipelines/
- LiveKit Docs — Agent dispatch — explicit, token-based, and automatic dispatch; "Dispatch is the process of assigning an agent to a room"; <150 ms max dispatch time; token dispatch only on room creation. https://docs.livekit.io/agents/server/agent-dispatch/
- LiveKit Docs — Job lifecycle — per-job process isolation; entrypoint as job main function; JSON job logs including user transcript and turn-detection data; job runs until participants leave. https://docs.livekit.io/agents/server/job/
- LiveKit — Pricing — Build $0, Ship $50/mo, Scale $500/mo, Enterprise custom; WebRTC minutes (5,000 / 150,000 then $0.0005 / 1.5M then $0.0004), agent session minutes (1,000 / 5,000 then $0.01 / 50,000 then $0.01), downstream transfer (50 GB / 250 GB then $0.12 / 3 TB then $0.10), concurrency caps, Inference credits, and the ~$0.0735/min voice-agent calculator example. Accessed 2026-06-02. https://livekit.io/pricing
- TechCrunch (22 January 2026), "Voice AI engine and OpenAI partner LiveKit hits $1B valuation" — Series C $100M at $1B, led by Index Ventures; powers ChatGPT voice mode; customers xAI, Salesforce, Tesla; founded 2021 by Russ d'Sa and David Zhao. https://techcrunch.com/2026/01/22/voice-ai-engine-and-openai-partner-livekit-hits-1b-valuation/
- TechCrunch (10 April 2025), "LiveKit's tools help power real-time communications" — $45M Series B led by Altimeter; ~25% of US 911 calls; 500+ paying customers, 100,000+ developers; founders' backgrounds (ex-Twitter, ex-Motorola). https://techcrunch.com/2025/04/10/livekits-tools-help-power-real-time-communications/
- LiveKit Blog (18 May 2022), "LiveKit 1.0" — the open-source media server reached version 1.0; simulcast, adaptive stream, Opus DTX. https://blog.livekit.io/livekit-one-dot-zero/
- Recall.ai — Meeting bot API and pricing — turnkey bot that joins Zoom/Meet/Teams for transcripts and recordings; 5 free hours then $0.50/hour. https://www.recall.ai/
- Pipecat (Daily) — Documentation — open-source Python framework for voice and multimodal agents; self-host or Pipecat Cloud. https://docs.pipecat.ai/overview/introduction
- Vapi — Voice AI agent platform — hosted voice-agent platform; $50M Series B; SOC 2 / HIPAA / PCI compliance claims. https://vapi.ai/
- Retell AI — Voice agent platform — hosted LLM voice agents; ~600 ms latency, proprietary turn-taking, SIP trunking; HIPAA / SOC 2 Type II. https://www.retellai.com/


