Published 2026-06-02 · 24 min read · By Nikolay Sapunov, CEO at Fora Soft
Why this matters
Most teams that want an AI notetaker start by gluing together a transcription API and an LLM and quickly hit the hard parts: how the bot actually gets into the call, how to keep each speaker's words separate, when to run the summary, and what it all costs. This lesson removes the guesswork by giving you a reference implementation that already solved those problems the way LiveKit's own examples do, plus the production pieces those examples leave out — dispatch, recording, structured notes, and tests. If you are a founder or product lead, you get a concrete artifact to estimate and scope against. If you are an engineer, you get code you can read top to bottom in one sitting and adapt to your product.
What you are getting
The download below is the whole project, licensed Apache-2.0 like LiveKit itself, verified against livekit-agents 1.x as of June 2026. It is small on purpose (five source files, one test file, and the deployment scaffolding) because the goal is comprehension, not a framework you have to learn.
Before the code, one reminder from the LiveKit pillar lesson: a meeting assistant is not a feature bolted onto one person's screen. It is an extra attendee made of software. It connects to the same room as the humans, listens to the shared audio, and produces its output as text, a recording, or speech. Everything in this repo flows from that single idea. If the LiveKit terms below feel unfamiliar — room, participant, dispatch, agent — read that pillar first; this lesson assumes it.
Here is the whole project at a glance, before we open any single file.
Figure 1. The repository as a runtime. Each box is one source file; the arrows are the order things happen in, from a user joining to notes landing on disk.
The five source files in src/ divide the work cleanly, and the test file pins down the one part worth testing without a network. The table below is the map we will follow for the rest of the lesson.
| File | Responsibility | The one idea to remember |
|---|---|---|
| src/agent.py | Entrypoint: wires everything together | The agent server hands control to your entrypoint for each call |
| src/transcriber.py | One speech-to-text session per participant | Separate sessions keep each speaker's words attributed |
| src/notes.py | Collect lines, summarize once, write files | The summary is a single LLM pass at the end, not during the call |
| src/recording.py | Optional recording via LiveKit Egress | Recording is a separate job from transcription |
| src/token_server.py | Mint a user token and dispatch the agent | Explicit dispatch puts the agent only where you want it |
| tests/test_notes.py | Unit tests for the notes layer | The deterministic core is tested without any network |
The entrypoint: agent.py
Every LiveKit agent has one front door, and this file is it. Think of the front door like a building's reception desk: a long-running program waits there, and each time a meeting needs an assistant, reception hands the visitor — your code — the key to one specific room.
In LiveKit's framework that reception desk is the AgentServer, and the key-handoff is a function you mark with a decorator. A decorator is a one-line label, written with an @ symbol above a function, that tells the framework "call this function when a job arrives." Here is the shape of it, trimmed to the essential lines:
    from livekit.agents import AgentServer, JobContext, JobProcess, cli
    from livekit.plugins import silero

    server = AgentServer()

    def prewarm(proc: JobProcess) -> None:
        # Load the voice-activity model once per process and reuse it.
        proc.userdata["vad"] = silero.VAD.load()

    server.setup_fnc = prewarm

    @server.rtc_session(agent_name="meeting-notetaker")
    async def entrypoint(ctx: JobContext) -> None:
        ...

    if __name__ == "__main__":
        cli.run_app(server)
Three details in that small block carry real weight. First, prewarm runs once when a worker process starts, before any meeting is assigned to it, and loads a small model called voice activity detection — VAD for short, a lightweight model that simply decides whether someone is speaking right now. Loading it once and reusing it across every participant in that process saves startup time on every call. The framework runs each job in its own separate operating-system process for isolation, so this warm-up pays off per process.
Second, the agent_name="meeting-notetaker" argument is not cosmetic. In LiveKit, naming an agent in the session decorator flips it into explicit-dispatch mode: the agent will only join rooms you specifically send it to, never every room automatically. That is exactly what a meeting product wants, and it is why the token server later has to ask for the agent by that same name.
Third, cli.run_app(server) is what turns the file into a command-line program with three modes. Running python src/agent.py console lets you talk to the agent in your terminal with no front end at all. Running it with dev starts a worker against your LiveKit server with hot-reloading and readable logs. Running it with start is the production mode the Dockerfile uses, emitting machine-readable JSON logs.
Now the entrypoint body — the code that runs for each meeting:
    @server.rtc_session(agent_name="meeting-notetaker")
    async def entrypoint(ctx: JobContext) -> None:
        store = TranscriptStore(room_name=ctx.room.name)
        transcriber = MultiUserTranscriber(ctx, store, stt_model=STT_MODEL)
        egress_id = await start_room_recording(ctx.room.name)

        # Subscribe to audio only: a note-taker never needs the video tracks.
        await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
        transcriber.start()

        async def on_shutdown() -> None:
            await transcriber.aclose()
            await stop_recording(egress_id)
            if store.is_empty:
                return
            summary = await summarize_transcript(store)
            write_notes(NOTES_DIR, store, summary)

        ctx.add_shutdown_callback(on_shutdown)
Read it as a short story. The agent creates an empty notebook (the store), hires a transcriber that will write into it, and optionally starts a recorder. It then connects to the room asking only for audio — never video, because a notetaker has no use for the picture, and skipping video saves bandwidth you would otherwise pay for. It starts transcribing. Finally it registers a shutdown callback: a function the framework promises to run when everyone has left the room. That callback closes the transcriber, stops the recorder, and — only if anything was actually said — runs the summary and writes the notes. The real file adds a try/except around the summary so a failed summary never costs you the raw transcript; losing the words because the summary broke would be the worst possible outcome.
The shape to internalize is that the agent does almost nothing clever during the call. It just collects. The intelligence happens once, at the end. That choice has cost and quality consequences we return to later.
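The try/except guard described above can be sketched in isolation. This is a minimal illustration, not the repo's actual code: `finalize_meeting`, the `summarize` callable, and the `save` callable are hypothetical names introduced here, and the only behavior taken from the text is "skip empty meetings, and never lose the transcript because the summary failed."

```python
import asyncio
import logging
from typing import Awaitable, Callable, List, Optional, Tuple

logger = logging.getLogger("notetaker")


async def finalize_meeting(
    transcript_text: str,
    summarize: Callable[[str], Awaitable[dict]],    # hypothetical: wraps the LLM call
    save: Callable[[str, Optional[dict]], None],    # hypothetical: writes the notes files
) -> Optional[dict]:
    """Summarize if possible, but always persist the raw transcript."""
    if not transcript_text.strip():
        return None  # nothing was said, so nothing is written
    summary: Optional[dict] = None
    try:
        summary = await summarize(transcript_text)
    except Exception:
        # A failed summary must not cost us the words themselves.
        logger.exception("summary failed; saving the transcript without it")
    save(transcript_text, summary)
    return summary


async def _broken_summarizer(text: str) -> dict:
    # Stand-in for a model outage.
    raise RuntimeError("model unavailable")


saved: List[Tuple[str, Optional[dict]]] = []
result = asyncio.run(
    finalize_meeting(
        "alice: ship on Friday",
        _broken_summarizer,
        lambda text, summary: saved.append((text, summary)),
    )
)
```

Even with the summarizer raising, `saved` ends up holding the transcript with a `None` summary, which is the outcome the paragraph above argues for.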
The heart: transcriber.py
This is the file that makes the notetaker a notetaker, and it is worth slowing down for. The problem it solves sounds simple and is not: in a call with five people, how do you end up with five clearly separated streams of "who said what," rather than one jumbled transcript?
The naive approach — point one transcriber at the room's mixed audio — gives you a wall of text with no reliable way to tell speakers apart. You would then need a second system, called speaker diarization, to guess who spoke each line after the fact; we cover that harder path in the Pyannote diarization lesson. This repo sidesteps that entirely with a cleaner trick: give every participant their own transcriber.
That works because of how LiveKit delivers audio. Each participant's microphone arrives as its own separate stream, called a track, forwarded by the media server we describe in the WebRTC and video SDK comparison. Because the streams are already separate at the source, you can attach one speech-to-text session to each one, and every line it produces is automatically attributed to the right person — no guessing required. This is the canonical LiveKit "multi-user transcriber" pattern, and the repo follows the official example closely, then adds the piece the example omits: forwarding the finalized lines into shared storage.
Figure 2. Why one session per participant beats one session for the room. Separate tracks in means already-labeled lines out, with no separate "who spoke?" step.
The per-participant agent itself is tiny. It is a speech-to-text-only agent that, every time a speaker finishes a thought, records the line and then deliberately stays silent:
    class ParticipantTranscriber(Agent):
        def __init__(self, *, participant_identity, store, stt_model):
            super().__init__(instructions="not-needed", stt=inference.STT(stt_model))
            self.participant_identity = participant_identity
            self._store = store

        async def on_user_turn_completed(self, chat_ctx, new_message) -> None:
            text = (new_message.text_content or "").strip()
            if text:
                self._store.add(speaker=self.participant_identity, text=text)
            # Suppress the default reply: this agent only listens.
            raise StopResponse()
The single most important line is raise StopResponse(). By default, a LiveKit agent that hears a completed turn will try to answer it — run a language model and speak back. For a notetaker that would be a disaster: the bot would start replying to everyone. Raising StopResponse is how you tell the framework "I heard it, I am done, do not generate a reply." It is what converts a conversational agent into a pure listener. The inference.STT(stt_model) call routes the audio through LiveKit Inference, the brokered way to reach a provider like Deepgram, with the model named in one environment variable (deepgram/nova-3 by default); the streaming ASR lesson compares the provider options.
The surrounding MultiUserTranscriber class is the manager. It listens for two room events — someone joined, someone left — and keeps a dictionary of one transcription session per person:
    def start(self) -> None:
        self.ctx.room.on("participant_connected", self._on_participant_connected)
        self.ctx.room.on("participant_disconnected", self._on_participant_disconnected)
        # Handle anyone already in the room when the agent joins.
        for participant in self.ctx.room.remote_participants.values():
            self._on_participant_connected(participant)
That last loop matters more than it looks. Events only fire for the future; if three people are already in the room when the agent arrives, no "connected" event ever fires for them. Iterating the existing participants on start is how the agent picks up everyone who was already there. Forgetting that loop is a classic bug — the agent transcribes latecomers perfectly and silently ignores whoever was early.
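The bug is easy to reproduce without LiveKit at all. The sketch below uses a toy room (`ToyRoom` is a hypothetical stand-in, not the LiveKit `Room` class) to show that a handler registered late only sees later joins unless you also iterate the participants already present.

```python
from typing import Callable, Dict, List


class ToyRoom:
    """Hypothetical stand-in for a room; NOT the LiveKit Room class."""

    def __init__(self, already_here: List[str]) -> None:
        self.remote_participants: Dict[str, str] = {p: p for p in already_here}
        self._handlers: List[Callable[[str], None]] = []

    def on(self, event: str, handler: Callable[[str], None]) -> None:
        if event == "participant_connected":
            self._handlers.append(handler)

    def join(self, identity: str) -> None:
        # Events fire only for joins that happen after registration.
        self.remote_participants[identity] = identity
        for handler in self._handlers:
            handler(identity)


def start_tracking(room: ToyRoom, catch_up: bool) -> List[str]:
    tracked: List[str] = []
    room.on("participant_connected", tracked.append)
    if catch_up:
        # The loop from start(): pick up everyone who was already there.
        for identity in room.remote_participants.values():
            tracked.append(identity)
    return tracked


room = ToyRoom(already_here=["alice", "bob"])
without_loop = start_tracking(room, catch_up=False)
with_loop = start_tracking(room, catch_up=True)
room.join("carol")
```

After `carol` joins, `without_loop` contains only the latecomer, while `with_loop` contains all three participants.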
When a participant joins, the manager starts a session for them. The session options are where the listen-only nature is enforced:
    await session.start(
        agent=ParticipantTranscriber(...),
        room=self.ctx.room,
        room_options=room_io.RoomOptions(
            audio_input=True,    # listen to this participant
            text_output=True,    # publish live captions back to the room
            audio_output=False,  # never speak: this is a note-taker
            text_input=False,
            participant_identity=participant.identity,
        ),
    )
Read those flags as the agent's contract with the room. It takes audio in (audio_input=True), it may publish text out so the meeting can show live captions if the UI wants them (text_output=True), it never produces audio (audio_output=False), it ignores typed text input (text_input=False), and participant_identity binds the session to exactly one participant. Set audio_output=True and you would have a talking bot; leaving it False is the whole point. When the meeting ends, aclose() cancels the background tasks and drains each session cleanly so no half-written line is lost.
The intelligence, applied once: notes.py
This file holds two responsibilities that are easy to confuse: collecting the transcript as the call runs, and summarizing it once at the end. Keeping them separate is what keeps the design cheap and predictable.
The collector is a small in-memory store. Each line records who said it, what they said, and how many seconds into the meeting they said it:
    @dataclass
    class TranscriptLine:
        speaker: str
        text: str
        t: float  # seconds since the meeting started

    # Excerpt: a method of the store class (rest of the class omitted).
    def add(self, speaker: str, text: str) -> None:
        self.lines.append(
            TranscriptLine(speaker=speaker, text=text, t=time.monotonic() - self.started_at)
        )
The timestamp uses a monotonic clock — a clock that only ever counts forward and is immune to the system time being adjusted mid-call — so elapsed times never go backward even if the server's wall clock changes. The store also exposes the de-duplicated list of speakers and a plain-text rendering of the whole transcript, used both as the prompt for the summary and as the human-readable block in the notes file.
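For readers who want to see the collector end to end, here is a minimal sketch of such a store. It is an assumption-laden reconstruction, not the repo's file: the method names `speakers` and `as_text`, the `[mm:ss]` rendering, and the injectable `clock` parameter are all hypothetical choices made so the example can run deterministically.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TranscriptLine:
    speaker: str
    text: str
    t: float  # seconds since the meeting started


def _mmss(seconds: float) -> str:
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


class TranscriptStore:
    """In-memory collector; `clock` is injectable so tests need no real time."""

    def __init__(self, room_name: str, clock: Callable[[], float] = time.monotonic) -> None:
        self.room_name = room_name
        self._clock = clock
        self.started_at = clock()
        self.lines: List[TranscriptLine] = []

    @property
    def is_empty(self) -> bool:
        return not self.lines

    def add(self, speaker: str, text: str) -> None:
        self.lines.append(TranscriptLine(speaker, text, self._clock() - self.started_at))

    def speakers(self) -> List[str]:
        # dict.fromkeys de-duplicates while keeping first-seen order.
        return list(dict.fromkeys(line.speaker for line in self.lines))

    def as_text(self) -> str:
        return "\n".join(f"[{_mmss(l.t)}] {l.speaker}: {l.text}" for l in self.lines)


# A fake clock: the first tick is the meeting start, the rest are add() calls.
ticks = iter([100.0, 105.0, 170.0, 171.5])
store = TranscriptStore("standup", clock=lambda: next(ticks))
store.add("alice", "Ship Friday.")
store.add("bob", "I will write the guide.")
store.add("alice", "Thanks.")
```

Injecting the clock is a design choice for testability; in production the default `time.monotonic` gives the forward-only behavior described above.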
The summary is one language-model call, made after the meeting, not during it. This is the design decision with the biggest cost impact in the whole repo, so it is worth stating plainly. The agent does not ask an LLM to "keep summarizing" as people talk. It waits until the call is over, then sends the entire transcript through the model exactly once and asks for structured output:
    resp = await client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SUMMARY_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
The response_format={"type": "json_object"} setting asks the model to return syntactically valid JSON rather than free prose, so the result drops straight into the rest of the program without fragile text parsing. JSON mode constrains the syntax, not the shape, so the prompt does the rest: it names exactly five keys it wants back (a short overview; the topics; the decisions; the action items, each with an owner and an optional due date; and any open follow-ups) and instructs the model never to invent facts the transcript does not support. The code still wraps the parse in a guard: if the model ever returns something that is not valid JSON, the json.loads(raw) call raises, the error is caught, and the raw text is kept rather than crashing the run.
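The parse guard can be sketched as a small pure function. This is one plausible shape, not the repo's implementation: `parse_summary` and the `parse_error` flag are hypothetical, and the key names are assumptions (the test file shows `summary`, `decisions`, `action_items`, and `follow_ups`; `topics` is a guess at the fifth key).

```python
import json
from typing import Any, Dict

# Assumed key names; only four of them appear in the lesson's test excerpt.
LIST_KEYS = ("topics", "decisions", "action_items", "follow_ups")


def parse_summary(raw: str) -> Dict[str, Any]:
    """Turn the model's reply into a dict with a predictable shape."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        data = None
    if not isinstance(data, dict):
        # Not a JSON object: keep the model's text instead of crashing the run.
        fallback: Dict[str, Any] = {"summary": (raw or "").strip(), "parse_error": True}
        fallback.update({key: [] for key in LIST_KEYS})
        return fallback
    parsed: Dict[str, Any] = {"summary": str(data.get("summary", "")), "parse_error": False}
    for key in LIST_KEYS:
        value = data.get(key)
        # Missing or mistyped sections become empty lists, never KeyErrors.
        parsed[key] = value if isinstance(value, list) else []
    return parsed


good = parse_summary('{"summary": "Ship Friday.", "decisions": ["Ship on Friday"]}')
bad = parse_summary("Sure! Here are your notes:")
```

Normalizing missing sections to empty lists means the Markdown renderer downstream never has to check whether a key exists.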
Figure 3. One pass, two outputs. The transcript becomes a structured summary in a single call, then lands as both Markdown for people and JSON for systems.
Writing the output is the last step. The repo produces two files per meeting, named by room and timestamp: a Markdown file with sections a human will read — Summary, Decisions, Action items, Follow-ups, and the full transcript — and a JSON file carrying the same content in a shape another program can consume. Producing both is deliberate: the Markdown serves the person who attended; the JSON serves the system that files the action items into your task tracker. The filename is sanitized so a room called Q3 / planning cannot escape the notes folder, a small safety habit worth copying.
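The filename sanitizing mentioned above might look like the following. It is a sketch under stated assumptions: `safe_stem`, `notes_paths`, the allowed character set, and the timestamp format are all hypothetical, since the lesson does not show the repo's actual naming code.

```python
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Tuple


def safe_stem(room_name: str) -> str:
    """Reduce a room name to characters that cannot form a path."""
    # Collapse every run of disallowed characters (including '/') into one dash.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "-", room_name).strip("-.")
    return stem or "room"  # never return an empty filename


def notes_paths(notes_dir: Path, room_name: str, when: datetime) -> Tuple[Path, Path]:
    base = f"{safe_stem(room_name)}-{when.strftime('%Y%m%dT%H%M%SZ')}"
    return notes_dir / f"{base}.md", notes_dir / f"{base}.json"


md_path, json_path = notes_paths(
    Path("notes"),
    "Q3 / planning",
    datetime(2026, 6, 2, 9, 30, 0, tzinfo=timezone.utc),
)
```

Because slashes never survive `safe_stem`, both paths stay direct children of the notes folder regardless of what the room was called.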
The optional extra: recording.py
Recording feels like it belongs inside the transcriber, and one of the cleanest decisions in this repo is that it does not. Transcription and recording are kept as two separate jobs, because that is how LiveKit is built and how LiveKit Cloud charges for them.
Recording uses a LiveKit component called Egress — the part of the system whose job is to export or re-stream what happens in a room. The agent never tries to capture media itself; it asks Egress to record the room composite (everyone, mixed into one file) and to upload it to your own storage:
    request = api.RoomCompositeEgressRequest(
        room_name=room_name,
        layout="speaker",
        audio_only=False,
        file_outputs=[file_output],  # note: a list, not a single 'output'
    )
    async with api.LiveKitAPI() as lkapi:
        info = await lkapi.egress.start_room_composite_egress(request)
    return info.egress_id
The whole feature is gated behind one environment variable, RECORD_MEETINGS, and switched off by default, because recording carries privacy and consent obligations you should opt into deliberately, not by accident. If it is on but no storage bucket is configured, the code logs a warning and skips recording rather than crashing the meeting — a notetaker that records nothing is still useful; one that dies because a bucket was missing is not.
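The gating logic is simple enough to isolate and test. In this sketch only `RECORD_MEETINGS` comes from the lesson; the function name `recording_bucket`, the `RECORDING_BUCKET` variable, and the set of accepted truthy values are assumptions made for illustration.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger("notetaker.recording")

_TRUTHY = {"1", "true", "yes", "on"}  # assumed spelling of "enabled"


def recording_bucket(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the bucket to record into, or None if recording should be skipped."""
    if env.get("RECORD_MEETINGS", "").strip().lower() not in _TRUTHY:
        return None  # off by default: recording is an explicit opt-in
    bucket = env.get("RECORDING_BUCKET", "").strip()  # hypothetical variable name
    if not bucket:
        # Misconfiguration degrades to "no recording", never to a dead meeting.
        logger.warning("RECORD_MEETINGS is on but no bucket is configured; skipping")
        return None
    return bucket


enabled = recording_bucket({"RECORD_MEETINGS": "true", "RECORDING_BUCKET": "meeting-media"})
misconfigured = recording_bucket({"RECORD_MEETINGS": "true"})
```

The caller would start Egress only when this returns a bucket name, which keeps the "warn and continue" rule in one place.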
Common pitfall: the missing output field. A real bug that has bitten many LiveKit teams is constructing the recording request with a single output= field and watching LiveKit Cloud reject it with invalid_argument: missing field: output. The current, correct API takes a list, file_outputs=[...], and this repo uses the list form on purpose. If you adapt the code and recording fails on Cloud with that exact error, you have reverted to the old singular field. The same separation-of-concerns logic applies to a subtler trap: do not try to grab audio frames inside the agent to "save a recording", because that mixes two billed meters and two failure modes into one fragile path. Let Egress record; let the agent transcribe.
A nice property worth knowing: a room-composite recording is tied to the room's lifecycle and stops on its own when the last participant leaves. The repo still calls stop_recording explicitly on shutdown, which is harmless belt-and-suspenders and makes the intent obvious to whoever reads the code next.
Getting the agent into the call: token_server.py
So far the agent can run, but nothing has put it into a specific meeting. That is the job of this last source file, and it does two things in one HTTP request.
A user's app needs two things to join a LiveKit room: the server's address and a signed access token — a short-lived credential, like a numbered ticket, that says "this person may join this specific room." The token server mints that ticket. But in the same request, before handing back the ticket, it also dispatches the notetaker into that room:
    @app.get("/token")
    async def token(room: str, identity: str) -> dict:
        # 1. Dispatch the note-taker agent into the room.
        async with api.LiveKitAPI() as lkapi:
            await lkapi.agent_dispatch.create_dispatch(
                api.CreateAgentDispatchRequest(agent_name=AGENT_NAME, room=room)
            )
        # 2. Mint a join token for the human user.
        grant = api.VideoGrants(room_join=True, room=room)
        jwt = (
            api.AccessToken(os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"])
            .with_identity(identity).with_name(identity).with_grants(grant).to_jwt()
        )
        return {"serverUrl": os.environ["LIVEKIT_URL"], "roomName": room, "token": jwt}
This is explicit dispatch, and it is the recommended pattern for a reason explained in the LiveKit pillar: it sends the agent only to rooms you choose, and lets you attach metadata (which meeting this is, who the host is) so the agent knows its context. The alternative, token-based dispatch, only fires when a room is first created and silently does nothing if the room already exists; explicit dispatch has no such trap. One caveat about the trimmed endpoint above: as written it creates a dispatch on every request, so if every attendee fetches a token through it, guard the call (for example by checking whether the room already has a dispatch for this agent) to avoid sending in a second notetaker. Note that the API key and secret are read from environment variables and never returned to the browser: the server signs the token, the client only receives it.
Your front end calls this one endpoint, receives the server URL and token, and connects. The notetaker is already on its way in; LiveKit's dispatch documentation cites placement in under 150 ms. From the user's point of view, the assistant is simply there when the call starts.
Proving the core works: tests/test_notes.py
A note on testing, because it reflects how to think about reliability in an AI feature. You cannot easily unit-test the parts that depend on a live network and a paid model — the transcription and the LLM summary. But the parts that must never break — collecting lines, attributing speakers, rendering the notes file — are pure logic, and those are exactly what the test file pins down:
    def test_render_markdown_has_sections():
        summary = {
            "summary": "The team agreed to ship on Friday.",
            "decisions": ["Ship the note-taker on Friday."],
            "action_items": [{"owner": "bob", "task": "Write the deployment guide.", "due": None}],
            "follow_ups": [],
        }
        md = render_notes_markdown(_store(), summary)
        assert "## Summary" in md
        assert "**bob**" in md
The tests run with no API keys and no internet, in well under a second. That split — test the deterministic core hard, treat the model as an external dependency you mock or skip — is the right default for any AI feature, and it is why the notes layer was written as plain functions over a plain data structure rather than tangled into the live agent.
Running it, and what it costs
Five steps take you from clone to talking to the agent. Install the dependencies with uv sync, copy .env.example to .env.local and fill in your LiveKit, Deepgram, and OpenAI keys, download the VAD model weights once with python src/agent.py download-files, then run python src/agent.py console to talk to it in your terminal or python src/agent.py dev to run it as a worker against your server. Join the room from any LiveKit front end, talk, and leave — the notes appear under notes/.
Figure 4. From your laptop to production. The same code runs in three local modes and deploys three ways; the cost meters at the bottom are what you actually pay.
Deploying to LiveKit Cloud is two commands once the livekit.toml and Dockerfile are in place: lk agent create the first time, which registers the agent and writes its ID into livekit.toml, then lk agent deploy for every version after. LiveKit Cloud builds the container image for you and runs it. Or build the Docker image yourself and run it on any infrastructure with docker run --env-file .env.local meeting-notetaker.
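The cost shape is the part to plan around, and because this is a listen-only notetaker it is the cheap shape. LiveKit Cloud meters three things separately (WebRTC participant-minutes, agent session-minutes, and downstream data transfer), and the speech-to-text model is billed on top. The worked example covers the two per-minute meters and the speech-to-text line; data transfer varies with what each client receives and is left out. Take 2,000 meetings a month, 45 minutes each, five humans plus the one agent.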
Connection time is billed for every attendee, agent included:
2,000 meetings × 45 minutes × 6 participants
= 540,000 WebRTC participant-minutes / month
The agent's own running time is billed separately, for the agent only:
2,000 meetings × 45 minutes × 1 agent
= 90,000 agent session-minutes / month
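And the speech-to-text, at a representative streaming rate near $0.006 per minute of audio transcribed, counting one transcribed minute per meeting minute (if your provider bills each participant's stream for the full call, multiply by the number of humans):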
2,000 meetings × 45 minutes × $0.006
= $540 / month in speech-to-text
The lesson is in the proportions, not the total. Because the agent never speaks, there is no text-to-speech stage — and text-to-speech is normally the single heaviest cost layer, often near $0.03 per minute. Dropping it is why a listen-only notetaker can cost less than half of what a talking copilot would. The full tier-by-tier pricing, with the Build, Ship, and Scale allowances, lives in the LiveKit architecture and pricing pillar, and we model these unit economics across features in the real cost of AI in video products.
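The arithmetic above is easy to turn into a small calculator you can rerun with your own numbers. This is a sketch: `Workload` and `stt_cost_usd` are hypothetical names, and the $0.006 rate is the representative figure from the text, not a quoted price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    meetings_per_month: int
    minutes_per_meeting: int
    humans_per_meeting: int
    agents_per_meeting: int = 1

    @property
    def meeting_minutes(self) -> int:
        return self.meetings_per_month * self.minutes_per_meeting

    @property
    def participant_minutes(self) -> int:
        # Connection time is billed for every attendee, agent included.
        return self.meeting_minutes * (self.humans_per_meeting + self.agents_per_meeting)

    @property
    def agent_session_minutes(self) -> int:
        # The agent's own running time, metered separately.
        return self.meeting_minutes * self.agents_per_meeting


def stt_cost_usd(transcribed_minutes: float, rate_per_minute: float = 0.006) -> float:
    """Speech-to-text cost at an assumed per-minute streaming rate."""
    return round(transcribed_minutes * rate_per_minute, 2)


workload = Workload(meetings_per_month=2000, minutes_per_meeting=45, humans_per_meeting=5)
stt_monthly = stt_cost_usd(workload.meeting_minutes)
```

`stt_cost_usd` takes transcribed minutes rather than meeting minutes on purpose: if a provider bills each participant's stream for the whole call, pass `workload.meeting_minutes * workload.humans_per_meeting` to get the upper bound instead.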
Build on this, or buy a finished bot?
This repo is the "build" answer, and it is the right one when the assistant is part of your product or must run inside media you already control — your own conferencing app, a telemedicine platform, an e-learning room. You own the behavior, the models, and the data path.
If all you need is "let users get a transcript of their Zoom call," buying a finished meeting-bot service such as Recall.ai — a turnkey bot that joins Zoom, Google Meet, or Teams and returns transcripts through an API for around $0.50 per recording-hour — is faster and cheaper to launch than running any infrastructure. The trade is the usual one: the more finished the product you buy, the less you can change. This repo exists for the case where changing it is the point.
| Path | What you do | Best when |
|---|---|---|
| This repo on LiveKit Cloud | Run the agent; LiveKit hosts the media | The notetaker lives inside your own app |
| This repo self-hosted | Run the open-source server and the agent | Strict privacy or data-residency rules |
| Buy a meeting bot (e.g. Recall.ai) | Call an API; a bot joins Zoom/Meet/Teams | A transcript is a commodity add-on |
Where Fora Soft fits in
We have built real-time video products since 2005 — video conferencing, e-learning, telemedicine, and live collaboration — and the notetaker is now a near-default request on conferencing and telemedicine projects we scope. We have shipped agents that join calls as participants to transcribe and summarize, and the architecture in this repo is the same one we reach for when the meeting intelligence has to live inside a client's own product rather than a third-party bot. In telemedicine the same pattern becomes an AI scribe that drafts a visit note; in e-learning it becomes a session recap; in conferencing it is the notetaker users now expect. The engineering judgment we care about is matching the design to the need — keeping a notetaker listen-only so it is not built, or billed, like a talking copilot.
What to read next
- LiveKit Real-Time AI Meeting Assistant — Architecture And Pricing
- Streaming ASR In Production — Deepgram, Whisper, AssemblyAI
- Live Captions — SFU-Side ASR Fan-Out Pattern
Talk to us / See our work / Download
- Talk to a LiveKit engineer: bring us your notetaker, scribe, or meeting-copilot idea and we will help you shape the agent, the pipeline, and the deployment: /livekit-ai-agent-development-experts
- See our work: real-time video and conferencing products we have shipped since 2005: /portfolio
- Download the full repository: the complete, runnable LiveKit meeting note-taker project (five source files, tests, Dockerfile, and deployment config), Apache-2.0: Download the repo (.zip) (full repository available on GitHub). A one-page repo cheat sheet (PDF) summarizes the six files, the quickstart, the listen-only rule, and the build-vs-buy checklist.
References
- W3C, "WebRTC: Real-Time Communication in Browsers" (Recommendation, 13 March 2025): the finished core WebRTC platform standard LiveKit and this agent build on. https://www.w3.org/TR/webrtc/
- IETF RFC 8825 (January 2021), "Overview: Real-Time Protocols for Browser-Based Applications": the applicability statement defining the WebRTC protocol suite beneath the browser API. https://www.rfc-editor.org/rfc/rfc8825
- IETF RFC 6716 (September 2012), "Definition of the Opus Audio Codec": the default WebRTC audio codec carried on each participant's track that this agent subscribes to. https://www.rfc-editor.org/rfc/rfc6716
- LiveKit, Agents framework (GitHub, livekit/agents): "A framework for building realtime voice AI agents"; AgentServer, @server.rtc_session, entrypoint, prewarm, run modes. https://github.com/livekit/agents
- LiveKit Agents, Multi-user transcriber example (examples/other/transcription/multi-user-transcriber.py): the canonical per-participant transcription pattern this repo follows: one AgentSession per participant, inference.STT, on_user_turn_completed, StopResponse, room_io.RoomOptions. https://github.com/livekit/agents/blob/main/examples/other/transcription/multi-user-transcriber.py
- LiveKit Docs, Server options: AgentServer parameters, the @server.rtc_session(agent_name=...) entrypoint decorator (naming requires explicit dispatch), setup_fnc prewarm pattern, ServerType.ROOM vs PUBLISHER, console/dev/start modes. https://docs.livekit.io/agents/server/options/
- LiveKit Docs, Agent dispatch: explicit vs token vs automatic dispatch; explicit dispatch via agent_dispatch.create_dispatch; token dispatch only fires on room creation; sub-150 ms placement. https://docs.livekit.io/agents/server/agent-dispatch/
- LiveKit Docs, RoomComposite & web egress: recording the room composite; recording is tied to room lifecycle and stops when participants leave; the output / file_outputs field requirement. https://docs.livekit.io/home/egress/room-composite/
- LiveKit Python SDK, egress_service API: start_room_composite_egress(RoomCompositeEgressRequest) -> EgressInfo; audio_only, layout, and output options. https://docs.livekit.io/python/livekit/api/egress_service.html
- LiveKit Docs, Agent deployment quickstart: lk agent create registers the agent and writes livekit.toml; lk agent deploy builds and ships the image; Cloud build service builds from the Dockerfile. https://docs.livekit.io/deploy/agents/quickstart/
- LiveKit Docs, Pipeline types: STT-LLM-TTS vs realtime vs half-cascade; realtime models do not emit interim transcripts; sub-one-second latency target. https://docs.livekit.io/agents/models/pipelines/
- LiveKit, Pricing: Build $0, Ship $50/mo, Scale $500/mo; WebRTC minutes, agent session minutes, and downstream transfer metered separately; model costs billed on top. Accessed 2026-06-02. https://livekit.io/pricing
- OpenAI API, Chat Completions / JSON mode: response_format={"type": "json_object"} constrains the model to valid JSON, used for the structured end-of-meeting summary. https://platform.openai.com/docs/guides/structured-outputs
- Recall.ai, Meeting bot API and pricing: turnkey bot that joins Zoom/Meet/Teams for transcripts and recordings; the "buy" alternative to this repo. https://www.recall.ai/


