Published 2026-06-02 · 24 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

Meetings are where a copilot agent earns its keep in a hurry, because the work around a meeting — taking notes, hunting for the right file, remembering who promised what, writing the recap — is exactly the dull, error-prone work people most want off their plate. If you build conferencing, sales, telemedicine, or e-learning software, your users now expect this help inside the call, not bolted on afterward, and they will judge your product by how useful and how trustworthy the copilot feels. You do not have to write the pipeline yourself, but you do need to know enough to tell a sound design from a reckless one — to ask whether the agent answers fast enough to be welcome, whether it asked permission to listen, and whether a human approves before it sends anything. This is the second of three applied agent lessons; it reuses the same skeleton as the video investigator agent, pointed at a live conversation instead of recorded footage.

What A Meeting Copilot Agent Is — And Is Not

Start with the thing most people already know, because the copilot is often confused with it. A meeting notetaker is a tool that records a call, transcribes it, and writes a summary afterward — the job done by Otter, Fireflies, and the recap feature built into most conferencing apps. It is useful, and we cover that whole landscape in the AI meeting transcription lesson. But a notetaker is passive and works mostly after the fact: it listens, then it reports. It does not decide to do anything.

A meeting copilot agent is the active version. The word agent points at a specific difference: it does not just transcribe the meeting, it pursues a goal during the meeting by taking actions. When the conversation turns to last quarter's numbers, the copilot retrieves them and shows them. When someone says "let's have Maria own the rollout by Friday," the copilot writes that down as a task assigned to Maria, due Friday, and offers to put it in the project tracker. When the call ends, it has already drafted the follow-up email. The notetaker remembers the meeting; the agent helps run it and acts on it.

It also helps to fix the boundary on the other side. A copilot agent is not an autonomous bot that joins your calls and makes decisions on its own. The whole design rests on a simple rule we will return to: the agent proposes, a person decides. It can draft the email, but a human presses send; it can suggest the answer to an objection, but a salesperson chooses whether to use it. In 2026 this active-but-supervised shape is what shipped at scale. Microsoft's Facilitator agent in Teams takes notes, does light timekeeping, and turns spoken commitments into assigned tasks, while Zoom's AI Companion 3.0 detects action items during the call and can schedule the follow-up. In both, a human stays in control of anything that leaves the room.

Figure 1. Three things people call "AI in the meeting." A notetaker reports after the fact; a suggestion feed offers passive prompts; the copilot agent plans, calls tools, and acts. This lesson is about the third.

The Copilot Loop

Every agent runs a loop — perceive, reason, act, observe — and we drew that loop in the agent-loop lesson. The meeting copilot is that loop pointed at a live conversation, and it turns many times a minute. Walk one turn slowly and the loop becomes concrete.

The copilot perceives by listening: a speech-to-text tool turns the live audio into a running transcript, tagged by speaker, so the agent always has the last few sentences of the conversation in front of it. It reasons about that transcript with one question — is there something useful I could do right now? Most of the time the answer is no, and the copilot stays quiet, which is itself a skill. When the answer is yes, it plans the smallest helpful step: "the rep was just asked about the integration with Salesforce, and there is a one-line answer in our docs — retrieve it." It acts by calling a tool, the retriever, with that query. The result — the doc snippet — lands back in the agent's working memory as a new observation. The copilot reasons again: is this worth surfacing, and to whom? It shows the snippet quietly to the rep, not to the whole room. The loop turns again on the next sentence.

Notice what each turn needed. It needed planning to choose whether and what to do, tool use to retrieve or write, and memory to hold the goal of the meeting and recall the context. Those are exactly the three primitives from the agent primitives lesson; the meeting copilot is where they run in real time, under a clock. And the loop always ends the same way for anything consequential: the agent writes a draft and a person approves it. The copilot may propose the follow-up email, the CRM update, the assigned task — but it does not send, save, or assign on its own.
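One turn of the loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `Copilot` and `Draft` classes, the `retrieve` callable, and the "act only on questions" rule inside `plan` are all hypothetical stand-ins for a real transcriber, retriever, and language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Draft:
    """Something the copilot proposes. A person must approve it before use."""
    kind: str
    body: str
    approved: bool = False


@dataclass
class Copilot:
    retrieve: Callable[[str], str]                        # hypothetical retriever tool
    transcript: List[str] = field(default_factory=list)   # working memory
    drafts: List[Draft] = field(default_factory=list)

    def plan(self, line: str) -> Optional[str]:
        """Reason: is there something useful to do right now?

        Placeholder rule (an assumption): only questions trigger a lookup.
        A real copilot would ask a model instead.
        """
        return line if line.rstrip().endswith("?") else None

    def turn(self, line: str) -> Optional[Draft]:
        self.transcript.append(line)       # perceive
        query = self.plan(line)            # reason
        if query is None:
            return None                    # staying quiet is the common case
        snippet = self.retrieve(query)     # act: call the tool
        draft = Draft("card", snippet)     # observe the result, then propose
        self.drafts.append(draft)
        return draft


copilot = Copilot(retrieve=lambda q: f"doc snippet for: {q}")
copilot.turn("Thanks for joining.")                       # no action taken
card = copilot.turn("Does it integrate with Salesforce?")
```

Note that `turn` only ever returns a `Draft` with `approved=False`: in this sketch the loop has no way to send or save anything, which mirrors the rule that the agent proposes and a person decides.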

A circular agent-loop diagram for a live meeting copilot showing the steps listen via speech-to-text, decide whether to help, plan the smallest useful step, call a tool to retrieve or write, observe the result, and consult memory, looping many times a minute, with the loop exiting to a draft that a human approves before anything is sent Figure 2. One meeting as a fast loop. The copilot listens, decides whether to act, plans the smallest step, calls a tool, and remembers context — turning many times a minute — then drafts a result a person approves before anything leaves the room.

The Tool Belt — What The Copilot Can Actually Do

An agent is only as capable as the tools you give it, and the art is giving it few, sharp ones rather than many dull ones. A meeting copilot needs a small belt, and most of the tools are pieces we covered earlier in this section.

The first tool is the live transcriber — a streaming speech-to-text model that turns audio into text as it is spoken, and tags each line with who said it (a step called speaker diarization). This is the copilot's ear, and it never stops running; everything else depends on it. The production options are the streaming engines from the streaming ASR lesson.

The second tool is the retriever — a way to pull a fact the meeting needs out of the company's knowledge: the CRM record for this customer, the slide that has last quarter's numbers, the support ticket that is being discussed, the answer in the product docs. This is the same retrieval machinery as multimodal RAG over an archive, pointed at documents and records instead of video. The retriever is what lets the copilot answer "what did we quote them in Q3?" without anyone leaving the call.

The third tool is the action-item writer — the agent detecting that a commitment was made ("Maria owns the rollout, Friday") and turning it into a structured task with an owner and a due date. On its own this is just text; its value comes from the next tool.

The fourth tool is the task and calendar connector — a write action into the systems the team actually uses: create the task in the project tracker, update the deal stage in the CRM, propose a follow-up meeting on the calendar. This is the tool that turns talk into done, and it is also the most dangerous one, because it changes real systems — which is why every write goes through a human's approval.

The fifth tool is the responder, used only by copilots that speak or chat back. For a spoken copilot, this is a text-to-speech voice (the streaming engines from the TTS lesson); for a silent one, it is a card in the sidebar. This is how the copilot communicates a result, and whether it speaks aloud to everyone or whispers to one person is a design choice with real consequences, covered below.

The sixth tool is the recap writer — the agent assembling the transcript, the decisions, the action items, and the open questions into a short summary after the call. This is not a real-time tool at all; it runs once, at the end, when there is no clock to beat. It is the notetaker's old job, now done by the same agent that helped during the meeting.

| Tool | What it does | When it runs | Built from |
|---|---|---|---|
| Live transcriber | Speech to tagged text, live | Continuously | Streaming ASR + diarization |
| Retriever | Pull a fact from CRM / docs / past calls | On demand, live | Multimodal RAG |
| Action-item writer | Spot a commitment, structure it | On demand, live | LLM extraction |
| Task / calendar connector | Write to tracker, CRM, calendar | After human approval | App integrations |
| Responder (voice or card) | Communicate a result | When the copilot helps | Streaming TTS or UI |
| Recap writer | Summarize decisions + actions | Once, after the call | LLM summarization |
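To make the relationship between the action-item writer and the task connector concrete, here is a minimal sketch under stated assumptions. The `ActionItem` shape, the `ApprovalRequired` exception, and the `tracker_write` callable are hypothetical names invented for illustration; they do not correspond to any particular tracker's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional


@dataclass(frozen=True)
class ActionItem:
    """A spoken commitment turned into a structured task."""
    owner: str
    task: str
    due: date


class ApprovalRequired(Exception):
    """Raised when a write is attempted without a human's approval."""


def write_task(
    item: ActionItem,
    approved_by: Optional[str],
    tracker_write: Callable[[ActionItem], None],  # hypothetical connector
) -> None:
    """Write to the real system only after a named person has approved."""
    if not approved_by:
        raise ApprovalRequired(f"'{item.task}' needs a human approval first")
    tracker_write(item)


# "Let's have Maria own the rollout by Friday" becomes a held draft:
item = ActionItem(owner="Maria", task="Own the rollout", due=date(2026, 6, 5))
```

Putting the check inside the write function, rather than in the user interface, means a code path that forgets to ask for approval fails loudly instead of quietly changing a real system.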

A common and expensive mistake is tool overload: handing the agent thirty connectors because each one seemed useful. The more tools an agent can choose from, the more often it picks the wrong one or stalls deciding, and accuracy on tool selection drops as the menu grows. Start with the six above, and add a seventh only when real meetings prove it is missing.

Figure 3. The copilot's tool belt, grouped by when each tool runs: the transcriber runs continuously, the retriever and writers run on demand during the call, and the recap writer runs once at the end. The write tools fire only after a human approves.

The Real-Time Constraint — Why The Clock Decides The Design

The single most important idea in this lesson is that a meeting copilot lives under a clock, and the clock decides what the agent can do live versus what it must save for later. Understanding the budget is what separates a copilot people welcome from one they mute.

Do the arithmetic on a spoken copilot — one that talks back in the call. In natural human conversation, the gap between one person finishing and the next person starting is about two-tenths of a second. People notice a pause past roughly half a second, and once the delay passes about one second they start talking over the other party. So a copilot that speaks has a budget of about one second, end to end, and that second has to cover four steps in sequence. The agent has to notice you finished speaking and turn your audio into text (speech-to-text, with end-of-turn detection, ≈ 300 ms), think of a reply (the language model's time to its first word, ≈ 500 ms), start turning that reply into audio (text-to-speech first sound, ≈ 250 ms), and get the audio across the network (≈ 150 ms). Add them up: 300 + 500 + 250 + 150 = 1,200 ms. That is already past the one-second line where people start interrupting — and it is why this is hard.
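The same arithmetic can be kept as a small budget check, so that swapping in a faster or slower stage shows at once whether the turn still fits. The stage figures are the approximate ones from the paragraph above; the function and constant names are illustrative.

```python
BUDGET_MS = 1_000  # roughly where people start talking over the other party

# Approximate per-stage figures from the text, in milliseconds.
STAGES_MS = {
    "speech_to_text_with_end_of_turn": 300,
    "llm_time_to_first_word": 500,
    "tts_first_sound": 250,
    "network": 150,
}


def total_latency_ms(stages: dict) -> int:
    """The stages run in sequence, so their latencies add up."""
    return sum(stages.values())


def overshoot_ms(stages: dict, budget_ms: int = BUDGET_MS) -> int:
    """How far past the budget the turn lands (0 if it fits)."""
    return max(0, total_latency_ms(stages) - budget_ms)


total = total_latency_ms(STAGES_MS)   # 1200
over = overshoot_ms(STAGES_MS)        # 200
```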

This is not a worst case; it is roughly the middle of the road. Across millions of real voice-agent calls in 2025–2026, the published median total latency for this kind of cascade (three separate models chained in sequence: speech-to-text, then language model, then text-to-speech) sits around 1.4 to 1.7 seconds, with the slowest one-in-a-hundred turns running 3 to 5 seconds. There are two ways out, and both matter for design. The first is a single speech-to-speech model that takes audio in and gives audio out without the three-step relay; these reach 320 to 800 milliseconds end to end, inside the one-second budget, and we cover them in the speech-to-speech lesson. The second is to make most copilots silent, so that they surface a card rather than a voice. The one-second turn-taking rule then does not apply at all, and the only thing the user feels is a suggestion appearing a beat after the relevant sentence.

Two more real-time details decide whether a spoken copilot feels human. The first is turn detection — knowing when you have actually finished talking rather than just paused to breathe. The 2026 answer combines a fast acoustic check that hears silence with a small language model (LiveKit ships a roughly 135-million-parameter one) that reads the words so far and predicts whether the sentence sounds finished. The second is barge-in — letting the user interrupt. When the user starts speaking while the copilot is talking, the system has to stop its own speech instantly, throw away the half-finished answer, and listen. A copilot that keeps talking over you is worse than no copilot at all.
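A rough sketch of both ideas follows. It is not LiveKit's API: the `TurnDetector` and `Speaker` classes and every threshold are assumptions chosen for illustration, and the end-of-turn probability is taken as an input that a small language model would supply.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TurnDetector:
    """Combine an acoustic silence check with a model's end-of-turn guess."""
    min_silence_ms: int = 300     # assumed: below this, treat as a breath
    long_silence_ms: int = 1500   # assumed: above this, give up waiting
    eot_threshold: float = 0.5    # assumed cut-off on the model's probability

    def user_finished(self, silence_ms: int, eot_probability: float) -> bool:
        if silence_ms < self.min_silence_ms:
            return False                      # still mid-sentence
        if silence_ms >= self.long_silence_ms:
            return True                       # silence alone is decisive
        return eot_probability >= self.eot_threshold


@dataclass
class Speaker:
    """Holds the copilot's pending speech so it can be dropped on barge-in."""
    queue: List[str] = field(default_factory=list)
    speaking: bool = False

    def say(self, chunks: List[str]) -> None:
        self.queue = list(chunks)
        self.speaking = True

    def on_user_speech(self) -> None:
        """Barge-in: stop at once and discard the half-finished answer."""
        self.queue.clear()
        self.speaking = False
```

The two-signal rule is what keeps a short pause after "so what I was thinking is" from being treated as the end of a turn: the silence check passes, but the words do not sound finished.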

The clock also splits the work in two, and this split is the whole production design. The in-meeting lane (transcription, retrieval, the occasional live prompt) runs under the one-second budget and only does the cheap, fast things. The after-meeting lane (the full recap, the polished follow-up email, the CRM updates, the long-form action plan) runs once the call ends, with no clock, where the agent can take its time and use a bigger, slower model. Trying to do after-meeting-quality work during the meeting is the most common way to build a copilot that lags and frustrates. The cost side of running models continuously is worked through in the real cost of AI in video products lesson; the budget lesson there is the cost twin of the latency lesson here.

| | In-meeting (live) | After-meeting (batch) |
|---|---|---|
| Time budget | Under ~1 second per turn | Minutes, no clock |
| What runs | Transcribe, retrieve, short prompts | Full recap, follow-up email, CRM writes |
| Model size | Small and fast | Large and thorough |
| If it speaks | Must beat the 1 s turn-taking line | N/A (text, read later) |
| Failure mode if mixed up | Copilot lags and talks over people | None; this is the right place for heavy work |
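The two-lane split above can be expressed as a small routing table. This is a sketch only; the job names and the `route` function are hypothetical labels for the work listed in the table, not part of any framework.

```python
from enum import Enum


class Lane(Enum):
    LIVE = "in-meeting"
    BATCH = "after-meeting"


# Hypothetical job names, mapped to the lane the table assigns them.
LANE_BY_JOB = {
    "transcribe": Lane.LIVE,
    "retrieve": Lane.LIVE,
    "short_prompt": Lane.LIVE,
    "full_recap": Lane.BATCH,
    "follow_up_email": Lane.BATCH,
    "crm_write": Lane.BATCH,
}


def route(job: str) -> Lane:
    """Pick a lane for a job. Unknown jobs go to the batch lane by default."""
    return LANE_BY_JOB.get(job, Lane.BATCH)
```

Defaulting unknown jobs to the batch lane is a deliberate choice in this sketch: a new, unclassified task can then never slow down the live conversation by accident.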

Figure 4. The spoken-copilot latency budget. A three-step cascade lands around 1,200 ms, past the ~1,000 ms line where people start talking over the agent. A single speech-to-speech model fits the budget; making the copilot silent removes the clock entirely.

Memory — Why The Copilot Has To Remember

A copilot that forgets everything between meetings is a transcription tool, not an agent. The difference is memory, and a meeting copilot needs all four kinds we named in the agent primitives lesson.

Its working memory is the meeting happening right now (the running transcript, the decisions made so far, the questions still open), and it lives in the model's context window for the length of the call. Its semantic memory is the standing knowledge the meeting needs: who this customer is, what they bought, the deal stage in the CRM, the product facts in the docs. Its episodic memory is the record of past meetings with these same people ("last call they pushed back on price; the quote we sent was $48,000"), so the copilot walks in already knowing the history. Its procedural memory is the playbook: the standing routine the copilot follows for, say, a sales discovery call versus a customer support call versus a project standup. The first two make the copilot competent in a single meeting; the last two are what make it feel like it has been working this account for months.

Episodic memory is the one teams most often skip, and skipping it is why so many meeting tools feel amnesiac: every call starts from zero, and the human has to re-explain the context the tool should already hold. A copilot that reads back "in your last two calls, the open question was the security review — has that closed?" is doing the one thing a sharp human assistant would do — remembering the relationship, not just the meeting.
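One way to picture the four kinds of memory is as four fields on a single object. The sketch below is an assumption-laden illustration: the `CopilotMemory` class, its field types, and its method names are invented here, and a real system would back the semantic and episodic parts with a CRM and a searchable store rather than in-process lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CopilotMemory:
    working: List[str] = field(default_factory=list)        # this meeting's transcript
    semantic: Dict[str, str] = field(default_factory=dict)  # standing facts (CRM, docs)
    episodic: List[str] = field(default_factory=list)       # notes from past meetings
    procedural: str = "generic"                              # which playbook to follow

    def hear(self, line: str) -> None:
        self.working.append(line)

    def opening_context(self) -> str:
        """What the copilot knows before anyone has spoken."""
        history = "; ".join(self.episodic) or "no prior meetings on record"
        facts = "; ".join(f"{k}: {v}" for k, v in self.semantic.items()) or "none"
        return f"Playbook: {self.procedural}. History: {history}. Facts: {facts}."

    def end_meeting(self, summary: str) -> None:
        """Carry a summary forward as episodic memory; drop the live transcript."""
        self.episodic.append(summary)
        self.working.clear()
```

The `end_meeting` step is the part teams tend to skip: without it, the next call's `opening_context` has no history to read back.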

The Hard Part — What The Law Will Not Let The Copilot Do

Conferencing is where agent engineering meets privacy law head-on, because the copilot is listening to people talk, and in many places you cannot record people without their permission. Getting this wrong is not a minor bug. There are three lines to respect, and a sound design clears all three by construction rather than by hoping.

The first line is recording consent. In the United States, federal law lets one party to a conversation consent to recording it, but eleven states, including California, Florida, Illinois, Pennsylvania, and Washington, require every participant to consent. California's Invasion of Privacy Act (Penal Code §§ 631 and 632) is the strict template, and its penalties are real: $5,000 per violation, or three times the actual damages. The catch that surprises engineers is that a recording bot simply appearing in the participant list is not, on its own, legally sufficient consent: silence is not agreement. This is not theoretical in 2026: the consolidated In re Otter.ai privacy litigation in the Northern District of California, with a motion-to-dismiss hearing on 20 May 2026, turns on exactly this question of whether a notetaker recorded people who never agreed. The safe design asks for clear, affirmative consent from everyone before recording starts, and logs it.
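The "ask everyone, get a clear yes, log it" design can be sketched as a small consent ledger. This is an engineering illustration, not legal advice, and the `ConsentLedger` class and its methods are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Set


@dataclass
class ConsentLedger:
    participants: Set[str] = field(default_factory=set)
    consents: Dict[str, datetime] = field(default_factory=dict)  # who said yes, and when

    def join(self, who: str) -> None:
        """A participant enters the call. They have not consented yet."""
        self.participants.add(who)

    def record_yes(self, who: str) -> None:
        """Log an explicit, affirmative yes with a UTC timestamp."""
        if who not in self.participants:
            raise ValueError(f"{who} is not in the meeting")
        self.consents[who] = datetime.now(timezone.utc)

    def may_record(self) -> bool:
        """True only when every current participant has said yes."""
        return bool(self.participants) and all(
            p in self.consents for p in self.participants
        )
```

Because `may_record` is recomputed from the current participant set, a late joiner flips it back to false until they also agree, so presence in the participant list is never treated as consent.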

The second line is platform rules, which tightened sharply in early 2026 precisely because of the consent problem. In March 2026, Google Meet began flagging third-party notetaker bots as a potential risk and defaulting to deny their entry; the same month, Microsoft notice MC1251206 announced that Teams would label external meeting bots "Unverified" in the lobby and require the organizer to admit them explicitly, rolling out to general availability by mid-2026. The engineering consequence is concrete: the era of quietly joining a call as an unannounced bot is closing. The durable path is to build on the platform's official media interfaces — the way Read.ai launched on the Google Meet Media API in March 2026 — or to put the copilot inside your own conferencing app, where you control consent and disclosure directly.

The third line is the EU AI Act, and it has two parts that bear on meetings. The transparency rule, Article 50, requires that people be told when they are interacting with an AI system, and separately requires that anyone subjected to an emotion-recognition system be informed of it; these obligations apply from 2 August 2026. The far sharper rule is Article 5(1)(f), which prohibits AI that infers people's emotions in the workplace and in education institutions, with narrow exceptions only for medical or safety reasons. That ban has applied since 2 February 2025 and carries fines of up to €35 million or 7% of global annual turnover, whichever is higher. For a meeting copilot this draws a bright line. Transcribing what was said, retrieving a fact, and writing the action items is ordinary, allowed help. Scoring how engaged, stressed, or persuaded the participants seem from their faces or voices ("the prospect sounded frustrated," "this employee looks disengaged") is emotion recognition in a workplace, and in the EU it is off the table. The same biometric boundary is drawn in detail in the face detection under the EU AI Act lesson; a copilot should respect it by design, staying at the level of what was said and decided, never how people felt.

Above all three lines sits the design principle that makes them easier to satisfy: the agent proposes, a person decides. The copilot drafts the email but a human sends it; suggests the task but a human assigns it; surfaces the answer but a human chooses to use it. This keeps a person responsible for every consequential action — which is exactly what regulators and customers want — and it is why every figure in this lesson ends on an approval step rather than an automatic one.

| What the copilot does | The rule that applies | Practical design |
|---|---|---|
| Records the call | All-party consent in 11 US states; bot presence ≠ consent | Ask everyone, get clear yes, log it |
| Joins the meeting | Platform bot controls (Google Meet, Teams, 2026) | Use official media APIs or your own app |
| Tells people it is AI | EU AI Act Article 50 transparency (from 2 Aug 2026) | Disclose the AI and the recording up front |
| Reads participants' emotions | EU AI Act Article 5: prohibited in the workplace | Do not build it; stay at words and decisions |
| Sends, assigns, or updates | Removes human oversight | Always gate behind a human's approval |

A Worked Meeting, End To End

Tie the pieces together with one sales call. The copilot joins, and the first thing it does is disclose itself and ask all participants to consent to recording; it does not start listening until they agree. Once it has consent, the live transcriber runs continuously, tagging each speaker. The copilot pulls the account's history from episodic memory: two prior calls, an open security-review question, a $48,000 quote sent in Q3. Ten minutes in, the prospect says, "your price is higher than the other vendor we're looking at." The copilot plans a single helpful step, calls the retriever for the approved objection-handling note and the value comparison, and surfaces them as a card to the rep alone, not aloud and not to the prospect. It does not comment on whether the prospect sounded annoyed; that would be emotion recognition, and it is not built. Near the end, someone says, "we'll send the revised proposal by Thursday." The action-item writer captures that as a task (owner: the rep, due Thursday) and holds it. The call ends. Now the after-meeting lane runs with no clock: the recap writer drafts the summary, the follow-up email, and a CRM update moving the deal to "proposal sent," and presents all three to the rep, who edits and approves them. Every primitive fired; the live work stayed inside the one-second budget; consent was taken; no emotion was scored; and a human pressed every button that touched a real system.

Build, Buy, Or Wrap

You have three honest options, and the right one depends on how central this feature is to your product. You can buy a finished copilot — turn on Microsoft 365 Copilot, Zoom AI Companion, or an Otter-style notetaker — which is fastest and right when the copilot is a convenience, not your product. You can wrap: drop a real-time agent into your own conferencing app using a framework like the LiveKit agents we cover in the LiveKit AI meeting assistant lesson, so you keep your own video stack and control consent, disclosure, and which tools the agent can call. Or you can build the pipeline from the primitives in this section — streaming ASR, a retriever, an LLM, TTS, an agent framework from the framework lesson — which is right when the copilot's behavior is your product, as it is for a specialized sales coach or a clinical scribe. Most teams should start by buying or wrapping to learn what their users actually want help with, then build only the parts the off-the-shelf copilots get wrong for their domain.

Where Fora Soft Fits In

We build video products across video conferencing, WebRTC real-time apps, telemedicine, e-learning, streaming, OTT, surveillance, and AR/VR, and the meeting copilot sits squarely in the conferencing and real-time work we ship. When a client wants a copilot inside their own calls, our design discipline is the one in this lesson: respect the one-second clock by keeping live work small and saving heavy work for after the call, give the agent a small sharp tool set, take consent before listening, disclose the AI, and put a human on every action that leaves the room. We keep the copilot at the level of what was said and decided rather than how participants felt, so a product clears the EU AI Act's bright lines by design instead of by configuration. The same skeleton serves a telemedicine intake scribe or an e-learning session assistant without rebuilding the agent for each vertical.

Talk To Us · See Our Work · Download

  • Talk to a video engineer — scope a meeting-copilot or real-time agent feature for your conferencing app: /services/webrtc-development
  • See our case studies — conferencing, WebRTC, and real-time AI work: /portfolio
  • Download the Meeting-Copilot scoping & guardrails checklist — the loop, the tool belt, the latency budget, and the consent and EU AI Act lanes on one page: Download the checklist

References

  1. Microsoft — "Catch up on meetings with Microsoft 365 Copilot in Teams" and "Facilitator in Microsoft Teams meetings" (Microsoft Support, accessed June 2026) — https://support.microsoft.com/en-us/office/facilitator-in-microsoft-teams-meetings-37657f91-39b5-40eb-9421-45141e3ce9f6 — tier 4 (product deployer). Source for the Facilitator agent (real-time notes, light timekeeping, @mention to create/update/query tasks synced to a meeting plan) and in-meeting Copilot Q&A ("What are the action items?").
  2. Zoom — "Zoom launches AI Companion 3.0 with agentic workflows, transforming conversations into action" (Zoom Newsroom, 2026) — https://news.zoom.com/zoom-launches-ai-companion-3-0/ — tier 4 (product deployer). Source for agentic meeting workflows, live notes during a meeting, and Zoom Tasks auto-detecting and completing action items (scheduling follow-ups, generating documents).
  3. LiveKit — "Voice Agent Architecture: STT, LLM, and TTS Pipelines Explained" — https://livekit.com/blog/voice-agent-architecture-stt-llm-tts-pipelines-explained — tier 4 (vendor engineering). Source for the three-stage cascade pipeline and the dominance of the cascade approach (~90% of production agents in 2026) versus native speech-to-speech.
  4. LiveKit — "Turn Detection for Voice Agents: VAD, Endpointing, and Model-Based Detection" — https://livekit.com/blog/turn-detection-voice-agents-vad-endpointing-model-based-detection — tier 4 (vendor engineering). Source for the two-signal turn-detection model (acoustic VAD + a ~135M-parameter language model predicting end-of-turn) and barge-in handling.
  5. LiveKit — "Sequential Pipeline Architecture for Voice Agents" — https://livekit.com/blog/sequential-pipeline-architecture-voice-agents — tier 4 (vendor engineering). Source for end-to-end latency figures: cascade medians ~1.4–1.7 s with p99 ~3–5 s; native speech-to-speech ~320–800 ms; conversational turn-taking breaking down above ~1–2 s.
  6. European Union — Regulation (EU) 2024/1689 (Artificial Intelligence Act), Article 5 (Prohibited AI Practices), Art. 5(1)(f) — https://artificialintelligenceact.eu/article/5/ — tier 1 (primary legislation). The prohibition on AI inferring emotions in the workplace (and education), with narrow medical/safety exceptions; Chapter II applicable since 2 February 2025.
  7. European Union — Regulation (EU) 2024/1689 (Artificial Intelligence Act), Article 50 (Transparency Obligations) — https://artificialintelligenceact.eu/article/50/ — tier 1 (primary legislation). Art. 50(1) requires informing people they interact with an AI system; Art. 50(3) requires informing people exposed to emotion-recognition systems; date of application 2 August 2026 (per Article 113).
  8. Future of Privacy Forum — "Red Lines under the EU AI Act: Unpacking the Prohibition of Emotion Recognition in the Workplace and Education Institutions" — https://fpf.org/blog/red-lines-under-eu-ai-act-unpacking-the-prohibition-of-emotion-recognition-in-the-workplace-and-education-institutions/ — tier 3 (legal analysis). Source for the cumulative conditions of the Art. 5(1)(f) workplace emotion-recognition ban and its scope.
  9. Recording Law — "AI Meeting Recording Laws by State (2026)" and "California Recording Laws: All-Party Consent" — https://www.recordinglaw.com/us-laws/ai-meeting-recording-laws/ — tier 3 (legal reference). Source for the eleven all-party-consent states, the CIPA (Cal. Penal Code §§ 631, 632) all-party requirement, and CIPA penalties ($5,000 per violation or treble damages).
  10. Reed Smith — "The legality of AI-powered recording and transcription" (Employment Law Watch) — https://www.reedsmith.com/our-insights/blogs/employment-law-watch/102ls2n/the-legality-of-ai-powered-recording-and-transcription/ — tier 3 (legal analysis). Source for the federal one-party ECPA floor (18 U.S.C. § 2511), the stricter state all-party rules, and that a visible bot is not legally sufficient consent.
  11. UC Today — "Microsoft, Zoom and Google Tighten Meeting Bot Controls as Otter Case Nears Hearing" — https://www.uctoday.com/security-compliance-risk/ai-meeting-bots-controls-microsoft-zoom-google/ — tier 4 (industry reporting). Source for Google Meet flagging/denying third-party bots (March 2026), Microsoft Teams notice MC1251206 labeling external bots "Unverified," and the In re Otter.ai motion-to-dismiss hearing on 20 May 2026.
  12. LiveKit — "Voice agents" product documentation — https://livekit.com/voice-agents — tier 4 (vendor engineering). Source for the real-time agent-in-the-call deployment model used in the build-buy-wrap section.