Published 2026-06-02 · 21 min read · By Nikolay Sapunov, CEO at Fora Soft

Why this matters

If you are choosing a meeting-notes tool, the marketing pages will not tell you what you are really buying, because every product lists the same features — summaries, action items, CRM sync, chat-with-your-meeting. The decision that matters is architectural, and it is invisible from the pricing page. If you are building software — a sales platform, a telehealth app, a recruiting tool — the more pressing question is whether to keep paying these tools per seat or to put transcription inside your own product, and that choice has a clear cost and engineering shape once you see how the bots work. This lesson is the engineering companion to our meeting-transcription landscape overview: the landscape lesson maps the whole market into four patterns, and this one opens up the four most-searched bots and the build paths behind them. It is written so a product lead can follow every step, and a senior engineer can still trust every claim.

The anatomy every meeting bot shares

Before pulling the four products apart, it helps to see the assembly line they all run, because the differences only make sense against the shared design. A meeting bot does five things in order, and naming them keeps the rest of the lesson concrete.

First, it joins the meeting. Most bots do this the way a person would: a piece of software opens the meeting link in a web browser and clicks "join." The twist is that the browser has no screen, no mouse, and no human — it is what engineers call a headless browser, a real browser engine running on a server, driven by code instead of hands. To the meeting it looks like one more guest who kept their camera off; that is why the same bot can walk into Zoom, Google Meet, and Microsoft Teams without custom work for each.

Second, it captures the audio. Once inside the call, the bot receives everyone's microphone audio over WebRTC — the real-time audio-and-video technology browsers use for video calls, standardized by the W3C. A useful detail here: WebRTC tags each person's audio with a stream identifier, so a well-built bot can keep "this is Maria talking" separate from "this is Sam talking" instead of receiving one blended track. That per-speaker separation is what makes accurate speaker labels possible later.
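As a minimal sketch of that per-speaker separation, assume the bot has already decoded incoming audio into frames that carry the stream identifier of the track they arrived on. The `AudioFrame` type, its field names, and the identifier-to-name mapping are hypothetical illustrations, not part of WebRTC or of any vendor SDK.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFrame:
    """One decoded chunk of audio (hypothetical shape, for illustration only)."""
    stream_id: str   # identifier attached to this participant's track
    start_ms: int    # position of the chunk on the meeting timeline
    samples: bytes   # raw audio payload


def split_by_speaker(frames):
    """Group frames into one time-ordered track per stream instead of one blended track."""
    tracks = defaultdict(list)
    for frame in frames:
        tracks[frame.stream_id].append(frame)
    for track in tracks.values():
        track.sort(key=lambda f: f.start_ms)
    return dict(tracks)


frames = [
    AudioFrame("stream-a", 0, b"\x01\x02"),
    AudioFrame("stream-b", 10, b"\x03\x04"),
    AudioFrame("stream-a", 20, b"\x05\x06"),
]
# Assumed mapping from stream identifier to display name (e.g. from the participant list).
names = {"stream-a": "Maria", "stream-b": "Sam"}

tracks = split_by_speaker(frames)
for stream_id, track in tracks.items():
    print(names[stream_id], [f.start_ms for f in track])
```

Keeping the tracks apart this early means later stages can label a line with the speaker it came from, without having to untangle voices from a mixed recording.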

Third, it runs speech recognition, the step that turns audio into raw text. The technical name is automatic speech recognition, or ASR. This is a commodity now — the same handful of engines sit under most products on the market, which is exactly why none of these four wins on raw accuracy. We cover the engines themselves in the streaming ASR lesson.

Fourth, it adds understanding — two layers stacked on the text. One layer is diarization, an awkward word that simply means labelling who said each line. The other is the summary: a large language model reads the full transcript and writes a recap plus a list of action items. When a feature page says "AI meeting summary," this layer is what it means.
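One common way to attach speaker labels to text is to align each recognized segment with the speaker turn it overlaps most in time. The sketch below assumes a diarization step has already produced the turns; it does not model how those turns are computed. The `Segment` and `Turn` types are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float
    text: str


@dataclass
class Turn:
    start: float  # seconds
    end: float
    speaker: str


def overlap(seg: Segment, turn: Turn) -> float:
    """Length in seconds of the time shared by a segment and a turn (0 if none)."""
    return max(0.0, min(seg.end, turn.end) - max(seg.start, turn.start))


def label_segments(segments, turns):
    """Give each segment the speaker whose turn overlaps it most; 'unknown' if none does."""
    labelled = []
    for seg in segments:
        best = max(turns, key=lambda t: overlap(seg, t), default=None)
        if best is None or overlap(seg, best) == 0.0:
            labelled.append(("unknown", seg.text))
        else:
            labelled.append((best.speaker, seg.text))
    return labelled


segments = [
    Segment(0.0, 2.5, "Let's start with pricing."),
    Segment(2.5, 5.0, "Sure, I have the numbers."),
    Segment(9.0, 10.0, "Anything else?"),
]
turns = [Turn(0.0, 2.4, "Maria"), Turn(2.4, 5.2, "Sam")]

labelled = label_segments(segments, turns)
for speaker, text in labelled:
    print(f"{speaker}: {text}")
```

The labelled transcript is what the summary model then reads, so an alignment mistake here shows up later as an action item assigned to the wrong person.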

Fifth, it delivers the result — pushing the transcript, summary, and action items into the tools your team already uses (a CRM like Salesforce or HubSpot, a docs tool like Notion, a chat tool like Slack), and, for the build-it products, firing a webhook — an automated message that tells your software "the meeting is done, here is the data."
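To make the webhook step concrete, here is a minimal sketch of the receiving side. The payload shape (`event`, `summary`, `action_items`, and so on) and the destination names are assumptions made up for illustration; real providers define their own schemas, so treat this as a pattern rather than any vendor's format.

```python
import json


def plan_deliveries(raw_body: bytes) -> list[tuple[str, str]]:
    """Turn one webhook payload into (destination, message) pairs."""
    event = json.loads(raw_body)
    if event.get("event") != "meeting.completed":
        return []  # this sketch only handles the "meeting is done" notification
    deliveries = [("docs", event["summary"])]
    for item in event.get("action_items", []):
        deliveries.append(("chat", f"{item['owner']}: {item['task']}"))
    return deliveries


# Hypothetical payload a bot provider might POST to your endpoint.
payload = json.dumps({
    "event": "meeting.completed",
    "meeting_id": "m-123",
    "summary": "Agreed to send revised pricing by Friday.",
    "action_items": [{"owner": "Sam", "task": "Send revised pricing"}],
}).encode("utf-8")

deliveries = plan_deliveries(payload)
for destination, message in deliveries:
    print(destination, "->", message)
```

In a real service this function would sit behind an HTTP endpoint, and each pair would be handed to the matching integration client.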

Horizontal pipeline diagram titled the anatomy of a meeting bot, showing five left-to-right stages connected by arrows. Stage one, Join, a headless browser or native API enters the call as a participant. Stage two, Capture, per-speaker audio arrives over WebRTC with stream identifiers keeping each voice separate. Stage three, Speech recognition (ASR), audio becomes raw text using a commodity engine. Stage four, Understanding, diarization labels who said each line and a language model writes the summary and action items. Stage five, Deliver, results are pushed to CRM, docs, and chat tools and a webhook notifies your software. A caption below notes that all four products share these five stages and differ mainly at stage one, the capture method, and at stage five, the integrations and business model.

Figure 1. The five stages every meeting bot runs. The four products in this lesson share stages two through four almost exactly; they differ at how they join (stage one) and how they package and sell the output (stage five).

Hold those five stages in mind. Each product below is a different set of choices about the same five steps.

Otter — the in-house speech engine, now in court

Otter is the giant of the category by search demand, and it is the most vertically integrated of the four. Where most note-takers rent their speech recognition from an outside provider, Otter built its own ASR engine, trained on a large corpus of audio it assembled over years, and pairs it with its own natural-language layer for summaries and its "OtterPilot" assistant. In practice this means Otter owns more of the five-stage pipeline than its rivals — the capture, the recognition, and the understanding are largely its own code rather than a thin wrapper over someone else's API.

Mechanically, Otter follows the standard bot pattern: it auto-joins meetings from your connected calendar across Zoom, Google Meet, and Microsoft Teams, transcribes in real time with live speaker labels and timestamps, captures shared slides, and produces a summary afterward. Its real-time path is a genuine strength — the transcript scrolls as people speak rather than appearing only after the call.

Otter's 2026 pricing has four tiers. Basic is free with 300 transcription minutes a month and a 90-minute cap per conversation. Pro is $16.99 per month, or $8.33 per month if you pay for a year up front, and raises the limit to 1,200 minutes a month. Business is $20 per user per month billed annually (or $30 month-to-month), removes the meeting cap, allows up to three meetings transcribed at once, and adds team workspaces and admin controls. Enterprise is custom-priced and is the only tier that unlocks API access, single sign-on, SOC 2 security attestation, and OtterPilot for Sales. The pattern to notice: the developer-relevant capability — API access — sits behind the most expensive door.

The reason Otter leads this lesson is not only its size. By 2026 it had become the legal test case for the entire category. A class-action suit, Brewer v. Otter.ai (N.D. Cal., filed August 2025), alleges that Otter's bot joins meetings and records people who never agreed to it, and that its speaker-recognition step builds voiceprints without consent. Four related suits were consolidated in October 2025, and the claims invoke the federal wiretap statute (ECPA), California's Invasion of Privacy Act, and Illinois's Biometric Information Privacy Act. Whatever the outcome, the case has already changed how enterprises evaluate every bot in this lesson, which is why the consent section below is not a footnote but a core engineering concern.

Fireflies — the assistant on top, and the voiceprint problem

Fireflies takes the opposite integration bet from Otter. Rather than owning the speech engine, it focuses its engineering on the layers above transcription: a conversational assistant called Fred (you query your meetings with "AskFred"), conversation analytics like talk-time ratios, and deep workflow automation into CRMs. Its bot, which appears in the participant list as "Fireflies.ai Notetaker," joins across the same platforms and follows the standard headless-browser pattern.

Two engineering details make Fireflies distinctive. The first is its AI-credits model. Plain transcription and recap emails are unlimited on paid tiers, but the higher-value AI actions — AskFred queries, custom summaries, CRM autofill, the Sales Assist and Voice Agent features — draw down a credit pool. Free and Pro plans include a one-time pool of about 20 credits, Business 30, Enterprise 50, with add-on bundles available. For a buyer this matters because the headline "unlimited" applies to transcription, not to the AI features people actually adopt the tool for. For a builder it is a useful lesson in metering: the cheap, commoditized step (transcription) is bundled, and the expensive step (LLM inference) is metered.
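The metering split can be sketched in a few lines: bundled actions always run, metered actions draw down a pool. This is an illustration of the pattern, not Fireflies' implementation. The action names and the one-credit cost per action are assumptions; only the pool size of about 20 comes from the text above.

```python
# Actions the text describes as unlimited on paid tiers.
BUNDLED_ACTIONS = {"transcribe", "recap_email"}
# Per-action credit costs are assumed for illustration.
METERED_ACTIONS = {"assistant_query": 1, "custom_summary": 1}


class CreditPool:
    def __init__(self, credits: int) -> None:
        self.remaining = credits

    def run(self, action: str) -> bool:
        """Return True if the action may proceed, spending credits when it is metered."""
        if action in BUNDLED_ACTIONS:
            return True  # the cheap, commoditized step is never charged
        cost = METERED_ACTIONS[action]
        if cost > self.remaining:
            return False  # pool exhausted: the expensive step is refused
        self.remaining -= cost
        return True


pool = CreditPool(20)  # the roughly 20-credit pool mentioned above
transcripts_ok = all(pool.run("transcribe") for _ in range(100))
queries_ok = [pool.run("assistant_query") for _ in range(21)]
print(transcripts_ok, pool.remaining, queries_ok.count(True))
```

A hundred transcriptions leave the pool untouched, while the twenty-first assistant query is refused, which is the gap between "unlimited" on the pricing page and what a heavy user experiences.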

The second detail is the one now drawing litigation. Fireflies' "Speaker Recognition" feature, which is what produces clean per-speaker labels, works by building a voiceprint — a mathematical fingerprint of a person's voice. In December 2025, Cruz v. Fireflies.AI was filed in Illinois on behalf of a person who was never a Fireflies customer; she joined a meeting where a host had enabled the bot, and the suit alleges the bot generated a voiceprint of her without consent, in violation of Illinois's biometric-privacy law. The mechanism the suit targets — diarization via voiceprint — is the same one nearly every product in this lesson relies on. That is the single most important engineering takeaway of the whole article, and we return to it below.

Fireflies' 2026 pricing: Free ($0, with 800 minutes of storage and the 20-credit pool), Pro at $10 per user per month annually ($18 monthly) with unlimited transcription and 8,000 minutes of storage per seat, Business at $19 with CRM sync and conversation intelligence, and Enterprise at $39. The CRM integrations most teams want — Salesforce, HubSpot — start at the Pro tier.

Fathom — free-first as an engineering and growth strategy

Fathom's defining choice is its business model, and that choice shapes its engineering. Where Otter and Fireflies meter minutes on their free tiers, Fathom gives away unlimited recording and storage forever and instead caps the AI layer — the free plan allows only five AI-generated summaries a month. The recording and transcript are free; the understanding is what you pay for.

This is a deliberate inversion of the cost structure. Capture and storage are cheap and getting cheaper; LLM-generated summaries cost real money per call because each one is an inference request against a large model. By giving away the cheap part and charging for the expensive part, Fathom acquires users at near-zero marginal cost and monetizes only the high-value action. Premium runs $19 a month (about $15 a month when billed annually) for unlimited summaries, action items, and follow-up emails; Team Edition adds admin controls, shared clips, and analytics from $29 a month (about $19 a month when billed annually), with a Team Edition Pro above it; and a Business tier at around $25 per user adds CRM sync.

Functionally, Fathom joins Zoom, Google Meet, and Teams as a bot, records and transcribes, then offers a ChatGPT-style chat over the meeting and more than fifteen summary templates tuned to specific workflows — sales frameworks like BANT and Sandler among them. Its integrations cover Slack, Salesforce, HubSpot, Notion, and Asana. For a builder, Fathom is the clearest case study in the lesson that the transcript is a commodity and the summary is the product — exactly the split we model in the real cost of AI in video products lesson.

Supernormal — the one that can skip the bot

Supernormal is the architectural outlier, and it is the "golden" keyword of this lesson — high intent, low competition. Its distinguishing engineering choice is capture flexibility: instead of forcing a bot into every call, it offers three capture modes — a traditional meeting bot, a Chrome extension, and a bot-free desktop app that records your computer's own sound without appearing in the participant list. That last mode is the bot-free pattern we cover in the landscape lesson, and offering all three lets a customer pick visibility per meeting rather than per product.

On the understanding layer, Supernormal runs a hybrid of a frontier model (GPT-4o) and its own proprietary models to produce its summaries — branded "The Gist" — which is a common 2026 pattern: a strong general model for fluency, smaller in-house models for the cheap, repetitive classification work. In 2026 Supernormal also repositioned from "meeting recorder" to "AI agent for agencies," using the meeting transcript as context to do work — drafting presentations, spreadsheets, and research reports from what was discussed — and a Memory feature that carries context across meetings with the same people.

Its pricing is straightforward: Free covers 15 meetings a month, Starter is $16 a month, Pro is $25 a month, and Enterprise is custom (roughly $40 per user), with about 20% off for annual billing. On compliance it is strong for a tool its size (SOC 2 attested, HIPAA-compliant with a Business Associate Agreement available, and GDPR-aligned), which makes it a credible pick for healthcare and other regulated settings where the bigger names require an enterprise contract to match.

The four side by side

The table below is a 2026 snapshot. Read it by architecture and model, not by the feature checkboxes, which are nearly identical across all four. Verify current numbers before committing — this category re-prices often.

| Product | Search demand (KD) | Capture method | Speech engine bet | Signature layer | Paid entry (annual) | Notable engineering fact |
| --- | --- | --- | --- | --- | --- | --- |
| Otter | otter ai 141K (KD 67) | Cloud bot, real-time | In-house ASR | OtterPilot, live transcript | $8.33/user/mo (Pro) | API only on Enterprise; lead defendant in the wiretap class action filed in 2025 |
| Fireflies | fireflies ai 21K (KD 49) | Cloud bot | Third-party ASR + own layers | AskFred + AI-credit metering | $10/user/mo (Pro) | Voiceprint diarization at center of Illinois BIPA suit |
| Fathom | fathom ai 11K (KD 42) | Cloud bot | Third-party ASR | Free unlimited recording, paid summaries | ~$15/user/mo (Premium) | Inverts cost model: gives away capture, charges for AI |
| Supernormal | supernormal ai 200 (KD 9) | Bot or bot-free or extension | GPT-4o + proprietary hybrid | Agentic output + Memory | $16/mo (Starter) | Only one of the four that can capture without a bot |

Two patterns jump out. First, search demand says little about engineering depth: Otter has roughly 700× the search volume of Supernormal, yet Supernormal is the only one of the four with capture flexibility, and it also has the lowest keyword difficulty to rank against. Second, the real differences are at the two ends of the pipeline, in how they capture (stage one) and how they package and price the output (stage five), exactly as Figure 1 predicted.

How to build this into your own product

Here is the question that actually pays. If you ship software where meetings happen — a sales tool, a telehealth platform, a recruiting product — you can stop paying per seat and put transcription inside your own app. There are three engineering paths, and they trade off control against effort in a predictable way.

Three-column comparison diagram titled three ways to build a meeting bot into your product. Column one, Rent a capture API, shows a single API such as Recall.ai that sends a bot into any platform and returns audio, video, and transcript, labelled fastest to ship, usage-based cost, you do not maintain browsers. Column two, Use a native platform API, shows Zoom Real-Time Media Streams and the Google Meet Media API delivering live media with no extra participant, labelled no visible bot, lowest latency, but locked to one platform and gated by the host. Column three, Extend your own real-time pipeline, shows a LiveKit agent or SFU-side ASR inside your existing WebRTC stack, labelled most control, no per-meeting fee, only works when the meeting is already your product. A footer line notes that the right path depends on whether the meeting happens on a platform you do not control or inside your own application.

Figure 2. The three build paths. Pick by where the meeting lives: on a third-party platform you must visit (rent a bot API, or use that platform's native stream), or inside your own application (extend the pipeline you already run).

Path one — rent a capture API. The fastest route is to never build a bot at all. An infrastructure provider like Recall.ai gives you one API that dispatches a meeting bot (or a bot-free desktop recorder) into Zoom, Teams, Google Meet, Webex, and more, and hands your product back the audio, video, and transcript through a webhook. You never operate a fleet of headless browsers, deal with each platform's quirks, or chase their UI changes. As of 2026, Recall.ai charges $0.50 per recording hour for both its bot API and its desktop SDK — down from $0.70 earlier in the year — plus $0.15 per hour for built-in transcription, with calendar integration free and no monthly platform fee.

Here is that math made concrete. Suppose your product's users collectively run 8,000 hours of recorded calls a month. Capture is 8,000 × $0.50 = $4,000. Add transcription at 8,000 × $0.15 = $1,200. Your monthly capture-and-transcription bill is $4,000 + $1,200 = $5,200, on top of which you add your own summary-model cost and storage. Whether that beats per-seat tools depends entirely on how many seats those 8,000 hours represent — which is the whole reason to run the numbers rather than guess.
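The same arithmetic as a small calculator, so you can plug in your own volume. The per-hour rates are the Recall.ai figures quoted above. The break-even helper and the $20 seat price (Otter's Business annual rate from earlier in this lesson) are illustrative assumptions, and the result ignores your own summary-model and storage costs.

```python
# Rates quoted in the text, kept in integer cents to avoid float rounding.
CAPTURE_CENTS_PER_HOUR = 50
TRANSCRIPTION_CENTS_PER_HOUR = 15


def monthly_bill_cents(hours: int) -> tuple[int, int, int]:
    """Return (capture, transcription, total) in cents for a month of recorded hours."""
    capture = hours * CAPTURE_CENTS_PER_HOUR
    transcription = hours * TRANSCRIPTION_CENTS_PER_HOUR
    return capture, transcription, capture + transcription


def break_even_seats(hours: int, seat_price_cents: int) -> int:
    """Smallest seat count whose per-seat spend meets or exceeds the usage-based bill."""
    total = monthly_bill_cents(hours)[2]
    return -(-total // seat_price_cents)  # ceiling division on integers


capture, transcription, total = monthly_bill_cents(8000)
print(f"capture ${capture / 100:,.0f}, transcription ${transcription / 100:,.0f}, "
      f"total ${total / 100:,.0f}")

# Illustrative comparison against a $20 per-seat tool.
seats = break_even_seats(8000, 2000)
print(f"break-even at {seats} seats")
```

Under these assumptions, if the 8,000 hours come from fewer than 260 seats, the per-seat tool is the cheaper way to get capture and transcription; above that, the usage-based bill wins before your own AI and storage costs are added.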

Path two — use a native platform streaming API. In 2026 the meeting platforms began offering a cleaner alternative to bots. Zoom's Real-Time Media Streams (RTMS) is now generally available: it pipes live audio, video, and transcript data straight to your app with no extra participant in the call and millisecond-level delivery. Google's Meet Media API gives the same kind of native access to real-time streams over WebRTC. The catch is platform lock-in and gating: RTMS is Zoom-only and works only when the host's organization has enabled it; the Meet API is Google-only and still preview-gated; and Microsoft Teams has no direct equivalent — its real-time media platform expects application-hosted media bots written in C#/.NET. So a native API is the lowest-latency, bot-free option for a single platform, but you still need a fallback (usually a rented bot) for the platforms that lack one.

Path three — extend a real-time pipeline you already own. If the meeting is your product — a conferencing or telehealth app you already run on WebRTC — you do not need a bot to join from outside, because your media server is already in the call. You can transcribe centrally on the server and fan captions out to everyone (the SFU-side ASR pattern), run a LiveKit agent as a silent participant that produces the same note-taker output these products sell, or push recognition all the way into the browser. This path has no per-meeting fee and total control over the data, but it only applies when you own the call.

The decision among the three is not about which is "best" — it is about where the meeting lives. If your users meet on platforms you do not control, you rent a bot API (and optionally layer native APIs where available). If the meeting happens inside your own application, you extend your own pipeline. Most products that need broad coverage start by renting, then add native APIs and their own pipeline as volume justifies the engineering.
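The decision rule above is simple enough to write down. The sketch below encodes it under the assumptions stated in this lesson: native streaming exists for Zoom and Google Meet only, and it is usable only when access has been granted. The platform keys, function name, and `native_access_granted` flag are hypothetical.

```python
from enum import Enum


class Path(Enum):
    RENT_BOT_API = "rent a capture API"
    NATIVE_STREAM = "use a native platform streaming API"
    OWN_PIPELINE = "extend your own real-time pipeline"


# Platforms this lesson describes as offering a native stream (both gated).
NATIVE_STREAM_PLATFORMS = {"zoom", "google_meet"}


def choose_paths(platform: str, owns_call: bool, native_access_granted: bool = False):
    """Return build paths in order of preference for one meeting."""
    if owns_call:
        # Your media server is already in the call, so no outside bot is needed.
        return [Path.OWN_PIPELINE]
    paths = []
    if platform in NATIVE_STREAM_PLATFORMS and native_access_granted:
        paths.append(Path.NATIVE_STREAM)
    # A rented bot stays in the list as the fallback for every third-party platform.
    paths.append(Path.RENT_BOT_API)
    return paths


print(choose_paths("teams", owns_call=False))
print(choose_paths("zoom", owns_call=False, native_access_granted=True))
print(choose_paths("in_app", owns_call=True))
```

Returning an ordered list rather than a single answer reflects the point that a native API rarely removes the need for a fallback.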

The consent layer is an engineering requirement, not a legal afterthought

Every build path above shares one obligation that the Otter and Fireflies lawsuits turned from "nice to have" into "ship-blocker." Treat the following as engineering-relevant context, not legal advice; confirm specifics with qualified counsel for your jurisdiction.

Two facts collide. First, recording-consent law in the United States splits by state: 39 states plus the District of Columbia allow one-party consent (if you are in the conversation, you may record), but 11, including California, Illinois, and Pennsylvania, require all parties to agree. Under the EU's GDPR you need a lawful basis and informed consent to capture someone's voice at all. Second, and this is the part engineers miss: the diarization feature that gives you clean speaker labels often works by building a voiceprint, and a voiceprint is increasingly treated as a protected biometric identifier, which is the exact theory behind the Illinois BIPA claims against both Otter and Fireflies. The feature that makes your transcript readable is the same feature that creates legal exposure.

Common pitfall: assuming the visible bot is consent. A bot named "Notetaker" in the participant list is not, by itself, the informed agreement the law requires; no major jurisdiction treats "they could see it" as consent. And a bot-free capture mode is quieter, not safer: removing the visible participant removes the social signal, which raises your disclosure duty rather than lowering it. If you build any of the three paths above, you must design consent in from the first sprint: disclose the AI before it engages, capture explicit agreement (especially in all-party-consent states and for any voiceprint/diarization step), give participants a real way to decline, and keep a retention and deletion schedule. The EU AI Act's Article 50 makes pre-engagement disclosure a hard product requirement in the EU. This is the same disclosure discipline we cover in the EU AI Act and disclosure engineering lesson.
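One way to make those design steps enforceable is a gate that the capture code must pass before it records or builds voiceprints. The sketch below is an engineering illustration only, not a statement of what any law requires. The `Participant` fields, the `all_party_required` flag (a policy input your counsel would set), and the rule that voiceprint diarization needs every participant's opt-in are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    disclosed: bool             # saw the AI disclosure before capture began
    agreed_to_recording: bool   # explicit agreement to be recorded
    agreed_to_voiceprint: bool  # separate explicit opt-in for speaker recognition


def capture_plan(participants, all_party_required: bool) -> dict:
    """Decide which capture steps may run; anything not clearly allowed stays off."""
    blocked = {"record": False, "voiceprint_diarization": False}
    if not participants or not all(p.disclosed for p in participants):
        return blocked  # disclosure comes before everything else
    if all_party_required:
        may_record = all(p.agreed_to_recording for p in participants)
    else:
        may_record = any(p.agreed_to_recording for p in participants)
    # Assumed conservative rule: one missing opt-in disables voiceprints for the call.
    may_voiceprint = may_record and all(p.agreed_to_voiceprint for p in participants)
    return {"record": may_record, "voiceprint_diarization": may_voiceprint}


call = [
    Participant("Maria", disclosed=True, agreed_to_recording=True, agreed_to_voiceprint=True),
    Participant("Sam", disclosed=True, agreed_to_recording=True, agreed_to_voiceprint=False),
]
plan = capture_plan(call, all_party_required=True)
print(plan)
```

Here the call may be recorded but speaker recognition stays off because one participant declined it, which gives "a real way to decline" a concrete effect in the product.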

Vertical flow diagram titled where consent must live in a meeting bot, showing the bot lifecycle down the left with legal checkpoints flagged on the right. Step one, bot is invited or auto-joins from calendar, checkpoint disclose the AI presence before it engages. Step two, bot enters the call and appears or stays hidden, checkpoint a visible name in the participant list is not legal consent. Step three, bot captures per-speaker audio, checkpoint all-party-consent states require everyone to agree before recording. Step four, diarization builds a voiceprint to label speakers, checkpoint a voiceprint is biometric data and may need explicit opt-in under laws like Illinois BIPA. Step five, transcript and summary stored and delivered, checkpoint keep a retention and deletion schedule and let participants decline. A footer notes that these checkpoints apply whether you rent a bot API, use a native platform stream, or build your own pipeline.

Figure 3. Consent is not one gate at the start; it attaches to specific steps of the bot's lifecycle, and the voiceprint step (diarization) is the one currently in court.

Where Fora Soft fits in

We build the video products that these transcription features live inside — conferencing platforms, telemedicine apps, e-learning tools, and surveillance systems — so we meet the buy-versus-build decision from the engineering side regularly. When a client only needs notes beside their meetings, we help them pick a tool and integrate it cleanly. When transcription has to live inside the product — a telehealth visit note generated on the platform itself, live captions fanned out to every webinar viewer, a sales summary rendered on the client's own dashboard — we build it on a real-time WebRTC pipeline so capture is native to the app rather than bolted on by an outside bot, with consent, disclosure, and a deletion schedule designed in from the first sprint. The three build paths in this lesson are the same ones we weigh with clients when they decide whether to keep renting a note-taker or own the capability.

What to read next

  • Talk to a video engineer — scope a meeting-bot or in-product transcription feature for your conferencing, telehealth, or sales platform: book a 30-minute call.
  • See our case studies — real-time video and AI features we have shipped: our work.
  • Download the Meeting Bot Engineering Cheat Sheet — the five-stage anatomy, the four products at a glance, the three build paths with the Recall.ai cost math, and the consent checkpoints on one page: Meeting Bot Engineering Cheat Sheet.

References

  1. Otter.ai — Pricing and product pages (2026). Tiers and limits: Basic free 300 min/mo with 90-min per-meeting cap; Pro $16.99/mo or $8.33/mo annual at 1,200 min/mo; Business $20/user/mo annual ($30 monthly), no meeting cap, 3 concurrent meetings; Enterprise custom with API access, SSO, SOC 2, OtterPilot for Sales. Tier 4 (vendor primary). https://otter.ai/pricing
  2. Otter.ai — Product/About (how it works). In-house ASR trained on a large audio corpus, paired with NLP and ML; real-time transcription with speaker labels and timestamps across Zoom, Google Meet, Teams; ~4 languages (English US/UK, Spanish, French). Tier 4. https://otter.ai/
  3. The National Law Review / CIPAWorld — "New Wave of Privacy Litigation Targets AI Notetaker, Otter.ai" (Nov 2025). Brewer v. Otter.ai, No. 5:25-cv-06911 (N.D. Cal., filed 15 Aug 2025); four suits consolidated 22 Oct 2025 under Judge Eumi K. Lee; claims under ECPA, CIPA, BIPA, CFAA; alleges recording non-users and building voiceprints without consent. Tier 7 (legal press; confirm with counsel). https://natlawreview.com/article/take-note-new-wave-privacy-litigation-targets-ai-notetaker-otterai
  4. Fireflies.ai — Pricing and Knowledge Base (2026). Free $0 (800-min storage, ~20 one-time AI credits); Pro $10/user/mo annual ($18 monthly) unlimited transcription + 8,000 min/seat storage; Business $19 with CRM sync + conversation intelligence; Enterprise $39; AI-credit metering for AskFred/Sales Assist/CRM autofill/Voice Agents; Salesforce/HubSpot/Slack from Pro. Tier 4. https://fireflies.ai/pricing
  5. Epstein Becker Green / National Law Review — "AI Meeting Assistants and Biometric Privacy: Lessons from the Fireflies.AI Lawsuit" (2026). Cruz v. Fireflies.AI Corp., 3:25-cv-03399 (N.D. Ill., Dec 2025); non-user plaintiff; "Speaker Recognition" generates voiceprints; Illinois BIPA defines voiceprints as biometric identifiers requiring notice and written consent. Tier 7 (legal analysis; confirm with counsel). https://www.ebglaw.com/insights/publications/ai-meeting-assistants-and-biometric-privacy-lessons-from-the-fireflies-ai-lawsuit
  6. Fathom — Pricing and Help Center (2026). Free: unlimited recording + storage forever, 5 AI summaries/mo cap; Premium $19/mo (~$15 annual) unlimited summaries/action items/follow-ups; Team Edition from $29/mo (~$19 annual) with admin/clips/analytics, Team Edition Pro above it; Business ~$25/user adds CRM sync; 15+ templates (BANT, Sandler); ChatGPT-style chat; Slack/Salesforce/HubSpot/Notion/Asana. Tier 4. https://www.fathom.ai/pricing
  7. Supernormal — Product, pricing, and security pages (2026). Capture modes: meeting bot, Chrome extension, and bot-free desktop app; GPT-4o + proprietary hybrid for "The Gist"; 2026 reposition to agentic output (decks, sheets, reports) + Memory; Free 15 meetings/mo, Starter $16/mo, Pro $25/mo, Enterprise ~$40/user; SOC 2, HIPAA (BAA), GDPR. Tier 4. https://www.supernormal.com/
  8. Recall.ai — "How to build a meeting notetaker" + "New Recall.ai Pricing for 2026" (updated May 2026). Bot dispatched into Zoom/Teams/Meet/Webex returns audio/video/transcript via webhook; $0.50/recording hour for Bot API and Desktop SDK (down from $0.70), $0.15/hr transcription, free Calendar API, no platform fee. Tier 4 (production deployer). https://www.recall.ai/blog/new-recall-ai-pricing-for-2026
  9. Recall.ai — "What is Zoom RTMS?" + Zoom Developer Docs "Realtime Media Streams." RTMS in GA: live audio/video/transcript to your app with no extra participant and millisecond delivery; Zoom-only and host-org-gated. Native bot-free capture for a single platform. Tier 4. https://developers.zoom.us/docs/rtms/
  10. Recall.ai — "What is the Google Meet Media API?" Native first-class access to real-time Meet media over WebRTC with no extra participant; Google-only and preview-gated; Microsoft Teams has no direct equivalent (app-hosted media bots in C#/.NET). Tier 4. https://www.recall.ai/blog/what-is-the-google-meet-media-api
  11. W3C Recommendation — "WebRTC: Real-Time Communication in Browsers" (13 March 2025). The finished real-time transport a meeting bot uses to join a call and receive each participant's audio/video, including the per-stream identifiers that enable per-speaker separation. Official standard (final Recommendation). https://www.w3.org/TR/webrtc/
  12. IETF RFC 6716 — "Definition of the Opus Audio Codec" (September 2012). The default WebRTC audio codec carrying each participant's microphone audio that every bot decodes before ASR. Official standard (final RFC). https://www.rfc-editor.org/rfc/rfc6716
  13. Regulation (EU) 2024/1689 (EU AI Act), Article 50 — Transparency obligations. Duty to inform people when they interact with or are processed by certain AI systems — the legal basis for disclosing AI transcription before a bot engages. Official EU regulation. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
  14. Regulation (EU) 2016/679 (GDPR). Lawful-basis and informed-consent requirements for capturing voice and other personal data — the European baseline a visible bot does not by itself satisfy. Official EU regulation. https://eur-lex.europa.eu/eli/reg/2016/679/oj
  15. Illinois Biometric Information Privacy Act, 740 ILCS 14 (BIPA). Defines a voiceprint as a biometric identifier and requires notice, written consent, and a public retention/deletion schedule before collection — the statute at the center of the Otter and Fireflies suits. Official statute (US state law). https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004