This is engineering guidance, not legal advice. Confirm specifics with qualified counsel.

Why this matters

If you are scoping a telemedicine product, the topology decision quietly sets your infrastructure cost, your maximum session size, your recording architecture, and which video vendors are even on your shortlist. It is one of the few choices that is genuinely hard to reverse after launch, because it is wired into the client, the server, and the billing model all at once. A founder who picks peer-to-peer to "save on servers" discovers the real cost the first time a clinic runs a four-person care-team meeting and every laptop fan spins up. A team that defaults to the most powerful server type for everything discovers it on the first cloud invoice. This article gives you the mental model to make the call deliberately, in week one, and to test whether a vendor's pitch survives a clinical use case rather than a demo.

The one question every group call has to answer

A video call is, underneath, a routing problem. Each participant produces a stream of audio and video, and that stream has to reach everyone else in the call with under half a second of delay. The reason a two-person call feels simple and an eight-person call feels hard is that the number of connections grows much faster than the number of people.

Before we name the three topologies, fix one term. A participant's upstream is the data their device sends out — their own camera and microphone, encoded and pushed onto the network. Their downstream is everything they receive. Home internet connections are deliberately lopsided: a typical residential link might offer 100 Mbps down but only 10–20 Mbps up, because consumers download far more than they upload. In a video call, upstream is the scarce resource, and the topology you choose decides how much upstream each person needs. Keep that asymmetry in mind — it is the hinge the entire decision turns on.

The three answers differ in one variable: where the copying happens. Either each sender makes its own copies and ships them to every recipient (peer-to-peer), or a server makes the copies (an SFU), or a server makes one merged picture so no copying is needed downstream (an MCU). That is the whole taxonomy. Everything else is consequence.
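To make the "where the copying happens" distinction concrete, here is a minimal Python sketch that counts the video streams a single device sends and receives under each topology. The names `StreamLoad` and `stream_load` are hypothetical, and the model is deliberately simplified: it assumes one video stream per participant and ignores simulcast layers and audio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamLoad:
    """Video streams one participant's device handles in a call."""
    sent: int      # streams uploaded (upstream)
    received: int  # streams downloaded (downstream)


def stream_load(topology: str, participants: int) -> StreamLoad:
    """Count streams per device for 'p2p', 'sfu', or 'mcu' (simplified model)."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    others = participants - 1
    if topology == "p2p":
        # Each sender makes its own copies: one per other participant.
        return StreamLoad(sent=others, received=others)
    if topology == "sfu":
        # The server makes the copies: upload once, receive one per other person.
        return StreamLoad(sent=1, received=others)
    if topology == "mcu":
        # The server merges everything: one stream up, one mixed stream down.
        return StreamLoad(sent=1, received=1)
    raise ValueError(f"unknown topology: {topology!r}")


for name in ("p2p", "sfu", "mcu"):
    print(name, stream_load(name, participants=4))
```

Only the `sent` column changes the patient's uplink requirement, which is why the rest of the article keeps returning to it.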

[Figure: three WebRTC topologies side by side: peer-to-peer mesh with every device connected directly, SFU forwarding separate streams, MCU mixing into one combined stream]

Figure 1. The three topologies. Peer-to-peer has no server in the media path; an SFU forwards each stream untouched; an MCU blends every stream into a single combined picture. The server, when present, sits inside the HIPAA boundary.

For the packet-by-packet internals of how an SFU forwards media, how simulcast layers work, and how WebRTC negotiates all of this, the Video Streaming section has the deep dive — SFU, MCU, and mesh topologies explained. Here we stay on the clinical decision: which one fits which kind of visit, and what each costs you in money and in patient experience.

Peer-to-peer: perfect for two, useless for three

Peer-to-peer — also called a full mesh when more than two people join — means each device sends its stream directly to each other device, with no server in the middle of the media. (A signaling server still helps the browsers find each other at the start; it just steps out of the way once media flows.) Remember from WebRTC for telemedicine that even a "direct" connection often passes through a TURN relay when a firewall blocks the direct path — but the relay only forwards bytes blindly, so the topology is still peer-to-peer in spirit.

For a one-to-one consult, peer-to-peer is the best answer there is. Latency is lowest because the media takes the shortest path. There is no media server to run, scale, or pay for. And the audio and video never touch your infrastructure, which is a genuine privacy advantage — fewer servers in the media path means fewer servers inside your HIPAA boundary to secure and cover with contracts.

Then a third person joins, and the arithmetic turns hostile. In a mesh, every participant must send a separate copy of their own video to every other participant. Walk the upstream math out loud, assuming a normal 720p stream of about 1.5 Mbps:

2 people: each sends 1 copy  → 1 × 1.5 = 1.5 Mbps up    — comfortable
3 people: each sends 2 copies → 2 × 1.5 = 3.0 Mbps up    — strains home uplinks
4 people: each sends 3 copies → 3 × 1.5 = 4.5 Mbps up    — exceeds many; phones throttle
5 people: each sends 4 copies → 4 × 1.5 = 6.0 Mbps up    — fails on most home networks

The pattern is that each person's upstream grows with every participant added, and it grows on the weakest device in the call — which in telemedicine is almost always the patient's phone, not the clinician's workstation. Encoding four separate video streams also burns CPU and battery; the phone heats up, the fan-less tablet throttles, and the picture degrades for everyone. This is why mesh is a dead end past two or three participants, and why no serious telemedicine platform uses it for group sessions.
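The upstream arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, the 1.5 Mbps default is the article's 720p assumption, and the size estimate optimistically assumes the whole uplink is available for video, which real networks rarely allow.

```python
def mesh_upstream_mbps(participants: int, stream_mbps: float = 1.5) -> float:
    """Upstream each device needs in a full mesh: one copy per other participant."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    return (participants - 1) * stream_mbps


def max_mesh_size(uplink_mbps: float, stream_mbps: float = 1.5) -> int:
    """Largest mesh the weakest uplink can feed, with no headroom reserved."""
    copies = int(uplink_mbps // stream_mbps)  # whole copies that fit in the uplink
    return copies + 1                         # the sender plus one person per copy


for n in range(2, 6):
    print(f"{n} people: {mesh_upstream_mbps(n):.1f} Mbps up per device")
```

Running `max_mesh_size` against the slowest uplink you expect to serve, rather than the average, gives the honest ceiling for a mesh design, because the call degrades at the weakest device.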

[Figure: bandwidth chart in which peer-to-peer upstream rises with each participant while SFU and MCU stay flat at one stream]

Figure 2. Why the third participant breaks peer-to-peer. Mesh upstream climbs with every person added; a server topology holds each person's upstream flat at a single stream.

There is a second, quieter reason production telemedicine routes even 1:1 calls through a server: a pure peer-to-peer call gives your operations team almost nothing to work with — no server-side quality metrics, no place to attach a recorder, no way to apply regional routing. We will come back to that after the server topologies are on the table.

SFU: the workhorse of clinical video

An SFU — a Selective Forwarding Unit — is a server that sits in the middle of the call and forwards streams without changing them. Each participant uploads their stream once, to the server, and the server forwards a copy to everyone else. The "selective" part is the cleverness: the server can decide, per recipient, which streams and which quality levels to send, so a patient on a weak phone receives smaller versions while a clinician on fiber receives full quality.
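As a rough illustration of the "selective" part, the sketch below picks one quality layer per recipient so that all forwarded streams fit that recipient's downlink. Everything here is an assumption for illustration: the layer names, the 0.5 and 0.15 Mbps bitrates, and the idea that the downlink budget is simply known. A production SFU estimates available bandwidth continuously from network feedback and may choose a different layer for each forwarded stream.

```python
# Illustrative simulcast layers in Mbps. Only the 1.5 Mbps figure comes from
# the article's 720p assumption; the other two values are made up.
LAYERS_MBPS = {"high": 1.5, "mid": 0.5, "low": 0.15}


def pick_layer(downlink_mbps: float, streams: int) -> str:
    """Return the best layer whose total across all forwarded streams fits the downlink."""
    if streams < 1:
        raise ValueError("need at least one stream to forward")
    best_first = sorted(LAYERS_MBPS.items(), key=lambda kv: kv[1], reverse=True)
    for name, mbps in best_first:
        if streams * mbps <= downlink_mbps:
            return name
    # Nothing fits: fall back to the smallest layer rather than sending nothing.
    return min(LAYERS_MBPS, key=LAYERS_MBPS.get)


# A ten-person session means nine forwarded streams per recipient.
print(pick_layer(downlink_mbps=50.0, streams=9))  # clinician on fiber
print(pick_layer(downlink_mbps=3.0, streams=9))   # patient on a weak phone
```

The design point is that the decision is made per recipient on the server, so one weak connection does not lower quality for everyone else.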

The effect on the scarce resource is decisive. Each person's upstream stays flat at one stream — 1.5 Mbps — no matter how many people are in the session. A patient's phone encodes its camera once and uploads once, whether the call has three participants or thirteen. The cost of all the copying moves to the server's bandwidth, which is a line item you can forecast and buy, instead of to the patient's uplink and battery, which you cannot.
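Because the copying cost moves to the server, it can be forecast. The following sketch estimates an SFU's bandwidth for one session as an upper bound, assuming every stream is forwarded at full quality to every other participant. The function names are hypothetical, and real egress is usually lower once selective forwarding drops or downsizes streams.

```python
def sfu_bandwidth_mbps(participants: int, stream_mbps: float = 1.5) -> tuple[float, float]:
    """Return (ingress, egress) server bandwidth in Mbps for one session (upper bound)."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    ingress = participants * stream_mbps                      # one upload per person
    egress = participants * (participants - 1) * stream_mbps  # each stream to everyone else
    return ingress, egress


def session_egress_gb(participants: int, minutes: float, stream_mbps: float = 1.5) -> float:
    """Server egress for one session in decimal gigabytes (upper bound)."""
    _, egress_mbps = sfu_bandwidth_mbps(participants, stream_mbps)
    megabits = egress_mbps * minutes * 60
    return megabits / 8 / 1000  # megabits -> megabytes -> gigabytes


ingress, egress = sfu_bandwidth_mbps(10)
print(f"10 people: {ingress:.1f} Mbps in, {egress:.1f} Mbps out")
print(f"50-minute session: about {session_egress_gb(10, 50):.1f} GB egress")
```

Multiplying the per-session figure by your expected session count gives a first-pass egress estimate to compare against a provider's bandwidth pricing.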

That single property is why the SFU is the default for essentially every multi-party clinical session:

  • Group therapy and intensive outpatient programs. Behavioral health moved to virtual delivery at scale — therapy delivered by telehealth in the US jumped from under 40% to roughly 88% of sessions after 2020 [6] — and a group session of eight to twelve participants is only feasible on an SFU. Platforms built for this, like the behavioral-health tools clinicians use today, add moderator layouts, active-speaker highlighting, and the ability to mute or remove a participant [7] — all of which require the server to see each stream separately, which is exactly what an SFU preserves and an MCU destroys.
  • Family-joined and caregiver visits. An adult child dialing into an elderly parent's appointment from another city is a three-party call — already past the mesh limit.
  • Multi-provider and supervised consults. A specialist joining a primary-care visit, or a supervising physician observing a resident, needs every face visible and independently controllable.
  • Interpreter-assisted visits. A medical interpreter on the call needs their own audio channel and their own video tile; the multi-party and multi-role consults article covers the role and permission mechanics layered on top of the SFU.

The SFU also keeps each participant's video as a separate track, which means the layout lives on each person's screen, not baked into the stream. A clinician can pin the patient's face, switch to a grid for a group, or hide their own self-view, all without asking the server to do anything different. That flexibility is clinically useful and it is the SFU's structural advantage over the MCU.

The cost is real but moderate and predictable. As an order-of-magnitude anchor from published infrastructure benchmarks, an SFU serving on the order of 100 concurrent participants runs roughly $300–500 per month in server bandwidth, with no heavy server-side processing [1]. The trade you accept is that the call is no longer peer-to-peer: the SFU is now a server in your media path, and so it sits squarely inside your HIPAA boundary. It must be covered by a Business Associate Agreement if a vendor runs it, or self-hosted inside your own covered environment, with audit logging and access controls around it.

MCU: powerful, expensive, and only sometimes right

An MCU — a Multipoint Control Unit — is the oldest of the three designs, inherited from the conference-room hardware era. Instead of forwarding streams, the MCU does the heavy work itself: it decodes every incoming stream, composites them into a single merged picture (the familiar grid or picture-in-picture layout), re-encodes that one combined stream, and sends it to each participant. From the participant's point of view, the call looks like a single incoming video — one stream down, one stream up — regardless of how many people are present.

That single-stream output is the MCU's whole reason to exist, and it solves two specific problems that an SFU cannot:

  • The low-powered or legacy endpoint. A device that can only decode one video stream — an old waiting-room tablet, an exam-room video cart, a set-top box, or an integration with legacy hospital conferencing hardware — cannot handle the four or eight separate streams an SFU sends. The MCU pre-mixes everything into the one stream the device can play. If your product must reach a fleet of cheap or fixed endpoints, this is the argument for an MCU.
  • The single combined recording. When you need one tidy video file of the whole session — the grid view, everyone in one frame, ready for the medical record or later review — the MCU produces it natively because it already composites the picture. An SFU records each track separately, which then requires a separate compositing step to produce the same combined file. (Many platforms solve recording with a dedicated server-side recorder rather than a full MCU; the recording clinical sessions article covers the architecture, and the encryption tension recording creates is the subject of end-to-end encryption in telehealth.)

[Figure: decision tree mapping participant count, endpoint capability, and recording needs to the right topology: peer-to-peer, SFU, or MCU]

Figure 3. The clinical topology decision. Participant count rules out mesh past two; endpoint power and combined-recording needs are the only reasons to pay for an MCU.

What you pay for that power is steep. Decoding, mixing, and re-encoding video in real time is processor-intensive work, and the MCU does it for every session continuously. Its CPU cost is far higher than an SFU's, because an SFU never decodes or re-encodes anything — it just forwards [1][5]. Published benchmarks put a broadcast-grade mixed-output deployment in the range of $2,000–5,000 per month against the $300–500 an SFU costs for a comparable participant load, and one enterprise reported saving roughly $200,000 a year by moving most conferences off an MCU-everywhere design onto a hybrid that used the MCU only where it was actually needed [1]. The re-encoding also adds latency — the extra decode-mix-encode round adds tens to a couple hundred milliseconds — and it destroys the per-participant layout flexibility, because every viewer receives the same baked-in grid.
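To see why a hybrid saves money, the sketch below blends the two order-of-magnitude cost anchors cited above [1] by the share of load that genuinely needs mixed output. It is a back-of-envelope model only: the function name is hypothetical, and it assumes cost scales linearly with the share of participant load, which real pricing will not follow exactly.

```python
# Order-of-magnitude monthly anchors from the article, at ~100 concurrent participants [1].
SFU_MONTHLY_USD = (300, 500)
MCU_MONTHLY_USD = (2000, 5000)


def blended_monthly_cost(mcu_share: float) -> tuple[float, float]:
    """Return a (low, high) monthly estimate when `mcu_share` of the load runs on an MCU.

    Assumes linear scaling between the two anchors, which is a simplification.
    """
    if not 0.0 <= mcu_share <= 1.0:
        raise ValueError("mcu_share must be between 0 and 1")
    sfu_share = 1.0 - mcu_share
    low = sfu_share * SFU_MONTHLY_USD[0] + mcu_share * MCU_MONTHLY_USD[0]
    high = sfu_share * SFU_MONTHLY_USD[1] + mcu_share * MCU_MONTHLY_USD[1]
    return low, high


for share in (0.0, 0.1, 1.0):
    low, high = blended_monthly_cost(share)
    print(f"MCU share {share:.0%}: ${low:,.0f} to ${high:,.0f} per month")
```

Even under this crude model, keeping the MCU to a small share of sessions leaves the bill much closer to the SFU anchor than to the MCU one.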

A worked decision: three real telemedicine sessions

Put the model to work on three sessions a telemedicine product actually runs.

A routine medication-refill consult is one patient, one clinician, ten minutes, no recording, on the patient's phone. Peer-to-peer is the textbook fit: lowest latency, no server cost, media off your infrastructure. In practice many platforms still route it through their SFU for the operational visibility — but on topology alone, two people means peer-to-peer is defensible.

A group therapy session is one therapist and nine clients for fifty minutes, with active-speaker view and the ability to mute a participant. Mesh is out: at ten people each phone would need to upload nine copies, about 13.5 Mbps of upstream, which consumes most or all of a typical 10–20 Mbps home uplink and is well beyond what a phone can encode and send reliably. An MCU would work but would flatten everyone into one grid and remove the therapist's per-tile control. The SFU is the clear answer, and it is what real behavioral-health platforms use.

A tele-ICU monitoring feed to an exam-room cart is a fixed, low-powered endpoint that must display a combined view of several remote participants and record the session as one file for the record. Here the SFU's separate streams are a liability — the cart cannot decode them all — and the MCU's single composited stream is exactly right. This is the minority case where the MCU's cost is justified by a hard endpoint constraint, and it is covered further in tele-stroke, tele-ICU, and acute care.

The lesson is that the same product needs different topologies for different sessions. Mature telemedicine platforms run a hybrid: an SFU as the default for live multi-party consults, peer-to-peer (or SFU) for 1:1, and an MCU or a dedicated compositing recorder invoked only for the sessions that genuinely need a single combined stream. The decision is per-session, not per-platform.
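The per-session decision can be written down as a small function. This is one possible encoding of the article's decision logic, not a definitive rule set: the `Session` fields, the return labels, and the choice to prefer an SFU plus a compositing recorder when only the recording needs a combined picture are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    participants: int
    single_stream_endpoint: bool = False   # e.g. a cart that decodes exactly one stream
    combined_recording: bool = False       # one composited file of the whole session
    needs_server_visibility: bool = False  # server-side metrics, recorder, regional routing


def choose_topology(s: Session) -> str:
    """Pick a topology for one session, checking the hardest constraint first."""
    if s.participants < 2:
        raise ValueError("a call needs at least two participants")
    if s.single_stream_endpoint:
        return "mcu"  # the endpoint cannot decode separate streams
    if s.combined_recording:
        return "sfu + compositing recorder"  # an MCU would also satisfy this
    if s.participants == 2 and not s.needs_server_visibility:
        return "p2p"
    return "sfu"


refill = Session(participants=2)
group_therapy = Session(participants=10)
tele_icu = Session(participants=4, single_stream_endpoint=True, combined_recording=True)

for label, session in [("refill", refill), ("group therapy", group_therapy), ("tele-ICU", tele_icu)]:
    print(label, "->", choose_topology(session))
```

Checking the endpoint constraint before the participant count mirrors the worked examples: a cart that decodes one stream forces the MCU regardless of how many people join.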

Comparing the three at a glance

| Criterion | Peer-to-peer (mesh) | SFU | MCU |
|---|---|---|---|
| Server in media path | No | Yes (forwards streams) | Yes (mixes streams) |
| Upstream per participant | Grows with party size | Flat (1 stream) | Flat (1 stream) |
| Server CPU cost | None | Low (no transcoding) | High (decode + mix + encode) |
| Typical monthly cost (~100 concurrent) | ~$0 server | ~$300–500 [1] | ~$2,000–5,000 [1] |
| Added latency | Lowest | Low (forward only) | Higher (re-encode adds tens–200 ms) |
| Layout control | Per client | Per client (separate tracks) | Fixed (baked-in grid) |
| Recording | Build it client-side; fragile | Per-track; composite separately | Single combined file, native |
| Best clinical fit | 1:1 consult | Group therapy, family, multi-provider, interpreter | Low-power/legacy endpoint, single combined recording |
| BAA / PHI boundary | TURN relay only | SFU inside boundary; BAA or self-host | MCU inside boundary; BAA or self-host |

Table 1. The topology decision at a glance. Participant count rules out mesh past two; only an endpoint constraint or a combined-recording requirement justifies the MCU's cost. Every server in the media path needs a signed BAA or must be self-hosted, because it handles patient video — Protected Health Information.

A note on that last row, because it is the one teams skip. Whether a server "sees" the unencrypted media or only forwards encrypted packets, the conservative and correct reading for healthcare is that any server in the media path is inside your compliance boundary. An SFU that forwards already-encrypted packets, an MCU that decrypts and re-mixes them — both handle the patient's session and both need a Business Associate Agreement. The MCU is the stricter case: because it decodes and re-encodes, it necessarily has the media in the clear in memory, which rules out any true end-to-end-encryption posture. If a behavioral-health product needs E2EE, the MCU is off the table by definition.
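The boundary rule in that last row is simple enough to encode as a pre-flight check. The sketch below follows the conservative reading described above, in which any server in the media path counts as inside the compliance boundary. The function names and labels are hypothetical, and the output is an engineering checklist, not a legal determination.

```python
TOPOLOGIES = ("p2p", "sfu", "mcu")


def media_servers_in_boundary(topology: str, uses_turn: bool = True) -> list[str]:
    """List the media-path servers that need a BAA or self-hosting (conservative reading)."""
    if topology not in TOPOLOGIES:
        raise ValueError(f"unknown topology: {topology!r}")
    servers = []
    if topology != "p2p":
        servers.append(topology)      # the SFU or MCU itself handles the session
    if uses_turn:
        servers.append("turn relay")  # relays media when the direct path is blocked
    return servers


def e2ee_possible(topology: str) -> bool:
    """An MCU must decode media to mix it, so true end-to-end encryption is ruled out."""
    if topology not in TOPOLOGIES:
        raise ValueError(f"unknown topology: {topology!r}")
    return topology != "mcu"


for t in TOPOLOGIES:
    print(t, media_servers_in_boundary(t), "E2EE possible:", e2ee_possible(t))
```

A check like this is most useful in design review, where it forces the E2EE requirement to be stated before an MCU-based vendor reaches the shortlist.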

Common mistake: choosing the topology from the demo instead of the hardest real session. A vendor demo is two people on good office Wi-Fi, where every topology looks identical. The architecture only reveals itself under the ten-person group session on home networks, the family member joining from a phone on cellular, or the exam-room cart that can decode exactly one stream. Scope the topology against your worst clinical session, not your prettiest demo — and ask the vendor what happens when a third participant joins, because that single question separates the topologies.

How the build-vs-buy choice maps onto topology

You rarely build an SFU or MCU from scratch. The practical decision is whether you self-host an open-source media server or buy a managed video API, and that decision is the subject of choosing the video layer. Topology is one input to it.

Open-source SFUs — mediasoup, Janus, LiveKit — give you full control and the lowest per-minute cost at scale, at the price of operating the servers yourself and signing your own infrastructure into your HIPAA boundary. Managed video APIs run the SFU (and, where offered, the MCU or a recording composer) for you, and the topology is mostly hidden behind their API — but the one thing you must verify is the Business Associate Agreement, because that server is now a vendor touching patient data. The state of the market in 2026 is that most established healthcare-focused APIs will sign a BAA, but the terms and the price differ sharply:

| Video API / SDK | BAA available? | Notes (as of 2026) |
|---|---|---|
| Whereby Embedded | Yes | BAA free on Enterprise; HIPAA add-on ~$16.99/mo on the Build plan; recordings to your own S3 [4] |
| Daily | Yes | HIPAA + BAA via a healthcare add-on at $500/mo; built by WebRTC spec authors [4] |
| Vonage Video API | Yes | Video, voice, and SMS under a single BAA; ongoing third-party HIPAA audits [4] |
| Twilio Video | Yes (addendum) | HIPAA-eligible via Business Associate Addendum, enterprise plan; group rooms for multi-party; note its 2023 sunset-then-reversal history [4] |
| Zoom Video SDK | Yes (qualifying plans) | BAA required and only on qualifying paid plans; some AI features unavailable under a BAA; configuration is involved [4] |
| Pexip | Yes | Enterprise; supports on-premise / private-cloud hosting and composited output for hospital endpoints [4] |
| CometChat | Yes | Video, voice, messaging under one BAA; HIPAA, SOC 2, HITRUST certifications [4] |

Table 2. A snapshot of BAA availability across managed video APIs (2026). "BAA available?" is the first filter for any healthcare build: sending patient video through an encrypted API without a signed BAA is still a HIPAA violation. Verify current terms with each vendor; offerings change. The full comparison with the cost-at-scale analysis lives in choosing the video layer.

To make the decision repeatable, we condensed it into a one-page decision sheet: the participant-count and endpoint questions that pick the topology, the cost anchors, and the BAA checks that gate any vendor. Download the clinical video topology decision sheet and run it against your sessions before you commit to an architecture.

Where Fora Soft fits in

Fora Soft has built real-time video since 2005 across conferencing, streaming, surveillance, e-learning, and telemedicine, and the topology question is one we answer on every project — usually toward a hybrid. In telemedicine the ordering is compliance first: we place every media server, SFU or MCU, inside the HIPAA boundary with a BAA or self-hosting, then choose the topology per session type so a 1:1 refill consult stays cheap and a ten-person group session stays reliable. We have built on open-source SFUs and on managed APIs, and the right answer depends on your scale, your recording needs, and the weakest endpoint you must serve. If you want your session mix mapped to a topology and a cost before you build, talk to our telemedicine team.

Talk to our telemedicine team — get your session mix mapped to a topology and a cost by engineers who have shipped compliant clinical video: telemedicine architecture review.

See our case studies — telemedicine and real-time video products we have built: our work in telemedicine.

Download the clinical video topology decision sheet — the participant, endpoint, recording, and BAA questions that pick your architecture: get the PDF.

References

  1. Ant Media — Mesh vs SFU vs MCU: Choosing the Right WebRTC Network Topology, https://antmedia.io/webrtc-network-topology/ — order-of-magnitude server-cost and CPU comparison (SFU ≈ $300–500/mo vs MCU ≈ $2,000–5,000/mo at ~100 concurrent; hybrid savings example); checked 2026-06-13. Vendor engineering source. Tier 4.
  2. W3C — WebRTC: Real-Time Communication in Browsers, W3C Recommendation (first published 2021-01-26; current edition 2025-03-13), https://www.w3.org/TR/webrtc/ — the browser real-time media standard underlying every topology. Tier 1.
  3. IETF — RFC 8825 Overview: Real-Time Protocols for Browser-Based Applications (2021); RFC 8834 Media Transport and Use of RTP in WebRTC (2021); RFC 8656 TURN (2020), https://www.rfc-editor.org/rfc/rfc8825 — the protocol suite; RTP usage that SFUs forward; TURN relay in the peer-to-peer path. Tier 1.
  4. Whereby — The Best HIPAA-Compliant Video Call APIs for Telehealth Platforms (2026-05-12), https://whereby.com/blog/best-hipaa-compliant-video-call-apis/ — BAA availability and pricing snapshot across Whereby, Daily, Vonage, Twilio, Zoom SDK, Pexip, CometChat; Twilio group-rooms multi-party and sunset-reversal note. Vendor source; verify terms per vendor. Tier 4.
  5. IETF — RFC 7667 RTP Topologies (2015), https://www.rfc-editor.org/rfc/rfc7667 — the IETF's Informational taxonomy of media-distribution topologies (point-to-point, forwarding middleboxes, mixing middleboxes) that SFU and MCU correspond to. Tier 1.
  6. Telehealth.HHS.gov — Telehealth for behavioral health care (best-practice guide), https://telehealth.hhs.gov/providers/best-practice-guides/telehealth-for-behavioral-health — federal guidance on virtual behavioral-health delivery; context for group-therapy adoption. Tier 1.
  7. Mend — Group Video Scheduling (provider controls: active-speaker highlighting, moderator layouts), https://mend.com/features/group-video-scheduling/ — clinical group-session features that require per-stream control (an SFU property). Vendor source. Tier 4.
  8. 45 CFR §164.312 — HIPAA Security Rule technical safeguards; 45 CFR §164.308(b) — Business Associate contracts, https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164/subpart-C/section-164.312 — the requirement that a server handling PHI be covered by a BAA and the technical safeguards around it; current as of 2026-06-13. Tier 1.
  9. 45 CFR §164.514(a)–(b) — De-identification standard (Safe Harbor / Expert Determination) — referenced for why session media is PHI until de-identified; current as of 2026-06-13. Tier 1.

Where lower-tier vendor sources disagreed with the standards: cost figures from vendor engineering blogs (tier 4) are presented as order-of-magnitude anchors, not precise quotes, and re-verification is flagged for the editor. The PHI-boundary stance follows the conservative reading of 45 CFR §164.308(b) (BAA required for any business associate handling PHI) over the looser "the SFU only forwards encrypted packets, so it is a mere conduit" framing common in vendor material.