Published 2026-06-02 · 20 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

The moment your product lets strangers stream live to each other — a social app, a dating service, a marketplace with live video, a virtual classroom, a telehealth waiting room — you have inherited a safety problem that arrives in real time and cannot be cleaned up after the fact. A harmful frame that reaches a viewer has already done its damage; a piece of illegal material that passes through your servers has already created a legal obligation. This article is for the product manager, founder, or engineering lead who has to scope a moderation feature, set a sane latency and cost budget, and talk to engineers and lawyers without drowning in either's jargon. It explains where moderation runs in a live system, what it actually inspects, what it costs, and which parts the law decides for you rather than leaving to your taste. For the offline, archive-scale version of this problem — moderating millions of already-recorded clips — see the companion UGC moderation pipeline capstone; this article is about the live path.

What "Real-Time Content Moderation" Actually Means

Start with the words, because "moderation" hides two very different jobs that people constantly confuse, and the confusion leads to systems that miss their whole point.

The first job is after-the-fact moderation. A user uploads a video; sometime in the next seconds or minutes a system inspects it and decides whether to publish it. There is slack in the schedule — if the check takes thirty seconds, nobody is harmed, because nothing was visible yet. The second job is real-time moderation, and it has no slack at all. The content is already live: a person is on camera right now, talking to other people right now, and if they expose something harmful, the viewers see it the instant it happens. Your moderation has to detect and act inside the brief window before the harmful moment propagates, which in a live call is a fraction of a second to a couple of seconds. The difference between the two jobs is the difference between a security guard reviewing yesterday's tapes and a security guard watching the monitors live. This article is entirely about the second guard.

That live constraint reshapes everything. You cannot wait for a slow, careful, expensive analysis, because by the time it finishes the harm is done. You cannot ask a human to look at every stream, because you may have thousands at once and humans are slow and few. And you cannot simply block first and check later for everyone, because that would make the product unusable for the overwhelming majority who do nothing wrong. Real-time moderation is therefore always a compromise between speed, cost, and accuracy — and the art is deciding, per kind of harm, where on that triangle to sit.

The SFU Is The One Place That Sees Everything

Now the architecture, and the good news is that you almost certainly already run the piece you need. In a call with more than a couple of participants, the audio and video do not fly directly between everyone. They pass through a media server in the middle called a Selective Forwarding Unit, abbreviated SFU — a router that receives one stream from each participant and forwards the right streams on to everyone else. We describe that server and the browser hooks around it in the WebRTC AI integration article.

The SFU is special for moderation for the same reason it is special for live captions and live translation: it is the single point that sees every participant's media, separated by person, in real time. That makes it the natural home for the moderation pipeline. You tap a stream once at the SFU, run the checks once, and act on the result once — rather than trying to run a classifier inside every participant's browser, where you cannot trust the code and cannot see the other participants. The pattern is identical to the SFU-side fan-out we use for captions: one tap, central processing, one decision distributed back out.

There are three separate things to moderate, and they need three separate checks, because they are different kinds of signal. There is video — the frames, where you look for nudity, sexual activity, violence, weapons, gore, and known illegal imagery. There is audio — the spoken track, where you look for threats, hate speech, sexual harassment, and slurs. And there is text — the chat messages, usually sent over the WebRTC data channel, the standard side-path for arbitrary data between participants defined in IETF RFC 8831, where you look for abuse, scams, doxxing, and links to harmful sites. A complete system runs all three; many teams ship video first because it carries the most severe, least-deniable harms.

A diagram on a white background titled "The SFU is the one place that sees every live stream." On the left, three participant cards labelled "Camera + mic + chat" send arrows into a central rounded box labelled "SFU: the media router you already run." Inside the SFU box, three small taps branch downward to a stacked "Moderation pipeline" panel showing three lanes: "Video frames: nudity, violence, known illegal imagery," "Audio track: threats, hate speech, harassment," and "Text chat (RFC 8831 data channel): abuse, scams, doxxing." A single arrow labelled "one decision" returns from the pipeline back into the SFU, and from the SFU three "forward / block" arrows fan back out to the participants. A footer note reads: tap once, check once, act once; never trust a classifier running inside the participant's browser.

Figure 1. In a group call the moderation pipeline lives at the SFU, the one server that already sees every participant's video, audio, and chat. Tap each stream once and distribute a single decision back out.

The Encryption Fork That Decides Your Whole Design

Before you write a line of pipeline code, there is one architectural fork that quietly settles whether server-side moderation is even possible, and most teams meet it the hard way. It is about encryption.

A normal WebRTC call is encrypted in transit, which means the media is scrambled while it travels over the network but is readable at the SFU, because the SFU terminates the transport encryption on each hop and holds the decrypted packets it forwards. That readability is exactly what lets the SFU moderate. But there is a stronger privacy mode called end-to-end encryption, abbreviated E2EE, in which the media is scrambled by the sender and can only be unscrambled by the receivers, not by the server in the middle. The standard that does this for real-time media is SFrame, published as IETF RFC 9605 in August 2024. SFrame is clever: it encrypts the media frames so the SFU still sees enough metadata to route them, but cannot see the picture or hear the sound. The SFU forwards sealed envelopes without being able to open them.

Here is the fork. The same property that makes E2EE private — the server cannot read the media — makes server-side moderation impossible, because you cannot classify a picture you cannot see. You genuinely cannot have both true end-to-end encryption and server-side content moderation on the same stream; it is a contradiction, not an engineering gap you can close with more effort. This is the single most important sentence in the article, and it is the one most often discovered after the architecture is already built.

You have three honest ways out, and you choose per product. You can run transport-only encryption and moderate at the SFU — the right call for an open social or dating product where safety outranks maximal privacy. You can run true E2EE and move moderation onto the device, where the media is readable, accepting that you now trust client code and cannot see across participants — the call a privacy-first messenger makes. Or you can run a hybrid, where most calls are E2EE but flagged or high-risk sessions drop to server-readable so they can be inspected, with that downgrade disclosed to users. There is no fourth option that gives you both at once. Decide this first, because it determines whether the rest of this article applies to your server or to your client.
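The three ways out can be written down as a small decision function. This is a minimal sketch: the `Encryption` and `Placement` enums and the `downgrade_disclosed` flag are hypothetical names invented for illustration, not part of WebRTC or SFrame.

```python
from enum import Enum


class Encryption(Enum):
    TRANSPORT_ONLY = "transport-only"  # SFU can read the media
    E2EE = "e2ee"                      # SFU only sees sealed frames
    HYBRID = "hybrid"                  # E2EE by default, downgrade when flagged


class Placement(Enum):
    SERVER = "moderate at the SFU"
    DEVICE = "moderate on each device"


def moderation_placement(mode: Encryption,
                         session_flagged: bool = False,
                         downgrade_disclosed: bool = False) -> Placement:
    """Return where moderation can run for one session."""
    if mode is Encryption.TRANSPORT_ONLY:
        return Placement.SERVER
    if mode is Encryption.E2EE:
        return Placement.DEVICE
    # Hybrid: only a flagged session whose downgrade was disclosed to users
    # becomes server-readable; everything else stays on the device.
    if session_flagged and downgrade_disclosed:
        return Placement.SERVER
    return Placement.DEVICE
```

Note that no branch returns both placements for the same stream, which is the point of the fork.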

A decision diagram on a white background titled "You cannot have both E2EE and server-side moderation: pick a path." At the top, a diamond decision node reads "Is the media end-to-end encrypted (SFrame, RFC 9605)?" Two branches descend. The left branch, labelled "No: transport-only encryption," leads to a green box "SFU can read the media → moderate at the server" with a note "best for open social, dating, marketplace, live commerce." The right branch, labelled "Yes: true E2EE," leads to an orange box "SFU sees sealed frames → server moderation impossible" which splits into two outcome cards: "Move moderation onto each device (trust client code, no cross-participant view)" and "Hybrid: drop flagged sessions to server-readable, disclosed to users." A footer note reads: this fork decides whether moderation lives on your server or your client; settle it before anything else.

Figure 2. End-to-end encryption and server-side moderation are mutually exclusive on the same stream. The fork (moderate at the SFU, on the device, or hybrid) is the first decision, not the last.

You Cannot Look At Every Frame — So You Sample

Assume you took the server-readable path. The next reflex most people have is wrong, and the arithmetic shows why. The reflex is to run the moderation classifier on every video frame. A live video runs at perhaps thirty frames per second, abbreviated fps — thirty still pictures every second, shown fast enough to look like motion. Imagine paying a vision classifier to inspect all thirty.

Suppose one image check costs $0.001 — a tenth of a cent, a realistic 2026 price for a hosted moderation API. Run it on every frame of one stream:

30 fps × 60 s × $0.001 = $1.80 per minute, per stream

For one stream that is already painful; for a thousand concurrent streams it is $1,800 a minute, which is absurd for a feature that earns nothing. So you do not inspect every frame. You sample: you inspect a few frames a second and trust that harm visible to a human persists across many frames, so a sample will catch it. Sampling at two frames per second instead of thirty cuts the same bill by a factor of fifteen:

2 fps × 60 s × $0.001 = $0.12 per minute, per stream
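The same arithmetic as a few lines of Python, for readers who want to plug in their own numbers. The function name is an illustrative choice, and the $0.001 price is the article's assumed figure, not a quote from any vendor.

```python
def cost_per_minute(fps: float, price_per_check: float, streams: int = 1) -> float:
    """Dollar cost of checking `fps` frames per second for one minute."""
    return fps * 60 * price_per_check * streams


full_rate = cost_per_minute(fps=30, price_per_check=0.001)
sampled = cost_per_minute(fps=2, price_per_check=0.001)

print(f"every frame:  ${full_rate:.2f} per minute")   # $1.80
print(f"2 fps sample: ${sampled:.2f} per minute")     # $0.12
print(f"1,000 streams at full rate: ${cost_per_minute(30, 0.001, 1000):,.0f} per minute")
```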

Two frames a second is enough because harmful content that a viewer can perceive does not flash by in a single thirtieth of a second; it lingers, and a check twice a second will land on it. You make the sample smarter still by adding the frames that matter most: the keyframes (the periodic complete pictures the video codec already produces) and scene-change frames (moments when the picture shifts a lot, which is when new content appears). Audio gets the same treatment through Voice Activity Detection, abbreviated VAD — cheap software that answers "is anyone speaking right now?" — so the expensive speech check runs only while someone talks, not during silence. The discipline is the same one that keeps captions and translation affordable: detect that there is something worth checking, then spend money checking it.
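One way to express that sampling rule in code is sketched below. It assumes each frame arrives with a timestamp, a keyframe flag from the codec, and a 0-to-1 scene-change score; the `FrameMeta` and `FrameSampler` names, the score itself, and the 0.4 threshold are all hypothetical and would need to be mapped onto whatever your media server exposes.

```python
from dataclasses import dataclass


@dataclass
class FrameMeta:
    timestamp: float     # seconds since the stream started
    is_keyframe: bool    # the codec marked this as a complete picture
    scene_change: float  # 0.0-1.0 difference from the previous frame (assumed metric)


class FrameSampler:
    """Pick frames at a base rate, plus keyframes and large scene changes."""

    def __init__(self, sample_fps: float = 2.0, scene_threshold: float = 0.4):
        self.interval = 1.0 / sample_fps
        self.scene_threshold = scene_threshold
        self.next_due = 0.0

    def should_check(self, frame: FrameMeta) -> bool:
        due = frame.timestamp >= self.next_due
        if due:
            # Schedule the next periodic sample relative to this frame.
            self.next_due = frame.timestamp + self.interval
        return due or frame.is_keyframe or frame.scene_change >= self.scene_threshold


# One second of 30 fps video with no keyframes or scene changes.
sampler = FrameSampler(sample_fps=2.0)
frames = [FrameMeta(i / 30, False, 0.0) for i in range(30)]
picked = [i for i, f in enumerate(frames) if sampler.should_check(f)]
print(picked)  # [0, 15]
```

Keyframes and scene changes add to the periodic samples rather than replacing them, so the bill rises slightly on busy streams and stays at the base rate on static ones.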

The Layered Pipeline: Cheap And Certain Before Expensive And Fuzzy

With sampling deciding when to check, the pipeline decides what to run, and the rule is to order the checks from cheapest-and-most-certain to most-expensive-and-fuzziest, stopping as early as you can.

The first layer is hash matching for known illegal imagery. A hash is a digital fingerprint — a short string of numbers computed from an image. The well-known system, PhotoDNA, built by Microsoft and licensed to Google, Meta, and others, computes a fingerprint that survives small edits like resizing or recoloring, so an image already identified as illegal is recognized even when it has been altered to hide. You compare each sampled frame's fingerprint against a database of fingerprints of known child-sexual-abuse material; a match is fast, cheap, and near-certain, and it does not require any AI judgment about a picture it has never seen. This layer runs first because it is the most reliable and the most legally consequential.
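To make the idea of a fingerprint that survives small edits concrete, here is a toy perceptual hash. It is explicitly not PhotoDNA, whose algorithm and database are restricted; it is a simple average hash over an 8x8 grayscale thumbnail, used only as a stand-in, and the `max_distance` tolerance is an arbitrary illustrative value.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Toy perceptual hash: one bit per pixel, set when brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def matches_known(frame_hash: int, known_hashes: set[int], max_distance: int = 4) -> bool:
    """True when the frame is within `max_distance` bits of any known hash."""
    return any(hamming(frame_hash, k) <= max_distance for k in known_hashes)


# An 8x8 grayscale thumbnail, a uniformly brightened copy, and an inverted copy.
original = [[x * 16 + y * 8 for x in range(8)] for y in range(8)]
brightened = [[p + 10 for p in row] for row in original]
inverted = [[255 - p for p in row] for row in original]

known = {average_hash(original)}
print(matches_known(average_hash(brightened), known))  # True: the edit did not change the hash
print(matches_known(average_hash(inverted), known))    # False: a genuinely different picture
```

The comparison is a lookup, not a judgment, which is why this layer is cheap and near-certain compared with a classifier.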

The second layer is visual classification for harm that has no fingerprint because it is new: nudity, sexual activity, violence, weapons, self-harm, gore. This is the AI vision model (Hive, AWS Rekognition, Azure AI Content Safety, Google Cloud Vision) that looks at a frame and returns categories with confidence scores, for example "explicit nudity: 0.97". It is more expensive and less certain than a hash, so it runs on the sampled frames that passed the hash check. When a frontier multimodal model (a vision-language model, or VLM) is a better fit than a dedicated classifier, as it can be for nuanced or context-dependent calls, the just-use-a-VLM decision applies here too.

The third layer is audio moderation. The spoken track is turned into text by automatic speech recognition and the text is scored for threats, hate speech, and harassment; some engines, like Amazon Transcribe Toxicity Detection, also read acoustic cues such as shouting to catch toxic intent that the words alone might miss. The fourth layer is text-chat moderation: scoring the messages on the data channel for abuse, scams, and doxxing, the cheapest check of all because text is tiny. Laid side by side, the four layers line up like this:

| Media | What you look for | Typical tool | Cost shape | First action |
|---|---|---|---|---|
| Video frame (hash) | Known illegal imagery (CSAM) | PhotoDNA / perceptual hash | Very cheap, deterministic | Block + preserve + report |
| Video frame (classifier) | New nudity, violence, weapons, gore | Hive, Rekognition, Azure, Google | Moderate, per sampled frame | Blur / suspend / review |
| Audio | Threats, hate speech, harassment | ASR + text classifier; Transcribe Toxicity | Moderate, VAD-gated | Mute / warn / review |
| Text chat | Abuse, scams, doxxing, bad links | Text moderation API | Very cheap | Hide / warn / block |
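The ordering rule, cheapest and most certain first with an early stop, might look like this in outline. The layer functions are stubs that read precomputed values from a dict; in a real system each would call a hash matcher or a classifier. All names here (`Finding`, `hash_layer`, `visual_layer`, `run_layers`) are hypothetical.

```python
from typing import Callable, NamedTuple, Optional


class Finding(NamedTuple):
    layer: str
    category: str
    confidence: float


# Each layer takes a sample (a plain dict here) and returns a Finding or None.
Layer = Callable[[dict], Optional[Finding]]


def hash_layer(sample: dict) -> Optional[Finding]:
    # Stub: stands in for a perceptual-hash lookup against a restricted database.
    if sample.get("hash_hit"):
        return Finding("hash", "known_illegal_imagery", 1.0)
    return None


def visual_layer(sample: dict) -> Optional[Finding]:
    # Stub: stands in for a hosted or self-hosted vision classifier.
    score = sample.get("nudity_score", 0.0)
    return Finding("visual", "explicit_nudity", score) if score >= 0.70 else None


def run_layers(sample: dict, layers: list[Layer]) -> Optional[Finding]:
    """Run layers in order and stop at the first one that fires."""
    for layer in layers:
        finding = layer(sample)
        if finding is not None:
            return finding
    return None


VIDEO_LAYERS = [hash_layer, visual_layer]  # cheapest and most certain first
```

Because `run_layers` returns on the first finding, a frame that hits the hash layer never incurs the cost of the classifier.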

A left-to-right pipeline diagram on a white background titled "The layered moderation pipeline: cheap and certain before expensive and fuzzy." On the left, a "Sampled input" card lists "2 fps frames + keyframes + scene changes" and "VAD-gated audio." Four processing boxes follow in sequence, each tinted differently and labelled with its cost: "1. Hash match (known CSAM): cheapest, near-certain," "2. Visual classifier (new nudity / violence): moderate," "3. Audio toxicity (ASR + text): moderate," "4. Text-chat classifier: cheapest." Each box emits a small "confidence score" chip. All four feed a right-hand decision box labelled "Confidence → action: allow · blur/mute · human review · block · report." A footer note reads: stop at the earliest layer that fires; a hash hit never needs the classifier.

Figure 3. Order the checks cheapest-and-most-certain first. A fingerprint match for known illegal imagery is fast and decisive; the fuzzy AI classifier only runs on frames that survive it.

Child-Sexual-Abuse Material Is A Category Apart

Most moderation is a matter of policy and taste — your product decides how much skin or profanity it tolerates. One category is not, and you design for it before any other, because the law has already decided.

Child-sexual-abuse material, abbreviated CSAM, is illegal everywhere and is governed by rules that override your preferences. In the United States, once a provider becomes aware of apparent CSAM on its service, federal law — 18 U.S.C. § 2258A — requires it to report to the CyberTipline run by the National Center for Missing and Exploited Children, abbreviated NCMEC, the central intake point that forwards reports to law enforcement. Two engineering consequences follow that are easy to get wrong. First, you may not simply delete the material when you find it; you must preserve it and the associated data as evidence for a period the statute sets, so your "block" action for this category writes to a sealed evidence store, not to the trash. Second, the law requires you to report what you become aware of — it does not force you to go searching, and there is no general-monitoring mandate — but the moment your hash matcher fires, you are aware, and the clock starts.

This is why hash matching sits first in the pipeline and why its "block" action is different from every other block. A nudity classifier firing means hide this and maybe review it; a CSAM hash firing means stop the stream, seal the evidence, and file a report. Build that path deliberately, with your legal team, on day one. Bolting it on after launch is how products end up either breaking the law by deleting evidence or breaking trust by mishandling the most sensitive data they will ever touch. The detection technology — the perceptual hash — is well understood; the obligation around it is the part that must be engineered with care.
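The shape of that fixed path, stop, seal, report, can be sketched as follows. Everything here is illustrative: `SealedEvidenceStore`, `handle_hash_hit`, and the `stop_stream` and `queue_report` callbacks are hypothetical names, the in-memory list stands in for real restricted storage, and the actual retention period and reporting workflow are for your legal team to define.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class SealedEvidenceStore:
    """Append-only store: records can be added but there is no delete method."""
    _records: list = field(default_factory=list)

    def seal(self, stream_id: str, payload: bytes) -> int:
        self._records.append({
            "stream_id": stream_id,
            "payload": payload,
            "sealed_at": datetime.now(timezone.utc).isoformat(),
        })
        return len(self._records) - 1  # evidence reference id

    def __len__(self) -> int:
        return len(self._records)


def handle_hash_hit(stream_id: str, frame: bytes, store: SealedEvidenceStore,
                    stop_stream: Callable[[str], None],
                    queue_report: Callable[[str, int], None]) -> int:
    """Fixed path for a known-imagery hash match: stop, seal, report."""
    stop_stream(stream_id)                      # 1. cut the live stream
    evidence_id = store.seal(stream_id, frame)  # 2. preserve; never send to the trash
    queue_report(stream_id, evidence_id)        # 3. hand off to the reporting workflow
    return evidence_id
```

The design choice worth noting is that this handler takes no confidence score and has no branch that skips preservation or reporting.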

From A Confidence Score To An Action

A classifier does not return "bad." It returns a number — a confidence score between zero and one — and your system's real intelligence is the ladder that turns that number into a proportionate action. Getting the ladder wrong is how you either drive away innocent users or let harm through.

The logic is threshold-based. Above a high confidence — say 0.95 — you act automatically and immediately: blur the video, mute the audio, or suspend the stream, because being nearly certain justifies acting without a human. In a middle band — say 0.70 to 0.95 — you do something reversible and route the moment to a human review queue, because the machine is suspicious but not sure, and a person should make the call before you punish a user. Below the low threshold you allow it, logging the score so you can tune later. The two ways to be wrong pull in opposite directions: a false positive blurs or kicks someone who did nothing wrong and makes your product feel hostile, while a false negative lets real harm reach viewers. You cannot drive both to zero at once — pushing the threshold down catches more harm but punishes more innocents — so you set the balance per category. You are strict with severe, irreversible harms and lenient with borderline ones, and you keep a human in the loop wherever the score is uncertain and the cost of a mistake is high.
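A minimal sketch of that ladder follows. The default thresholds mirror the 0.70 and 0.95 examples above; the per-category overrides and the category names are invented purely to show how the balance can differ by severity.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow and log the score"
    REVIEW = "reversible action + human review"
    AUTO = "act automatically"


# (review_threshold, auto_threshold) per category.
# Only "default" comes from the text; the other rows are illustrative assumptions.
THRESHOLDS = {
    "default": (0.70, 0.95),
    "graphic_violence": (0.50, 0.90),  # stricter: severe harm
    "profanity": (0.85, 0.99),         # more lenient: borderline harm
}


def decide(category: str, confidence: float) -> Action:
    """Map a classifier score to a proportionate action for its category."""
    review_at, auto_at = THRESHOLDS.get(category, THRESHOLDS["default"])
    if confidence >= auto_at:
        return Action.AUTO
    if confidence >= review_at:
        return Action.REVIEW
    return Action.ALLOW


print(decide("explicit_nudity", 0.97).value)   # act automatically
print(decide("explicit_nudity", 0.80).value)   # reversible action + human review
print(decide("graphic_violence", 0.60).value)  # reversible action + human review
print(decide("profanity", 0.80).value)         # allow and log the score
```

Keeping the thresholds in a table rather than in code makes the per-category tuning a policy change instead of a deploy.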

A diagram on a white background titled "From a confidence score to a proportionate action." A vertical scale on the left runs from 0.0 at the bottom to 1.0 at the top, marked with two thresholds. Three horizontal bands sit against the scale. The top band (0.95–1.0), tinted green-to-act, is labelled "High confidence → act automatically: blur, mute, suspend." The middle band (0.70–0.95), tinted amber, is labelled "Uncertain → reversible action + human review queue." The bottom band (0.0–0.70), tinted neutral, is labelled "Low → allow, log the score for tuning." To the right, two opposing arrows illustrate the trade-off: one labelled "lower the threshold → catch more harm, punish more innocents (false positives)" and one labelled "raise the threshold → fewer false alarms, more harm slips through (false negatives)." A separate red side-path from the top connects to a box "CSAM hash hit → not a threshold: stop, seal evidence, report to NCMEC." A footer note reads: set the balance per category, strict on severe harm, a human wherever it is uncertain.

Figure 4. A score is not a decision. High confidence acts automatically, the uncertain middle goes to a human, and known illegal imagery skips the ladder entirely for a fixed legal path.

The Latency And Cost Budget, Out Loud

Two numbers decide whether the design is shippable: how much delay moderation adds, and how much it costs. Both are budgets you set on purpose.

Latency first. For the moderation to prevent harm rather than merely document it, the decision has to land before the harmful frame reaches viewers. The lever that buys you the time is the same one streaming platforms already use: a small broadcast delay. By holding the live stream back by a second or two before it reaches the audience — the modern echo of television's "seven-second delay" — you give the pipeline a window to sample a frame, score it, and act while the content is still in the buffer rather than on the viewers' screens. In a tight conferencing call where even a second of delay hurts the conversation, you cannot buy that window, so you accept that moderation is reactive — it cuts the offender off a beat after they start, rather than before — and you lean harder on fast audio and after-the-fact suspension. The whole-call version of this timing discipline is the subject of the sub-100-millisecond latency budget article.
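The broadcast-delay window can be pictured as a small buffer that holds each frame until it is old enough to release, giving the pipeline time to veto it. The sketch below is an assumption-laden toy: `BroadcastDelayBuffer` is a hypothetical class, frames are represented by integer ids, and a blocked frame is simply dropped, whereas a real system would more likely substitute a blurred or placeholder frame.

```python
from collections import deque


class BroadcastDelayBuffer:
    """Hold frames for `delay_s` seconds so a verdict can arrive before release."""

    def __init__(self, delay_s: float = 2.0):
        self.delay_s = delay_s
        self._frames = deque()  # (timestamp, frame_id) in arrival order
        self._blocked = set()   # frame ids the pipeline flagged

    def push(self, timestamp: float, frame_id: int) -> None:
        self._frames.append((timestamp, frame_id))

    def block(self, frame_id: int) -> None:
        self._blocked.add(frame_id)

    def release(self, now: float) -> list:
        """Return ids of frames old enough to forward, skipping blocked ones."""
        out = []
        while self._frames and now - self._frames[0][0] >= self.delay_s:
            _, frame_id = self._frames.popleft()
            if frame_id in self._blocked:
                self._blocked.discard(frame_id)
            else:
                out.append(frame_id)
        return out


buf = BroadcastDelayBuffer(delay_s=2.0)
buf.push(0.0, 1)
buf.push(0.5, 2)
buf.push(1.0, 3)
buf.block(2)             # the verdict for frame 2 arrived inside the window
print(buf.release(1.9))  # []  (nothing is two seconds old yet)
print(buf.release(2.5))  # [1] (frame 2 was due but blocked)
print(buf.release(3.0))  # [3]
```

With `delay_s` set to zero the buffer releases frames immediately, which is the reactive mode a tight conferencing call has to accept.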

Cost second, and it is the sampling math from earlier carried to the fleet. Take a platform with five hundred concurrent live streams, moderating video at two sampled frames per second at $0.001 a frame, with audio and text adding roughly half again as much:

video = 500 streams × 2 fps × 60 s × $0.001      = $60.00 per minute
audio + text ≈ 0.5 × video                        = $30.00 per minute
total ≈ $90 per minute ≈ $5,400 per hour of peak concurrency

That number is a design output, not a fact of nature, and three levers move it. Lowering the sample rate cuts it linearly but risks missing brief harms. Running the cheap hash and text layers on everything while reserving the expensive visual classifier for streams that some signal already flagged cuts it sharply. And self-hosting an open visual classifier instead of paying per call flips the cost from per-frame to fixed infrastructure, which pays back above a volume that compliance, not arithmetic, usually decides. The point of writing the budget out loud is that "moderate everything, always, at full frame rate" is not a plan — it is a way to discover, in production, that safety can cost more than the product earns.
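The fleet budget and two of its levers can be expressed as one function. This is a sketch under stated assumptions: audio and text are modelled as a flat $0.06 per stream per minute (half of the $0.12 video figure), and `classified_fraction`, the share of streams that reach the paid visual classifier, is a hypothetical parameter introduced here to model the gating lever.

```python
def fleet_cost_per_minute(streams: int,
                          sample_fps: float = 2.0,
                          price_per_frame: float = 0.001,
                          audio_text_per_stream: float = 0.06,
                          classified_fraction: float = 1.0) -> float:
    """Per-minute spend across the fleet.

    classified_fraction is the share of streams sent to the paid visual
    classifier (1.0 = every stream, as in the baseline above).
    """
    video = streams * classified_fraction * sample_fps * 60 * price_per_frame
    audio_text = streams * audio_text_per_stream
    return video + audio_text


baseline = fleet_cost_per_minute(500)
half_rate = fleet_cost_per_minute(500, sample_fps=1.0)
gated = fleet_cost_per_minute(500, classified_fraction=0.2)

print(f"baseline:            ${baseline:.2f}/min, ${baseline * 60:,.0f}/hour")
print(f"1 fps sampling:      ${half_rate:.2f}/min")
print(f"classifier on 20%:   ${gated:.2f}/min")
```

Under these assumptions the baseline reproduces the $90 per minute figure, halving the sample rate brings it to $60, and gating the classifier to a fifth of streams brings it to $42.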

A Common Mistake: Moderating The Recording, Not The Live Stream

The most damaging error in this area is subtle because the system looks like it works. A team wires their moderation to the place it is easiest to wire it — the recording pipeline, which already saves every session to storage and is a tidy, file-shaped thing to hand a classifier. The dashboard fills with flags. Everyone relaxes.

But the recording is written after the live stream has already played to the audience. Moderating it catches harms an hour too late to stop them — it produces a report, not a defense. The viewers already saw the harmful frame; the only thing the late check changes is that you now have a record of the harm you failed to prevent. The fix is the architecture this whole article describes: tap the live media at the SFU, in transit, and act inside the broadcast-delay window — not on the file that lands afterward. Keep the recording check too, as a slower and more thorough backstop that can use heavier models and human reviewers, but never mistake it for real-time moderation. A safety system that always arrives after the harm is not a safety system; it is an archive of your failures.

Build With A Framework, Buy The Classifiers, Staff The Queue

Once the pattern is settled, the practical question is what you assemble it from, and as with the rest of this stack the layers are independent.

The media layer is the call itself. You can run an open-source SFU — mediasoup, Janus, LiveKit's server — and own the frame tap and the block action yourself, reading frames out of the stream with the browser's Encoded Transform and Insertable Streams hooks that we cover in the WebRTC AI integration article, or use a hosted real-time platform that exposes participant media to a server-side agent with less plumbing. LiveKit in particular treats a server-side agent as a first-class citizen, which is why it shows up in production moderation builds.

The classifier layer is where you almost always buy or self-host rather than train from scratch. The hosted options — Hive, AWS Rekognition, Azure AI Content Safety, Google Cloud Vision, OpenAI's free moderation endpoint for text and images — differ less in raw capability than in which cloud you already live in and how they price; specialist vendors like Hive trade higher per-call cost for higher precision on the hardest visual categories. For CSAM specifically you do not shop on a marketplace: you apply to Microsoft for PhotoDNA or work with NCMEC and the Tech Coalition, because the fingerprint database is restricted by design. The support layer is the part teams under-budget and then regret: the human review queue and the people to staff it, the sealed evidence store for the legal category, an audit log of every decision (which the EU Digital Services Act expects you to be able to explain), and an appeals path so a user wrongly blocked can get a human to look again. The pipeline is the spine; the queue, the log, and the appeal are what make it lawful and humane.
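For the audit log, one plausible shape is an immutable record per decision, serialized as a JSON line. The field names below are hypothetical and are not drawn from the Digital Services Act or any vendor schema; they only illustrate what "able to explain every decision" tends to require in practice.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ModerationDecision:
    decision_id: str
    stream_id: str
    layer: str        # e.g. "visual", "audio", "text"
    category: str
    confidence: float
    action: str       # e.g. "blur", "mute", "review", "allow"
    decided_by: str   # "auto" or a reviewer id
    decided_at: str   # ISO 8601 timestamp, UTC
    appeal_of: Optional[str] = None  # decision_id being appealed, if any


def to_log_line(decision: ModerationDecision) -> str:
    """Serialize one decision as a JSON line for an append-only audit log."""
    return json.dumps(asdict(decision), sort_keys=True)


auto = ModerationDecision(
    decision_id="d-1", stream_id="s-42", layer="visual",
    category="explicit_nudity", confidence=0.97, action="blur",
    decided_by="auto", decided_at=datetime.now(timezone.utc).isoformat(),
)
# An appeal is logged as a new record that points back at the original.
appeal = ModerationDecision(
    decision_id="d-2", stream_id="s-42", layer="visual",
    category="explicit_nudity", confidence=0.97, action="allow",
    decided_by="reviewer-7", decided_at=datetime.now(timezone.utc).isoformat(),
    appeal_of="d-1",
)
print(to_log_line(auto))
print(to_log_line(appeal))
```

Recording an appeal as a new entry rather than editing the old one keeps the history intact for whoever has to read it later.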

Where Fora Soft Fits In

We build the live-video products where moderation is not optional — social and dating apps where strangers meet on camera, marketplaces and live-commerce platforms with open video, e-learning and telemedicine systems where a duty of care is explicit — and we build moderation the way this article argues for it: at the SFU you already run, tapping video, audio, and chat once, gating the spend with sampling and voice activity so cost tracks risk rather than seats. Because we work across surveillance, conferencing, and regulated verticals, we treat the encryption fork as a first-class design decision rather than a surprise, and we plan the unglamorous parts — the human review queue, the sealed evidence path for illegal material, the audit log a regulator can read — from the first sprint instead of after the first incident. The detection models change every year; the architecture that decides where they run, what they cost, and who reviews their mistakes is the part worth getting right once.

References

  1. IETF. RFC 9605 — Secure Frame (SFrame): Lightweight Authenticated Encryption for Real-Time Media, Proposed Standard, August 2024, accessed 2026-06-02. https://www.rfc-editor.org/rfc/rfc9605. Primary standards source for end-to-end encryption of WebRTC media frames: the SFU sees routing metadata but not media content. The controlling document for the article's central "E2EE and server-side moderation are mutually exclusive" fork.
  2. IETF. RFC 8831 — WebRTC Data Channels, Standards Track, January 2021, accessed 2026-06-02. https://www.rfc-editor.org/rfc/rfc8831. Primary standards source for the transport that carries text chat between participants — the channel the text-moderation layer inspects.
  3. IETF. RFC 6716 — Definition of the Opus Audio Codec, Standards Track, September 2012, accessed 2026-06-02. https://www.rfc-editor.org/rfc/rfc6716. Primary standards source for the audio codec carrying the spoken track that the audio-moderation layer taps and feeds to ASR.
  4. European Union. Regulation (EU) 2022/2065 — Digital Services Act (DSA), EUR-Lex, in force, generally applicable from 17 February 2024, accessed 2026-06-02. https://eur-lex.europa.eu/eli/reg/2022/2065/oj. Primary legal source for the EU notice-and-action, statement-of-reasons, transparency-reporting, and audit-log obligations referenced in the support-layer and regulation discussion.
  5. United States Code. 18 U.S.C. § 2258A — Reporting requirements of providers, Office of the Law Revision Counsel, accessed 2026-06-02. https://uscode.house.gov/view.xhtml?req=(title:18%20section:2258A). Primary legal source for the U.S. obligation to report apparent CSAM to the NCMEC CyberTipline on awareness, and the evidence-preservation requirement that shapes the CSAM "block" action.
  6. National Center for Missing & Exploited Children (NCMEC). CyberTipline, accessed 2026-06-02. https://www.missingkids.org/gethelpnow/cybertipline. Authoritative source for the central CSAM-reporting intake that forwards provider reports to law enforcement — the destination of the legal path in Figure 4.
  7. Microsoft. PhotoDNA, accessed 2026-06-02. https://www.microsoft.com/en-us/photodna. Vendor source for the perceptual-hash fingerprinting that survives small edits and underpins the first, deterministic layer of the pipeline; licensed to Google, Meta, and others.
  8. Amazon Web Services. Detecting toxic speech — Amazon Transcribe Toxicity Detection, AWS Documentation, accessed 2026-06-02. https://docs.aws.amazon.com/transcribe/latest/dg/toxicity.html. Vendor source for audio moderation that combines transcribed text with acoustic cues (tone, pitch, shouting) across seven toxicity categories, cited in the audio layer.
  9. Amazon Web Services. Moderating content — Amazon Rekognition Content Moderation, AWS Documentation, accessed 2026-06-02. https://docs.aws.amazon.com/rekognition/latest/dg/moderation.html. Vendor source for image/video moderation labels and confidence scores, and the per-image pricing shape used in the cost arithmetic.
  10. Microsoft. Azure AI Content Safety — harm categories and severity levels, Microsoft Learn, accessed 2026-06-02. https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/harm-categories. Vendor source for the four-category (hate, sexual, violence, self-harm) text-and-image model with configurable severity thresholds, cited in the classifier layer and the confidence-to-action ladder.
  11. Hive. Moderation — visual, audio, text, and livestream moderation, accessed 2026-06-02. https://thehive.ai/. Vendor source for the specialist real-time moderation platform covering image, video, audio, and live-stream, used as the higher-precision (higher-cost) option in the build-vs-buy discussion.
  12. Bhuiyan, Lu, et al. Toward Accessible and Safe Live Streaming Using Distributed Content Filtering with MoQ, arXiv:2505.08990, 2025, accessed 2026-06-02. https://arxiv.org/abs/2505.08990. Academic source for real-time live-stream content filtering under latency constraints, supporting the sampling-and-broadcast-delay framing of the latency budget.