Why this matters
If you build live video — sports, concerts, news, houses of worship, or a remote-guest interview platform — the audio path is where "we'll fix it later" turns into a broken broadcast. The choices made in the first few seconds of the signal chain decide whether your channels stay synchronized, how much network you need, and how much delay piles up before the audience hears a word. This article is for product managers and engineers who have to specify a live pipeline and talk to broadcast vendors without guessing. You will finish able to tell contribution from distribution, read an AES67 or ST 2110-30 spec sheet, and pick the right codec for each leg of the journey.
Two journeys, not one: contribution vs distribution
The single idea that unlocks this whole topic is that live audio takes two separate trips, and they have opposite priorities.
The first trip is contribution. This is the move from where the sound is captured — a stadium, a concert hall, a remote reporter's kit — back to a production hub or studio. Contribution audio is raw material. It will still be mixed, leveled, dubbed into other languages, and combined with graphics. Because it gets re-encoded later, you keep it at the highest quality you can afford and you keep delay tiny, so the production team can react in real time. Think of contribution as shipping fresh ingredients to a kitchen: you do not vacuum-seal and freeze them, because the chef still has to cook.
The second trip is distribution. This is the move from the finished studio mix out to the audience over the internet (HLS, DASH) or broadcast. Distribution audio is the finished dish. It is compressed hard, because it goes to millions of devices and bandwidth is the dominant cost, and a few seconds of delay is acceptable.
The rest of this article is about the contribution leg, because that is the part most teams underestimate. The codecs, the clocks, and the standards are all different from what you use on the distribution side.
Figure 1. The two journeys of live audio. Contribution favors quality and low delay; distribution favors low bandwidth and scale.
Inside the building: uncompressed audio over IP
Twenty years ago, audio inside a studio moved over dedicated cables — one physical wire per channel, using formats like AES3. Today it moves as data packets over the same Ethernet network that carries everything else. That shift created a need for shared rules so a microphone preamp from one vendor and a mixing console from another could exchange audio without translation. Two such rule sets dominate, and they are deliberately compatible.
AES67: the common language for audio over IP
AES67 is a standard from the Audio Engineering Society, first published in 2013 and revised most recently as AES67-2023. It is not a product and not a network — it is an agreement on how to put professional audio into network packets so that gear from different makers (and competing ecosystems like Dante, Ravenna, and Livewire) can interoperate.
AES67 carries uncompressed audio. There is no codec in the usual sense; the audio is linear PCM — the same raw, sample-by-sample format described in our digital-audio primer — sent as 16-bit (called L16) or 24-bit (L24) samples over RTP, the Real-time Transport Protocol, on top of UDP. The baseline for compliance is 24-bit audio at a 48 kHz sample rate, with 44.1, 88.2, and 96 kHz also supported.
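To make "linear PCM as L24 samples" concrete, here is a minimal sketch of packing signed samples into 3-byte words. It assumes the network-byte-order (big-endian) layout used for L24 RTP payloads; the function names `pack_l24` and `unpack_l24` are illustrative and not part of any AES67 API.

```python
def pack_l24(samples):
    """Pack signed integer samples into 24-bit big-endian words (3 bytes each)."""
    out = bytearray()
    for s in samples:
        if not -(1 << 23) <= s < (1 << 23):
            raise ValueError(f"sample {s} does not fit in 24 bits")
        out += s.to_bytes(3, "big", signed=True)
    return bytes(out)


def unpack_l24(payload):
    """Inverse of pack_l24: split the payload into 3-byte signed samples."""
    if len(payload) % 3:
        raise ValueError("L24 payload length must be a multiple of 3 bytes")
    return [
        int.from_bytes(payload[i:i + 3], "big", signed=True)
        for i in range(0, len(payload), 3)
    ]


frame = [1, -1, 8_388_607, -8_388_608]  # smallest steps and full-scale values
payload = pack_l24(frame)
print(payload.hex())        # 000001ffffff7fffff800000
print(unpack_l24(payload))  # [1, -1, 8388607, -8388608]
```

In a multichannel stream the samples would be interleaved channel by channel before packing; that step is left out here to keep the sketch short.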
The defining detail of AES67 is how audio is chopped into packets. Each packet carries a fixed slice of time called the packet time. The default packet time is 1 millisecond, which at 48 kHz is exactly 48 samples per packet. Shorter packet times (down to 125 microseconds, or 6 samples at 48 kHz) cut latency at the cost of more packets per second and more network overhead. The RTP payload is capped at 1460 bytes so a packet fits inside a standard 1500-byte Ethernet frame without fragmenting.
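The packet-time trade-off is easy to check with a few lines of arithmetic. The sketch below assumes the payload cap quoted above and uses hypothetical helper names; the 250 µs case is just one intermediate packet time chosen for illustration.

```python
PAYLOAD_CAP_BYTES = 1460  # payload cap as quoted in the article


def packet_shape(sample_rate_hz, packet_time_us, bytes_per_sample, channels):
    """Return (samples per packet, payload bytes, packets per second)."""
    samples_per_packet = sample_rate_hz * packet_time_us // 1_000_000
    payload_bytes = samples_per_packet * bytes_per_sample * channels
    packets_per_second = 1_000_000 // packet_time_us
    return samples_per_packet, payload_bytes, packets_per_second


def max_channels(sample_rate_hz, packet_time_us, bytes_per_sample):
    """How many channels fit in one packet under the payload cap."""
    samples_per_packet = sample_rate_hz * packet_time_us // 1_000_000
    return PAYLOAD_CAP_BYTES // (samples_per_packet * bytes_per_sample)


print(packet_shape(48_000, 1_000, 3, 2))  # (48, 288, 1000)
print(packet_shape(48_000, 250, 3, 2))    # (12, 72, 4000)
print(max_channels(48_000, 1_000, 3))     # 10
```

Quartering the packet time quarters the audio held in each packet but quadruples the packet rate, which is where the extra header overhead comes from.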
The shared clock: PTP and why it is the whole game
Here is the part that separates audio-over-IP from ordinary networking. Every device on an AES67 network derives its sense of time from one shared "wall clock" distributed across the network using the Precision Time Protocol (PTP), defined in IEEE 1588-2008. A media clock for a 48 kHz stream advances exactly 48,000 samples for every second that ticks on this shared clock.
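As a rough illustration of how a media clock follows the shared wall clock, the sketch below converts a PTP time in nanoseconds into a sample count and a 32-bit RTP timestamp. It assumes a zero offset between the media clock and the RTP timestamp, which is a simplification; the function names are hypothetical.

```python
RTP_TIMESTAMP_MODULUS = 1 << 32  # RTP timestamps are 32-bit and wrap around


def media_clock_samples(ptp_time_ns, sample_rate_hz=48_000):
    """Samples elapsed on the media clock at a given PTP time."""
    return ptp_time_ns * sample_rate_hz // 1_000_000_000


def rtp_timestamp(ptp_time_ns, sample_rate_hz=48_000):
    """RTP timestamp, assuming the media clock has zero offset from PTP time."""
    return media_clock_samples(ptp_time_ns, sample_rate_hz) % RTP_TIMESTAMP_MODULUS


print(media_clock_samples(1_500_000_000))     # 72000 samples after 1.5 s
print(rtp_timestamp(89_479 * 1_000_000_000))  # 24704, after one wrap of the 32-bit counter
```

Because every device computes the same number from the same clock, two senders that sample at the same instant stamp their packets identically, and a receiver can line them up.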
Why does this matter so much? Because a live mix combines dozens of sources. If the microphone on the left side of the stage and the microphone on the right run on slightly different clocks, their audio drifts apart over minutes, and the mix smears. PTP gives every device the same heartbeat, so 48,000 samples always means the same one second everywhere. On a small gigabit network, a typical AES67 setup (48 samples per packet, 24-bit, 48 kHz) reaches end-to-end latency of about 2–3 milliseconds.
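To get a feel for how quickly unsynchronized devices drift apart, here is a small sketch. The 20 ppm offset between two free-running clocks is a hypothetical figure chosen for illustration, not a measured value.

```python
def drift_samples(sample_rate_hz, clock_offset_ppm, elapsed_s):
    """Samples of drift accumulated between two clocks that differ by a ppm offset."""
    return sample_rate_hz * clock_offset_ppm * elapsed_s / 1_000_000


def drift_ms(sample_rate_hz, clock_offset_ppm, elapsed_s):
    """The same drift expressed in milliseconds of audio."""
    samples = drift_samples(sample_rate_hz, clock_offset_ppm, elapsed_s)
    return samples * 1000 / sample_rate_hz


# Two 48 kHz devices whose clocks differ by 20 ppm, ten minutes into a show.
print(drift_samples(48_000, 20, 600))  # 576.0 samples
print(drift_ms(48_000, 20, 600))       # 12.0 ms
```

Twelve milliseconds between two microphones on the same stage is enough to cause audible comb filtering in a mix, and the gap keeps growing for as long as the clocks run free.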
Common mistake: forgetting the clock. Teams new to audio over IP buy compatible AES67 gear, plug it in, and hear clicks, dropouts, or slow drift. The cause is almost always PTP: no grandmaster clock has been elected, or two devices think they are the master, or a non-PTP-aware switch is mangling the timing packets. AES67 interoperability is a clocking problem first and a codec problem never. Budget for PTP-capable switches and design the clock topology before you buy a single console.
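Grandmaster election can be sketched as picking the best candidate by an ordered list of advertised attributes, where lower values win. This is a simplified model of the comparison used by PTP's best master clock algorithm; it ignores topology details such as steps removed, and the device names and attribute values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockAnnounce:
    name: str                        # hypothetical device label
    priority1: int
    clock_class: int
    clock_accuracy: int
    offset_scaled_log_variance: int
    priority2: int
    clock_identity: int              # unique per device, used as the tie-breaker

    def rank(self):
        # Attributes are compared in this order; the lowest tuple wins.
        return (
            self.priority1,
            self.clock_class,
            self.clock_accuracy,
            self.offset_scaled_log_variance,
            self.priority2,
            self.clock_identity,
        )


def elect_grandmaster(candidates):
    """Pick the candidate with the best (lowest) ranking tuple."""
    if not candidates:
        raise ValueError("no PTP-capable clock is announcing itself")
    return min(candidates, key=ClockAnnounce.rank)


candidates = [
    ClockAnnounce("console", 128, 248, 0xFE, 0xFFFF, 128, 0x02),
    ClockAnnounce("gps-grandmaster", 128, 6, 0x21, 0x4E5D, 128, 0x01),
    ClockAnnounce("stagebox", 254, 248, 0xFE, 0xFFFF, 128, 0x03),
]
print(elect_grandmaster(candidates).name)  # gps-grandmaster
```

The design point for a real plant is the first field: setting the priority deliberately on the intended grandmaster and its backup keeps an arbitrary console from winning the election by accident.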
SMPTE ST 2110-30: AES67 with broadcast guardrails
If AES67 is the common language, SMPTE ST 2110-30 is the broadcast dialect. ST 2110 is the family of standards for sending separate video, audio, and metadata "essence" streams over IP in a professional facility. Its audio part, ST 2110-30, transports PCM digital audio by directly referencing AES67 — then narrows the options so that broadcast equipment behaves predictably.
It does this with conformance levels. Rather than leaving packet time and channel count fully open, ST 2110-30 defines named tiers. The mandatory one is Level A: 48 kHz, 16- or 24-bit, 1 to 8 channels, with a 1 ms packet time. Level B adds a 125 µs packet time for the same 1 to 8 channels, and Level C raises the limit to 64 channels at 125 µs. The "X" variants (AX, BX, CX) add 96 kHz operation. Every level includes Level A, so a device that claims Level C must also support Level A, and there is always a common baseline two pieces of gear can fall back to.
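The Level A baseline is simple enough to express as a check. The sketch below encodes only the Level A limits stated above; the `AudioStream` type and `fits_level_a` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioStream:
    sample_rate_hz: int
    bit_depth: int
    channels: int
    packet_time_us: int


def fits_level_a(stream):
    """True if the stream stays inside the mandatory Level A envelope."""
    return (
        stream.sample_rate_hz == 48_000
        and stream.bit_depth in (16, 24)
        and 1 <= stream.channels <= 8
        and stream.packet_time_us == 1_000
    )


print(fits_level_a(AudioStream(48_000, 24, 8, 1_000)))   # True
print(fits_level_a(AudioStream(48_000, 24, 16, 1_000)))  # False: too many channels
print(fits_level_a(AudioStream(48_000, 24, 2, 125)))     # False: packet time is not 1 ms
```

A check like this is useful when planning streams: anything that passes can be received by every compliant device in the plant, whatever higher levels that device also supports.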
There is a companion standard worth knowing: ST 2110-31 carries AES3 audio bit-transparently. That matters when a stream is not plain PCM — for example, a Dolby E or coded surround signal that must pass through the facility untouched. Use -30 for PCM; use -31 when you need to tunnel a pre-encoded audio signal without disturbing it.
| Property | AES67 | SMPTE ST 2110-30 (Level A) |
|---|---|---|
| Audio format | Uncompressed PCM (L16 / L24) | Uncompressed PCM (16 / 24-bit) |
| Sample rates | 44.1 / 48 / 88.2 / 96 kHz | 48 kHz (96 kHz in "X" levels) |
| Transport | RTP over UDP | RTP over UDP (per AES67) |
| Clock | PTP (IEEE 1588-2008) | PTP (IEEE 1588-2008) |
| Packet time | 125 µs – 4 ms (1 ms default) | 1 ms (125 µs in Levels B / C) |
| Channels per stream | Flexible | 1–8 (Levels A / B); up to 64 in Level C |
| Designed for | Vendor interoperability | Predictable broadcast deployment |
The practical takeaway: AES67 makes two devices able to talk; ST 2110-30 makes a hundred devices behave the same way in a real broadcast plant. Most professional audio gear sold today supports both.
Figure 2. An IP audio facility: a PTP grandmaster feeds the shared clock; sources and the mixer exchange ST 2110-30 PCM streams over the same network.
The bandwidth math: uncompressed is heavy
Uncompressed audio inside a facility is cheap on a 1- or 10-gigabit LAN, but the numbers are worth seeing so you understand why you cannot send it over the open internet. Bandwidth for raw PCM is simply sample rate times bit depth times channel count:
bandwidth (bytes/s) = sample_rate (Hz) × bit_depth (bytes per sample) × channels; multiply by 8 for bits/s
A stereo pair at 48 kHz, 24-bit (3 bytes per sample):
48,000 × 3 × 2 = 288,000 bytes/s = 2.304 Mbps
An 8-channel ST 2110-30 stream at the same quality:
48,000 × 3 × 8 = 1,152,000 bytes/s = 9.216 Mbps
That is before packet headers. Nine megabits for eight channels of audio is trivial on a LAN but impossible to guarantee across the public internet for a remote venue. That is exactly why the moment audio leaves the building, the rules change.
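To see what packet headers add, the sketch below repeats the arithmetic with a per-packet overhead. It assumes 40 bytes of headers per packet (IPv4 20 + UDP 8 + RTP 12) and a 1 ms packet time, and it leaves out Ethernet framing; the function name is hypothetical.

```python
HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP, assumed; Ethernet framing not counted


def stream_bitrate_bps(sample_rate_hz, bytes_per_sample, channels, packet_time_us=1_000):
    """Return (payload bits/s, total bits/s including IP/UDP/RTP headers)."""
    payload_bps = sample_rate_hz * bytes_per_sample * channels * 8
    packets_per_second = 1_000_000 // packet_time_us
    overhead_bps = packets_per_second * HEADER_BYTES * 8
    return payload_bps, payload_bps + overhead_bps


print(stream_bitrate_bps(48_000, 3, 2))  # (2304000, 2624000)
print(stream_bitrate_bps(48_000, 3, 8))  # (9216000, 9536000)
```

Header overhead is fixed per packet, so it weighs more on small streams: roughly 14% on the stereo pair but only about 3.5% on the 8-channel stream.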
Leaving the building: contribution over the public internet
When the venue is not wired into your facility — a reporter in another country, a stadium across the city, a guest at home — you cannot run AES67. The public internet has no shared PTP clock, loses packets, and varies in delay. Contribution over IP solves this with two ingredients: a low-delay codec to shrink the audio, and a reliable transport to survive packet loss.
Low-delay contribution codecs
Distribution codecs like standard AAC-LC are tuned for efficiency and accept tens of milliseconds of delay. Contribution needs the opposite: minimum delay so talent and operators can interact live. The codecs built for this trade a little efficiency for very low latency.
The AAC-ELD family from Fraunhofer is the broadcast workhorse here. AAC-LD reaches a maximum algorithmic delay of about 20 ms; AAC-ELD pushes that to roughly 15 ms at 48 kHz, and in a delay-reduced mode down to about 7.5 ms (a 240-sample block). Opus, the open codec that dominates WebRTC, is also a strong contribution choice: it runs at 2.5–60 ms frames and includes in-band forward error correction (FEC) that rebuilds a lost frame from data carried in the next packet. The European Broadcasting Union codified contribution-codec choices in EBU Tech 3326 (the ACIP standard): G.711, G.722, MPEG Layer 2, and 16-bit PCM are required; AAC and AAC-LD are recommended; Opus is optional — a list that reflects the standard's 2014 vintage more than 2026 practice, where Opus is often the first reach for internet contribution.
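The idea behind in-band FEC can be shown with a toy model in which each packet carries its own frame plus a redundant copy of the previous one. This is a conceptual sketch only, not the Opus bitstream or any real Opus API; in the real codec the redundant copy is a lower-bitrate encoding.

```python
def build_packets(frames):
    """Each packet carries its own frame and a redundant copy of the previous frame."""
    packets = []
    for seq, frame in enumerate(frames):
        previous = frames[seq - 1] if seq > 0 else None
        packets.append({"seq": seq, "primary": frame, "fec_prev": previous})
    return packets


def decode(packets, lost_seqs, total_frames):
    """Report how each frame was obtained: 'primary', 'fec', or 'lost'."""
    status = ["lost"] * total_frames
    for packet in packets:
        if packet["seq"] in lost_seqs:
            continue  # this packet never arrived
        status[packet["seq"]] = "primary"
        prev = packet["seq"] - 1
        if packet["fec_prev"] is not None and status[prev] == "lost":
            status[prev] = "fec"  # rebuilt from the copy in the next packet
    return status


packets = build_packets(["f0", "f1", "f2", "f3", "f4"])
print(decode(packets, {2}, 5))     # ['primary', 'primary', 'fec', 'primary', 'primary']
print(decode(packets, {2, 3}, 5))  # ['primary', 'primary', 'lost', 'fec', 'primary']
```

The second case shows the limit of the scheme: an isolated loss is repaired, but a burst of two leaves the first frame unrecoverable. The receiver also has to wait for the next packet before it can repair anything, which costs one frame of extra delay.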
Reliable transport: SRT, RIST, Zixi
A low-delay codec is useless if the network drops 5% of its packets and you have no way to recover them. Retransmitting everything (as TCP does) adds too much delay for live. The broadcast answer is a family of UDP-based protocols that add Automatic Repeat reQuest (ARQ) — selectively re-requesting only the packets that were lost, within a tight time budget — plus encryption.
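The "tight time budget" can be estimated with a back-of-the-envelope model. The sketch below assumes each retransmission attempt costs one round trip and that losses are independent, both of which are simplifications; the 40 ms round-trip time and 120 ms budget are hypothetical numbers.

```python
def retransmission_rounds(latency_budget_ms, rtt_ms):
    """How many re-request round trips fit inside the receiver's latency budget."""
    return int(latency_budget_ms // rtt_ms)


def residual_loss(loss_rate, rounds):
    """Chance a packet is still missing after the first send plus `rounds` retries,
    assuming each attempt is lost independently with the same probability."""
    return loss_rate ** (rounds + 1)


rounds = retransmission_rounds(latency_budget_ms=120, rtt_ms=40)
print(rounds)                       # 3
print(residual_loss(0.05, rounds))  # about 6.25e-06
```

The model shows why the latency setting on these protocols is tuned against round-trip time: a budget shorter than one round trip leaves no room for any retry, and residual loss stays at the raw network loss rate.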
The three you will hear named are SRT (Secure Reliable Transport — open source, the default first choice in 2026, supported by OBS, FFmpeg, and nearly every hardware encoder), RIST (Reliable Internet Stream Transport — the open-standard option with mature hybrid FEC-plus-ARQ), and Zixi (a commercial protocol that sends some redundancy proactively, winning on networks with sustained 3–8% loss). For most single-path internet contribution, SRT is the right answer; the audio rides inside the same SRT tunnel as the video.
Figure 3. Internet contribution: encode with a low-delay codec, protect with a reliable transport, decode and re-mix at the studio.
Where Fora Soft fits in
Fora Soft has built live and real-time media software since 2005, across video conferencing, OTT and internet-TV platforms, e-learning, and telemedicine. The contribution-versus-distribution split shows up in nearly every live project we take on: a remote-guest feature on a streaming platform is a contribution problem (low delay, reliable transport, clean re-mix) bolted onto a distribution problem (adaptive bitrate to the audience). Teams that treat the whole pipeline as one stage end up with drifting audio or unacceptable delay; the engineering value is in designing the two legs separately and stitching the clock and timestamps cleanly between them. That is the kind of architecture decision we help product teams get right before code is written.
What to read next
- Audio in HLS, DASH, CMAF: how a streaming player picks an audio track
- Opus: the open codec that ate WebRTC
- Audio in containers: how MP4, MKV, fMP4, MPEG-TS carry audio
Call to action
- Talk to an audio engineer: book a 30-minute scoping call to talk through your live audio contribution plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the Live Audio Contribution Cheat Sheet — Contribution vs distribution, AES67 vs SMPTE ST 2110-30 (formats, clock, packet time, channels), the uncompressed-PCM bandwidth math, and the contribution codec / reliable-transport choices on one page.
References
- Audio Engineering Society, "AES67-2023: AES standard for audio applications of networks — High-performance streaming audio-over-IP interoperability." Standards source (tier 1). Cited for L16/L24 PCM over RTP, 24-bit/48 kHz baseline, supported sample rates, 1460-byte payload cap, and PTP timing. https://www.aes.org/publications/standards/search.cfm?docID=96
- SMPTE, "ST 2110-30:2017 — Professional Media Over Managed IP Networks: PCM Digital Audio." Standards source (tier 1). Cited for AES67 reference, conformance levels A/B/C and AX/BX/CX, Level A mandatory profile (48 kHz, 1–8 ch, 1 ms), packet-time range. https://pub.smpte.org/pub/st2110-30/st2110-30-2017.pdf
- SMPTE, "ST 2110-31:2018 — Professional Media Over Managed IP Networks: AES3 Transparent Transport." Standards source (tier 1). Cited for bit-transparent AES3 / Dolby E carriage as the non-PCM alternative to -30. https://pub.smpte.org/doc/st2110-31/20180727-pub/st2110-31-2018.pdf
- IEEE 1588-2008, "Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems" (PTPv2). Standards source (tier 1). Cited for the shared media clock that underpins AES67 and ST 2110. https://standards.ieee.org/ieee/1588/4355/
- EBU, "Tech 3326 — Audio Contribution over IP: Requirements for Interoperability" (ACIP), rev. 2014. Cited for required/recommended/optional contribution codec lists (G.711, G.722, MPEG L2, PCM required; AAC, AAC-LD recommended; Opus optional) and the SIP/RTP framework. https://tech.ebu.ch/docs/tech/tech3326.pdf
- EBU, "Tech 3329 — A Tutorial on Audio Contribution over IP." Cited for the contribution-over-IP framing and the studio-link use case. https://tech.ebu.ch/docs/tech/tech3329.pdf
- Fraunhofer IIS, "AAC-ELD Family" and the AAC-ELD-family technical paper. Cited for AAC-LD ~20 ms maximum algorithmic delay, AAC-ELD ~15 ms at 48 kHz, and the ~7.5 ms delay-reduced 240-sample mode. https://www.iis.fraunhofer.de/en/ff/amm/communication/aaceld.html
- IETF RFC 6716, "Definition of the Opus Audio Codec." Cited for 2.5–60 ms frame range and in-band forward error correction (FEC). https://www.rfc-editor.org/rfc/rfc6716
- Wikipedia, "Reliable Internet Stream Transport (RIST)" and SRT project documentation. Cited for ARQ-over-UDP retransmission, encryption, and the SRT/RIST/Zixi contribution-protocol comparison. https://en.wikipedia.org/wiki/Reliable_Internet_Stream_Transport
- RAVENNA Network, "ST 2110: Audio Transport Methods" and "AES67 Practical Guide." Cited for the 48-samples-per-packet detail, 2–3 ms typical LAN latency, and RAVENNA/AES67 interoperability. https://www.ravenna-network.com/st-2110-audio-transport-methods/
- AIMS Alliance, "AES67 / SMPTE ST 2110 — Commonalities and Constraints." Cited for how ST 2110-30 constrains AES67 and the bandwidth-per-channel arithmetic. https://aimsalliance.org/wp-content/uploads/2019/04/AES67-SMPTE-ST-2110-Commonalities-and-Constraints-Updated-April-2019.pdf
- TV Tech, "SMPTE ST 2110-30: A Fair Hearing for Audio." Cited (secondary) for the conformance-level summary and the 2.304 Mbps stereo / 9.216 Mbps 8-channel bandwidth figures, corroborated against the arithmetic in this article. https://www.tvtechnology.com/opinions/smpte-st-2110-30-a-fair-hearing-for-audio
Note on source tiers: the controlling facts (PCM-over-RTP, conformance levels, PTP timing, codec delays) are taken from the standards bodies (AES, SMPTE, IEEE, EBU, IETF, Fraunhofer). Where a secondary source (the Wikipedia RIST / SRT entry and the TV Tech entry, the ninth and twelfth items above) supplied a number, it was checked against the standards arithmetic or the protocol documentation before inclusion.


