Why this matters

If you build live video — sports, concerts, news, houses of worship, or a remote-guest interview platform — the audio path is where "we'll fix it later" turns into a broken broadcast. The choices made in the first few seconds of the signal chain decide whether your channels stay synchronized, how much network you need, and how much delay piles up before the audience hears a word. This article is for product managers and engineers who have to specify a live pipeline and talk to broadcast vendors without guessing. You will finish able to tell contribution from distribution, read an AES67 or ST 2110-30 spec sheet, and pick the right codec for each leg of the journey.

Two journeys, not one: contribution vs distribution

The single idea that unlocks this whole topic is that live audio takes two separate trips, and they have opposite priorities.

The first trip is contribution. This is the move from where the sound is captured — a stadium, a concert hall, a remote reporter's kit — back to a production hub or studio. Contribution audio is raw material. It will still be mixed, leveled, dubbed into other languages, and combined with graphics. Because it gets re-encoded later, you keep it at the highest quality you can afford and you keep delay tiny, so the production team can react in real time. Think of contribution as shipping fresh ingredients to a kitchen: you do not vacuum-seal and freeze them, because the chef still has to cook.

The second trip is distribution. This is the move from the finished studio mix out to the audience over the internet (HLS, DASH) or broadcast. Distribution audio is the finished dish. It is compressed hard, because it goes to millions of devices and bandwidth is the dominant cost, and a few seconds of delay is acceptable.

The rest of this article is about the contribution leg, because that is the part most teams underestimate. The codecs, the clocks, and the standards are all different from what you use on the distribution side.

Figure 1. The two journeys of live audio. Contribution favors quality and low delay; distribution favors low bandwidth and scale. (Diagram: a short, high-quality contribution hop from venue to studio, then a compressed distribution hop from studio to viewers. The two legs use different standards and codecs.)

Inside the building: uncompressed audio over IP

Twenty years ago, audio inside a studio moved over dedicated cables — one physical wire per channel, using formats like AES3. Today it moves as data packets over the same Ethernet network that carries everything else. That shift created a need for shared rules so a microphone preamp from one vendor and a mixing console from another could exchange audio without translation. Two such rule sets dominate, and they are deliberately compatible.

AES67: the common language for audio over IP

AES67 is a standard from the Audio Engineering Society, first published in 2013 and revised most recently as AES67-2023. It is not a product and not a network — it is an agreement on how to put professional audio into network packets so that gear from different makers (and competing ecosystems like Dante, Ravenna, and Livewire) can interoperate.

AES67 carries uncompressed audio. There is no codec in the usual sense; the audio is linear PCM — the same raw, sample-by-sample format described in our digital-audio primer — sent as 16-bit (called L16) or 24-bit (L24) samples over RTP, the Real-time Transport Protocol, on top of UDP. The baseline for compliance is 24-bit audio at a 48 kHz sample rate, with 44.1, 88.2, and 96 kHz also supported.

The defining detail of AES67 is how audio is chopped into packets. The stream is cut into short, fixed-duration chunks, and the duration of each chunk is called the packet time. The default packet time is 1 millisecond, which at 48 kHz is exactly 48 samples per channel in each packet. Shorter packet times (down to 125 microseconds, or 6 samples at 48 kHz) cut latency at the cost of more packets per second and more network overhead. The RTP payload is capped at 1460 bytes so a packet fits inside a standard 1500-byte Ethernet frame without fragmenting.
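The packet-time arithmetic can be sketched in a few lines. This is an illustrative helper, not part of any AES67 library: the function name and parameters are hypothetical, the payload cap is the figure quoted in this article, and the sketch simply rejects packet times that do not land on a whole number of samples.

```python
MAX_RTP_PAYLOAD_BYTES = 1460  # payload cap as quoted in this article


def packet_layout(sample_rate_hz, packet_time_us, channels, bytes_per_sample=3):
    """Return (samples_per_packet, payload_bytes, packets_per_second).

    Hypothetical helper for illustration; bytes_per_sample=3 means L24.
    """
    samples, remainder = divmod(sample_rate_hz * packet_time_us, 1_000_000)
    if remainder:
        # Simplification: this sketch only handles whole-sample packet times.
        raise ValueError("packet time is not a whole number of samples")
    payload_bytes = samples * channels * bytes_per_sample
    if payload_bytes > MAX_RTP_PAYLOAD_BYTES:
        raise ValueError("payload would exceed the RTP payload cap")
    packets_per_second = 1_000_000 / packet_time_us
    return samples, payload_bytes, packets_per_second


if __name__ == "__main__":
    # Stereo L24 at 48 kHz with the default 1 ms packet time.
    print(packet_layout(48_000, 1_000, channels=2))
    # Eight channels at the same settings still fit in one packet.
    print(packet_layout(48_000, 1_000, channels=8))
```

Halving the packet time halves the samples in each packet but doubles the packet rate, which is where the extra header overhead comes from.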

The shared clock: PTP and why it is the whole game

Here is the part that separates audio-over-IP from ordinary networking. Every device on an AES67 network derives its sense of time from one shared "wall clock" distributed across the network using the Precision Time Protocol (PTP), defined in IEEE 1588-2008. A media clock for a 48 kHz stream advances exactly 48,000 samples for every second that ticks on this shared clock.
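To make the link between the shared clock and the media clock concrete, here is a minimal sketch that converts a PTP time (in nanoseconds) into a sample count and a 32-bit RTP timestamp. The function names and the `offset` parameter are assumptions for illustration; real devices negotiate any offset through their session description rather than a function argument.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def media_clock_samples(ptp_time_ns, sample_rate_hz=48_000):
    """Samples elapsed on the media clock at the given PTP time."""
    return ptp_time_ns * sample_rate_hz // NANOSECONDS_PER_SECOND


def rtp_timestamp(ptp_time_ns, sample_rate_hz=48_000, offset=0):
    """32-bit RTP timestamp for the sample at ptp_time_ns.

    RTP timestamps are 32 bits wide, so the value wraps modulo 2**32.
    `offset` is a hypothetical per-stream constant.
    """
    return (media_clock_samples(ptp_time_ns, sample_rate_hz) + offset) % 2**32


if __name__ == "__main__":
    # One second of shared-clock time is exactly 48,000 samples at 48 kHz.
    print(media_clock_samples(NANOSECONDS_PER_SECOND))
```

Because both functions depend only on the shared time, two devices that read the same PTP time compute the same timestamp, which is what lets a receiver line up streams from different senders.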

Why does this matter so much? Because a live mix combines dozens of sources. If the microphone on the left side of the stage and the microphone on the right run on slightly different clocks, their audio drifts apart over minutes, and the mix smears. PTP gives every device the same heartbeat, so 48,000 samples always means the same one second everywhere. On a small gigabit network, a typical AES67 setup (48 samples per packet, 24-bit, 48 kHz) reaches end-to-end latency of about 2–3 milliseconds.
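The drift is easy to quantify. The sketch below assumes two free-running clocks whose frequencies differ by a fixed number of parts per million (ppm); the 50 ppm figure in the example is a hypothetical value chosen for illustration, not a measurement from any specific device.

```python
def drift_samples(ppm_offset, seconds, sample_rate_hz=48_000):
    """Samples by which two clocks diverge after `seconds`.

    ppm_offset is the frequency difference between the two clocks
    in parts per million (an assumed, constant value in this sketch).
    """
    return sample_rate_hz * seconds * ppm_offset / 1_000_000


def drift_ms(ppm_offset, seconds):
    """The same divergence expressed in milliseconds."""
    return seconds * ppm_offset / 1_000


if __name__ == "__main__":
    ten_minutes = 600
    print(drift_samples(50, ten_minutes))  # 1440.0 samples
    print(drift_ms(50, ten_minutes))       # 30.0 ms
```

Under that assumption, ten minutes is enough for two microphones to slip 30 ms apart, which is far more than a mix can tolerate. With PTP in place the offset stays bounded instead of growing with time.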

Common mistake: forgetting the clock. Teams new to audio over IP buy compatible AES67 gear, plug it in, and hear clicks, dropouts, or slow drift. The cause is almost always PTP: no grandmaster clock has been elected, or two devices think they are the master, or a non-PTP-aware switch is mangling the timing packets. AES67 interoperability is a clocking problem first and a codec problem never. Budget for PTP-capable switches and design the clock topology before you buy a single console.

SMPTE ST 2110-30: AES67 with broadcast guardrails

If AES67 is the common language, SMPTE ST 2110-30 is the broadcast dialect. ST 2110 is the family of standards for sending separate video, audio, and metadata "essence" streams over IP in a professional facility. Its audio part, ST 2110-30, transports PCM digital audio by directly referencing AES67 — then narrows the options so that broadcast equipment behaves predictably.

It does this with conformance levels. Rather than leaving packet time and channel count fully open, ST 2110-30 defines named tiers. The mandatory one is Level A: 48 kHz, 16- or 24-bit, 1 to 8 channels, with a 1 ms packet time. Levels B and C build on A by adding a 125 µs packet time for lower latency, and Level C also raises the ceiling to as many as 64 channels per stream at that packet time. The "X" variants (AX, BX, CX) add 96 kHz operation. Every level includes Level A, so there is always a common baseline two pieces of gear can fall back to.
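The Level A baseline is simple enough to express as a check. The sketch below uses a hypothetical `AudioStream` record and tests only the Level A constraints stated above; it is not a full conformance test and does not model the other levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioStream:
    """Hypothetical description of one PCM stream (illustrative only)."""
    sample_rate_hz: int
    bit_depth: int
    channels: int
    packet_time_us: int


def fits_level_a(stream):
    """True if the stream stays inside the Level A baseline:
    48 kHz, 16- or 24-bit, 1 to 8 channels, 1 ms packet time."""
    return (
        stream.sample_rate_hz == 48_000
        and stream.bit_depth in (16, 24)
        and 1 <= stream.channels <= 8
        and stream.packet_time_us == 1_000
    )


if __name__ == "__main__":
    stereo = AudioStream(48_000, 24, 2, 1_000)
    wide = AudioStream(48_000, 24, 16, 1_000)
    print(fits_level_a(stereo), fits_level_a(wide))  # True False
```

A check like this is useful when planning stream layouts: anything that passes can be received by any compliant device, while anything that fails depends on both ends supporting a higher level.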

There is a companion standard worth knowing: ST 2110-31 carries AES3 audio bit-transparently. That matters when a stream is not plain PCM — for example, a Dolby E or coded surround signal that must pass through the facility untouched. Use -30 for PCM; use -31 when you need to tunnel a pre-encoded audio signal without disturbing it.

Property | AES67 | SMPTE ST 2110-30 (Level A)
Audio format | Uncompressed PCM (L16 / L24) | Uncompressed PCM (16 / 24-bit)
Sample rates | 44.1 / 48 / 88.2 / 96 kHz | 48 kHz (96 kHz in the AX / BX / CX levels)
Transport | RTP over UDP | RTP over UDP (per AES67)
Clock | PTP (IEEE 1588-2008) | PTP (IEEE 1588-2008)
Packet time | 125 µs – 4 ms (1 ms default) | 1 ms (125 µs added in Levels B / C)
Channels per stream | Flexible | 1–8 (Level A); up to 64 in Level C
Designed for | Vendor interoperability | Predictable broadcast deployment

The practical takeaway: AES67 makes two devices able to talk; ST 2110-30 makes a hundred devices behave the same way in a real broadcast plant. Most professional audio gear sold today supports both.

Figure 2. An IP audio facility: a PTP grandmaster feeds the shared clock; sources and the mixer exchange ST 2110-30 PCM streams over the same network. (Diagram: inside the facility, every device shares one PTP clock. AES67 defines the packet format; SMPTE ST 2110-30 adds conformance levels that constrain packet time and channel count for predictable broadcast behavior.)

The bandwidth math: uncompressed is heavy

Uncompressed audio inside a facility is cheap on a 1- or 10-gigabit LAN, but the numbers are worth seeing so you understand why you cannot send it over the open internet. Bandwidth for raw PCM is simply sample rate times bit depth times channel count:

bandwidth (bytes/s) = sample_rate (Hz) × bit_depth (bytes per sample) × channels
bandwidth (bits/s) = bandwidth (bytes/s) × 8

A stereo pair at 48 kHz, 24-bit (3 bytes per sample):

48,000 × 3 × 2 = 288,000 bytes/s = 2.304 Mbps

An 8-channel ST 2110-30 stream at the same quality:

48,000 × 3 × 8 = 1,152,000 bytes/s = 9.216 Mbps

That is before packet headers. Nine megabits for eight channels of audio is trivial on a LAN but impossible to guarantee across the public internet for a remote venue. That is exactly why the moment audio leaves the building, the rules change.
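The figures above, plus a rough estimate of what the packet headers add, can be reproduced with a short sketch. The header size assumes a plain RTP header (12 bytes) over UDP (8 bytes) over IPv4 (20 bytes) with no extensions; Ethernet framing is left out, so real on-wire numbers will be somewhat higher. The function names are illustrative.

```python
RTP_UDP_IPV4_HEADER_BYTES = 12 + 8 + 20  # assumed: no extensions, no Ethernet framing


def pcm_payload_mbps(sample_rate_hz, bytes_per_sample, channels):
    """Raw PCM bitrate in Mbps, before any packet headers."""
    return sample_rate_hz * bytes_per_sample * channels * 8 / 1_000_000


def pcm_ip_mbps(sample_rate_hz, bytes_per_sample, channels, packet_time_us=1_000):
    """PCM bitrate in Mbps including RTP/UDP/IPv4 headers on every packet."""
    packets_per_second = 1_000_000 / packet_time_us
    header_mbps = packets_per_second * RTP_UDP_IPV4_HEADER_BYTES * 8 / 1_000_000
    return pcm_payload_mbps(sample_rate_hz, bytes_per_sample, channels) + header_mbps


if __name__ == "__main__":
    print(pcm_payload_mbps(48_000, 3, 2))  # stereo payload: 2.304
    print(pcm_payload_mbps(48_000, 3, 8))  # 8-channel payload: 9.216
    print(round(pcm_ip_mbps(48_000, 3, 8), 3))  # with headers at 1 ms packets
```

At a 1 ms packet time the headers add about 0.32 Mbps per stream regardless of channel count, so the 8-channel stream lands at roughly 9.5 Mbps. At 125 µs the same headers are sent eight times as often.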

Leaving the building: contribution over the public internet

When the venue is not wired into your facility — a reporter in another country, a stadium across the city, a guest at home — you cannot run AES67. The public internet has no shared PTP clock, loses packets, and varies in delay. Contribution over IP solves this with two ingredients: a low-delay codec to shrink the audio, and a reliable transport to survive packet loss.

Low-delay contribution codecs

Distribution codecs like standard AAC-LC are tuned for efficiency and accept tens of milliseconds of delay. Contribution needs the opposite: minimum delay so talent and operators can interact live. The codecs built for this trade a little efficiency for very low latency.

The AAC-ELD family from Fraunhofer is the broadcast workhorse here. AAC-LD reaches a maximum algorithmic delay of about 20 ms; AAC-ELD cuts that to roughly 15 ms at 48 kHz, and in a delay-reduced mode to about 7.5 ms (a 240-sample block). Opus, the open codec that dominates WebRTC, is also a strong contribution choice: it runs at 2.5–60 ms frames and includes in-band forward error correction (FEC) that rebuilds a lower-bitrate copy of a lost frame from data carried in the next packet. The European Broadcasting Union codified contribution-codec choices in EBU Tech 3326 (the ACIP standard): G.711, G.722, MPEG Layer 2, and 16-bit PCM are required; AAC and AAC-LD are recommended; Opus is optional. That list reflects the standard's 2014 vintage more than 2026 practice, where Opus is often the first choice for internet contribution.
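One way to use these numbers when specifying a link is to filter codecs by a delay budget. The sketch below hard-codes only the approximate algorithmic delays quoted in this article; Opus is left out because the article gives its frame range rather than a single algorithmic-delay figure. The table and function names are hypothetical.

```python
# Approximate algorithmic delays in milliseconds, as quoted in this article.
CODEC_DELAY_MS = {
    "AAC-LD": 20.0,
    "AAC-ELD": 15.0,
    "AAC-ELD (delay-reduced)": 7.5,
}


def codecs_within_budget(budget_ms, table=CODEC_DELAY_MS):
    """Codec names whose algorithmic delay fits the budget, lowest delay first."""
    candidates = [name for name, delay in table.items() if delay <= budget_ms]
    return sorted(candidates, key=table.get)


if __name__ == "__main__":
    print(codecs_within_budget(16))
    # ['AAC-ELD (delay-reduced)', 'AAC-ELD']
```

Algorithmic delay is only one term in the end-to-end budget; network transit and the receive buffer of the transport protocol usually add more.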

Reliable transport: SRT, RIST, Zixi

A low-delay codec is useless if the network drops 5% of its packets and you have no way to recover them. Retransmitting everything (as TCP does) adds too much delay for live. The broadcast answer is a family of UDP-based protocols that add Automatic Repeat reQuest (ARQ) — selectively re-requesting only the packets that were lost, within a tight time budget — plus encryption.
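The core of selective ARQ is small: spot the gaps in the sequence numbers, and re-request only the packets that can still arrive before they are due to play. The sketch below is a generic illustration of that idea, not the wire behavior of SRT, RIST, or Zixi; the function name, its parameters, and the deadline callback are all hypothetical, and sequence-number wraparound is ignored.

```python
def retransmit_requests(received_seqs, first_seq, last_seq,
                        now_ms, rtt_ms, playout_deadline_ms):
    """Sequence numbers worth re-requesting right now.

    received_seqs: sequence numbers that have arrived so far.
    playout_deadline_ms: callable mapping a sequence number to the time
        (in ms) at which that packet must be available for playout.
    A lost packet is requested only if one round trip still fits
    before its deadline; otherwise the request would be wasted.
    """
    seen = set(received_seqs)
    requests = []
    for seq in range(first_seq, last_seq + 1):
        if seq in seen:
            continue  # arrived, nothing to recover
        if now_ms + rtt_ms <= playout_deadline_ms(seq):
            requests.append(seq)
    return requests


if __name__ == "__main__":
    # Assumed timing: one packet every 10 ms, played out 60 ms after seq 100 starts.
    deadline = lambda seq: (seq - 100) * 10 + 60
    got = [100, 101, 103, 106]
    print(retransmit_requests(got, 100, 106, now_ms=50, rtt_ms=40,
                              playout_deadline_ms=deadline))
    # [104, 105]: packet 102 is skipped because it cannot arrive in time
```

The example shows why the latency budget and the round-trip time have to be sized together: with a budget shorter than one round trip, no lost packet can ever be recovered.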

The three you will hear named are SRT (Secure Reliable Transport — open source, the default first choice in 2026, supported by OBS, FFmpeg, and nearly every hardware encoder), RIST (Reliable Internet Stream Transport — the open-standard option with mature hybrid FEC-plus-ARQ), and Zixi (a commercial protocol that sends some redundancy proactively, winning on networks with sustained 3–8% loss). For most single-path internet contribution, SRT is the right answer; the audio rides inside the same SRT tunnel as the video.

Figure 3. Internet contribution: encode with a low-delay codec, protect with a reliable transport, decode and re-mix at the studio. (Diagram: over the public internet, contribution uses a low-delay codec, AAC-ELD or Opus, wrapped in a reliable transport, SRT, RIST, or Zixi, that recovers lost packets with ARQ. The studio decodes, mixes, and re-encodes for distribution.)

Where Fora Soft fits in

Fora Soft has built live and real-time media software since 2005, across video conferencing, OTT and internet-TV platforms, e-learning, and telemedicine. The contribution-versus-distribution split shows up in nearly every live project we take on: a remote-guest feature on a streaming platform is a contribution problem (low delay, reliable transport, clean re-mix) bolted onto a distribution problem (adaptive bitrate to the audience). Teams that treat the whole pipeline as one stage end up with drifting audio or unacceptable delay; the engineering value is in designing the two legs separately and stitching the clock and timestamps cleanly between them. That is the kind of architecture decision we help product teams get right before code is written.

References

  1. Audio Engineering Society, "AES67-2023: AES standard for audio applications of networks — High-performance streaming audio-over-IP interoperability." Standards source (tier 1). Cited for L16/L24 PCM over RTP, 24-bit/48 kHz baseline, supported sample rates, 1460-byte payload cap, and PTP timing. https://www.aes.org/publications/standards/search.cfm?docID=96
  2. SMPTE, "ST 2110-30:2017 — Professional Media Over Managed IP Networks: PCM Digital Audio." Standards source (tier 1). Cited for AES67 reference, conformance levels A/B/C and AX/BX/CX, Level A mandatory profile (48 kHz, 1–8 ch, 1 ms), packet-time range. https://pub.smpte.org/pub/st2110-30/st2110-30-2017.pdf
  3. SMPTE, "ST 2110-31:2018 — Professional Media Over Managed IP Networks: AES3 Transparent Transport." Standards source (tier 1). Cited for bit-transparent AES3 / Dolby E carriage as the non-PCM alternative to -30. https://pub.smpte.org/doc/st2110-31/20180727-pub/st2110-31-2018.pdf
  4. IEEE 1588-2008, "Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems" (PTPv2). Standards source (tier 1). Cited for the shared media clock that underpins AES67 and ST 2110. https://standards.ieee.org/ieee/1588/4355/
  5. EBU, "Tech 3326 — Audio Contribution over IP: Requirements for Interoperability" (ACIP), rev. 2014. Cited for required/recommended/optional contribution codec lists (G.711, G.722, MPEG L2, PCM required; AAC, AAC-LD recommended; Opus optional) and the SIP/RTP framework. https://tech.ebu.ch/docs/tech/tech3326.pdf
  6. EBU, "Tech 3329 — A Tutorial on Audio Contribution over IP." Cited for the contribution-over-IP framing and the studio-link use case. https://tech.ebu.ch/docs/tech/tech3329.pdf
  7. Fraunhofer IIS, "AAC-ELD Family" and the AAC-ELD-family technical paper. Cited for AAC-LD ~20 ms maximum algorithmic delay, AAC-ELD ~15 ms at 48 kHz, and the ~7.5 ms delay-reduced 240-sample mode. https://www.iis.fraunhofer.de/en/ff/amm/communication/aaceld.html
  8. IETF RFC 6716, "Definition of the Opus Audio Codec." Cited for 2.5–60 ms frame range and in-band forward error correction (FEC). https://www.rfc-editor.org/rfc/rfc6716
  9. Wikipedia, "Reliable Internet Stream Transport (RIST)" and SRT project documentation. Cited for ARQ-over-UDP retransmission, encryption, and the SRT/RIST/Zixi contribution-protocol comparison. https://en.wikipedia.org/wiki/Reliable_Internet_Stream_Transport
  10. RAVENNA Network, "ST 2110: Audio Transport Methods" and "AES67 Practical Guide." Cited for the 48-samples-per-packet detail, 2–3 ms typical LAN latency, and RAVENNA/AES67 interoperability. https://www.ravenna-network.com/st-2110-audio-transport-methods/
  11. AIMS Alliance, "AES67 / SMPTE ST 2110 — Commonalities and Constraints." Cited for how ST 2110-30 constrains AES67 and the bandwidth-per-channel arithmetic. https://aimsalliance.org/wp-content/uploads/2019/04/AES67-SMPTE-ST-2110-Commonalities-and-Constraints-Updated-April-2019.pdf
  12. TV Tech, "SMPTE ST 2110-30: A Fair Hearing for Audio." Cited (secondary) for the conformance-level summary and the 2.304 Mbps stereo / 9.216 Mbps 8-channel bandwidth figures, corroborated against the arithmetic in this article. https://www.tvtechnology.com/opinions/smpte-st-2110-30-a-fair-hearing-for-audio

Note on source tiers: the controlling facts (PCM-over-RTP, conformance levels, PTP timing, codec delays) are taken from the standards bodies (AES, SMPTE, IEEE, EBU, IETF, Fraunhofer). Where a secondary source (references 9, 12) supplied a number, it was checked against the standards arithmetic or the protocol documentation before inclusion.