Channel-Based vs Object-Based vs Scene-Based Audio

Why This Matters

If you build or buy a product that plays video — streaming, conferencing, OTT, e-learning, telemedicine, surveillance, or VR — someone will eventually ask whether you support "spatial audio," "Dolby Atmos," or "ambisonics." Those three words map onto the three approaches in this article, and confusing them leads to expensive mistakes: shipping a 5.1 pipeline when the client needed object-based delivery, or storing flat stereo when a VR headset needed a rotatable sound field. This article gives a product person the full mental model — what each approach stores, what it costs, where it wins, and how the modern standards (Dolby Atmos, MPEG-H, ambisonics) fit together — so you can scope an audio feature and talk to engineers without guessing. It assumes zero audio background; every term is defined before it is used.

The one question that splits all spatial audio

Every immersive-audio format ever built answers the same question in one of three ways. The question is: what exactly do we write into the file?

You can write down the speakers — "play this signal out of the left-rear speaker." You can write down the sounds and their positions — "the helicopter is up and to the right; figure out which speakers to use." Or you can write down the whole sound field as a stack of mathematical layers and decode it later for any speaker set. That is the entire taxonomy: channel-based, object-based, and scene-based. Everything else — Atmos, DTS:X, MPEG-H, ambisonics, Apple Spatial Audio — is a product built from one or more of these three ideas.

This three-way split is not marketing language. It is written into the international standard that broadcasters and tool-makers use to label audio, Recommendation ITU-R BS.2076, known as the Audio Definition Model or ADM. The ADM is a formal vocabulary — think of it as the grammar that lets a file say "I am this type of audio." Its current edition, BS.2076-3, was published in February 2025. The ADM actually names five types (it adds matrix and binaural as special cases we will cover briefly), but the three that matter for almost every product decision are channel, object, and scene.

[Figure: three side-by-side panels comparing channel-based, object-based, and scene-based audio: what each stores, the speaker icon, and a one-line summary.]
Figure 1. The three philosophies of spatial audio, by what they actually write into the file.

Approach 1: Channel-based audio — name the speakers

Channel-based audio is the kind humanity has used since the first wax cylinder. A channel is one stream of audio meant to be played, without modification, out of one speaker in a known position. Stereo is two channels: left and right. "5.1" is six channels. The EBU's ADM guidelines call this type DirectSpeakers precisely to stress what it is — a signal that goes direct to a speaker, no thinking required.

Let us define "5.1" properly, because the notation confuses almost everyone. The first number counts the full-range, ear-level speakers; the second counts the special bass-only channels.

5.1 = five full-range channels (Left, Centre, Right, Left Surround, Right Surround), each carrying roughly 20 Hz to 20,000 Hz, plus one Low-Frequency Effects channel (the "LFE") that carries only deep bass, about 20–120 Hz. It is the channel your subwoofer plays. The ".1" began as shorthand for a partial channel, one that uses only a small fraction of a normal channel's bandwidth. In modern layout names the second number simply counts LFE channels, which is why NHK's 22.2 layout has two of them.

A useful analogy: a channel-based mix is a set of pre-addressed envelopes. The studio writes "left-rear speaker" on an envelope and seals the sound inside. As long as your room has a left-rear speaker in the right place, the letter is delivered perfectly.

The strength is obvious — it is dead simple and needs no computing power at playback. The weakness is just as obvious: the mix only sounds right in a room that matches the studio's speaker map. If the studio mixed for five speakers and you have two, something has to fold the five down to two. That folding is called a downmix, and it always loses information. Move a speaker, or remove one, and the spatial picture distorts.
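To make the downmix concrete, here is a minimal sketch of folding one 5.1 sample frame to stereo. It assumes the commonly used coefficients of about 0.707 (roughly -3 dB) on the centre and surround channels and drops the LFE, which is a frequent convention rather than a universal rule; the function name and argument order are illustrative only.

```python
import math

def downmix_51_to_stereo(l, r, c, lfe, ls, rs, k=1 / math.sqrt(2)):
    """Fold one 5.1 sample frame down to a (left, right) stereo pair.

    Assumption: centre and surrounds are mixed in at k (about 0.707) and
    the LFE is discarded. Real products may use other coefficients.
    """
    left = l + k * c + k * ls
    right = r + k * c + k * rs
    return left, right

# Two different 5.1 frames: a sound only in the left surround,
# and a quieter sound only in the front left.
surround_only = downmix_51_to_stereo(0.0, 0.0, 0.0, 0.0, 1.0, 0.0)
front_only = downmix_51_to_stereo(1 / math.sqrt(2), 0.0, 0.0, 0.0, 0.0, 0.0)

# Both collapse to the same stereo frame, so the original position is unrecoverable.
print(surround_only == front_only)
```

The last line is the information loss in miniature: once two distinct surround pictures produce the same stereo pair, nothing downstream can tell them apart.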

There is a second, subtler weakness that matters for streaming. Each speaker layout is a different file. If you want to serve stereo, 5.1, and 7.1.4 (a layout with overhead speakers — more on the third number shortly), you store and stream three separate audio renditions. We will put numbers on that storage cost later in the article.

How far channel-based goes

Channel-based audio does not stop at 5.1. Layouts climb as rooms add speakers:

7.1 adds two more surround channels, for eight feeds in total: seven ear-level plus one LFE.
7.1.4 keeps the seven ear-level channels and one LFE, then adds four overhead channels. That third number — the ".4" — counts ceiling speakers. So "7.1.4" is twelve speakers: seven around you, one subwoofer, four above you.
22.2, NHK's broadcast layout, uses twenty-four channels: twenty-two full-range speakers in three vertical layers plus two LFE channels.
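The notation in the list above can be decoded mechanically. The sketch below is a hypothetical helper (not part of any standard library) that splits a layout name into its three counts and sums them. Note that for 22.2 the first number covers all full-range speakers across the three layers, so the "ear-level" label is only accurate for the smaller layouts.

```python
def parse_layout(name):
    """Split a layout name such as "7.1.4" into (main, lfe, overhead) counts."""
    parts = [int(p) for p in name.split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected layout name: {name!r}")
    parts += [0] * (3 - len(parts))  # a missing field means "none of these"
    return tuple(parts)

def total_channels(name):
    """Total number of feeds the layout needs."""
    return sum(parse_layout(name))

for layout in ("2.0", "5.1", "7.1", "7.1.4", "22.2"):
    print(layout, parse_layout(layout), total_channels(layout))
```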

Notice the pattern. As soon as audio went up — adding height — the channel approach started to strain. Twenty-four pre-addressed envelopes is a lot of envelopes, and almost no living room has the speakers to open them. This strain is exactly what pushed the industry toward the second approach.

Approach 2: Object-based audio — name the sounds, not the speakers

Object-based audio flips the question. Instead of writing "play this out of the left-rear speaker," it writes "here is the sound of a helicopter, and here is its position — up, behind you, drifting left." The file stores the sound (called an audio object) plus metadata: data about the sound, in this case its three-dimensional position over time. A separate piece of software at the playback end, called the renderer, reads the position and figures out which of your speakers should play how much of that sound to put it in the right place.

The analogy: object-based audio is a GPS address, not a sealed envelope. The studio writes "this sound lives at these coordinates." Your renderer is the local driver who knows your neighbourhood — your exact speakers — and delivers the sound to the right spot regardless of how your room is laid out. One address works for every neighbourhood.

This is the central advantage, and it is worth stating plainly: one object-based master plays correctly on headphones, a soundbar, a 5.1 system, or a 64-speaker cinema, with no separate mix for each. The renderer adapts. Channel-based audio needs a new file per layout; object-based audio needs one file plus a renderer.
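A toy renderer shows the idea. The sketch below is not Dolby's or anyone else's production algorithm; it assumes a flat ring of speakers described only by azimuth in degrees and uses simple constant-power panning between the two speakers on either side of the object. The function name and the example layouts are illustrative.

```python
import math

def render_object(azimuth_deg, speaker_azimuths_deg):
    """Return one gain per speaker for a sound at azimuth_deg.

    Toy constant-power panning between the two neighbouring speakers.
    Assumes at least two speakers at distinct azimuths on a horizontal ring.
    """
    n = len(speaker_azimuths_deg)
    if n < 2:
        raise ValueError("need at least two speakers")
    # Walk the ring in order of increasing azimuth.
    order = sorted(range(n), key=lambda i: speaker_azimuths_deg[i] % 360)
    az = azimuth_deg % 360
    gains = [0.0] * n
    for k in range(n):
        i, j = order[k], order[(k + 1) % n]
        start = speaker_azimuths_deg[i] % 360
        span = (speaker_azimuths_deg[j] - speaker_azimuths_deg[i]) % 360
        offset = (az - start) % 360
        if span > 0 and offset <= span:
            frac = offset / span  # 0 at speaker i, 1 at speaker j
            gains[i] = math.cos(frac * math.pi / 2)
            gains[j] = math.sin(frac * math.pi / 2)
            return gains
    raise ValueError("speakers must sit at distinct azimuths")

# The same object position, rendered for two different rooms.
stereo = [30, -30]
five_speakers = [0, 30, -30, 110, -110]
print(render_object(20, stereo))
print(render_object(20, five_speakers))
```

The object's metadata (an azimuth of 20 degrees) never changes; only the speaker list passed to the renderer does. That is the "one address, many neighbourhoods" property in code.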

[Figure: a signal-flow diagram in which an object master (sound plus position metadata) feeds a renderer, which fans out to headphones, soundbar, 5.1, and cinema speaker layouts.]
Figure 2. One object master, many speaker layouts: the renderer does the adapting.

Dolby Atmos is object-based (with a channel-based bed)

The most famous object-based system is Dolby Atmos. Atmos is not pure objects, though — it is a hybrid, and understanding the hybrid is the key to understanding modern immersive audio.

An Atmos mix has two parts. The first is a bed: a set of ordinary channel-based feeds (often a 7.1.2 layout: seven ear-level, one LFE, two overhead) used for ambience, music, and anything that does not need to move precisely. The second is a set of objects: individual sounds with positional metadata that the renderer places dynamically. A door slam, a passing car, a line of dialogue: these become objects.

The Dolby Atmos Renderer, the software studios use, accepts up to 128 inputs total, where each channel of a bed counts as one input and each object counts as one input. In a cinema, the bed typically takes 10 of those (a 9.1 bed, often described as 7.1.2), leaving up to 118 objects to roam freely. During playback, each cinema's own Atmos system renders those objects in real time against the known positions of that cinema's speakers, which is why the same Atmos master sounds correct in a room with 12 speakers and in a room with 64.
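The input budget is simple subtraction, sketched here under the assumption that every bed channel and every object costs exactly one of the 128 inputs, as described above. The helper name is hypothetical.

```python
MAX_RENDERER_INPUTS = 128  # total inputs quoted in the text above

def objects_available(bed_layout):
    """Objects left over once a bed such as "7.1.2" has claimed its inputs."""
    bed_channels = sum(int(n) for n in bed_layout.split("."))
    if bed_channels > MAX_RENDERER_INPUTS:
        raise ValueError("bed alone exceeds the input budget")
    return MAX_RENDERER_INPUTS - bed_channels

# "9.1" and "7.1.2" are two names for the same ten bed channels.
print(objects_available("9.1"), objects_available("7.1.2"))
```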

The home-delivery trick: spatial coding

Shipping 128 separate channels of audio to a living room is not practical; the bitrate would be huge. So for home delivery, Atmos uses spatial coding: an algorithm groups the bed channels and objects (up to 128 of them) into roughly 12 to 16 perceptually distinct "clusters," then carries those clusters plus metadata inside a normal codec. The Dolby Digital Plus path (technically E-AC-3 with "Joint Object Coding," or JOC) is how a streaming service squeezes Atmos into a bitrate a home internet connection can handle while staying backward-compatible: a device that does not understand Atmos still decodes the underlying 5.1.

This is the trick worth remembering for any streaming product: object audio is captured with up to 128 elements in the studio, then intelligently reduced to ~12–16 clusters for delivery. You are not streaming 128 channels.

A worked number, to make the bandwidth concrete. A naive "stream every object separately" approach at, say, 48 kHz and 24 bits per sample, uncompressed, costs per object:

48,000 samples/s × 24 bits/sample = 1,152,000 bits/s = 1.152 Mbps per object
128 objects × 1.152 Mbps        = 147.5 Mbps

147 Mbps for audio alone is absurd for home streaming. Spatial coding down to ~16 clusters carried in E-AC-3 brings practical Atmos delivery to a few hundred kbps, a reduction of several hundred to one (at 500 kbps, for instance, 147.5 Mbps / 0.5 Mbps ≈ 295). That is the difference between "studio format" and "delivery format," and it is why you should never assume the studio object count equals the streamed channel count.
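The same arithmetic as a small script, so you can plug in your own figures. The sample rate and bit depth come from the worked number above; the delivery bitrates in the loop are illustrative assumptions, not measurements of any service.

```python
SAMPLE_RATE_HZ = 48_000
BITS_PER_SAMPLE = 24

def uncompressed_bps(streams):
    """Raw PCM bitrate, in bits per second, for a number of mono streams."""
    return streams * SAMPLE_RATE_HZ * BITS_PER_SAMPLE

def reduction_ratio(streams, delivery_kbps):
    """How many times smaller the delivery bitrate is than the raw PCM."""
    return uncompressed_bps(streams) / (delivery_kbps * 1000)

print(uncompressed_bps(1) / 1e6, "Mbps per object")
print(uncompressed_bps(128) / 1e6, "Mbps for 128 objects")

# Hypothetical delivery bitrates, for comparison only.
for kbps in (384, 500, 768):
    print(kbps, "kbps ->", round(reduction_ratio(128, kbps)), "to 1")
```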

DTS:X — the other object system

Dolby's main rival, DTS:X, is also object-based and also layout-flexible: it does not lock the mix to a fixed speaker count and lets the decoder adapt to the room. For a product decision, treat DTS:X and Atmos as the same category (object-based, renderer-driven, immersive) competing on ecosystem and licensing rather than on the underlying philosophy.

Approach 3: Scene-based audio — name the whole sound field

The third approach does not name speakers or sounds. It captures the entire sound field at a point in space as a set of mathematical layers, then reconstructs it for any speaker layout — or, crucially, rotates it when the listener turns their head.

This is ambisonics. Here is the plain-language version before any math. Imagine standing at one spot and asking, "from every direction around me, how much sound is arriving and from where?" Ambisonics answers that question with a stack of channels that, taken together, describe the sound arriving from all directions at once. It is not tied to speakers at all; it is a description of the sound at your head.

The lowest, simplest version is First Order Ambisonics (FOA), and it uses exactly four channels, conventionally called W, X, Y, and Z. W is the omnidirectional channel: the total sound pressure, "how much sound overall." X, Y, and Z capture how that sound is distributed along the three axes: front-back, left-right, and up-down. Four channels, and you have a crude but complete 360° description of the sound field.
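For readers who want to see the four channels as numbers, here is a minimal sketch of encoding one mono sample into first-order ambisonics. It assumes the SN3D scaling discussed later in this article (W carries the sample unscaled) and a convention where azimuth is measured counter-clockwise from straight ahead, so positive azimuth is to the left. Other conventions exist and scale W differently.

```python
import math

def encode_foa(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one mono sample arriving from a direction into (W, X, Y, Z).

    Assumptions: SN3D scaling; azimuth counter-clockwise from the front;
    elevation positive upwards.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                               # omnidirectional: how much sound overall
    x = sample * math.cos(az) * math.cos(el)  # front-back
    y = sample * math.sin(az) * math.cos(el)  # left-right
    z = sample * math.sin(el)                 # up-down
    return w, x, y, z

print(encode_foa(1.0, 0))       # source straight ahead: energy in W and X
print(encode_foa(1.0, 90))      # source hard left: energy in W and Y
print(encode_foa(1.0, 0, 90))   # source overhead: energy in W and Z
```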

To sharpen the picture, you add more layers. This is Higher Order Ambisonics (HOA). The relationship between the order and the channel count is a clean formula, and showing it once makes the whole idea click:

channels = (order + 1)²
order 1 (FOA): (1 + 1)² = 2² = 4 channels
order 2:       (2 + 1)² = 3² = 9 channels
order 3:       (3 + 1)² = 4² = 16 channels
order 7:       (7 + 1)² = 8² = 64 channels

Higher order means finer spatial resolution — sounds localise more precisely — at the cost of more channels to store and transmit. Third order (16 channels) is a common sweet spot for high-quality VR; first order (4 channels) is what most consumer 360° video uses.
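The formula above, and its inverse, as a short sketch. The function names are illustrative; the inverse is handy when you are handed a multichannel file and want to know which full-sphere order it could be.

```python
import math

def ambisonic_channels(order):
    """Channel count for a full-sphere ambisonic signal of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2

def ambisonic_order(channels):
    """Inverse: the order implied by a channel count, or an error if none fits."""
    if channels < 1:
        raise ValueError("channel count must be positive")
    root = math.isqrt(channels)
    if root * root != channels:
        raise ValueError(f"{channels} channels is not a full ambisonic order")
    return root - 1

for order in (1, 2, 3, 7):
    print(order, ambisonic_channels(order))
```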

The analogy: ambisonics is a panoramic photo of sound. A first-order capture is a low-resolution panorama — you can tell roughly where things are. A third-order capture is a higher-resolution panorama — sharper edges, clearer positions. And like a panorama, you can pan and rotate the whole image after the fact. That last property is the magic.

[Figure: a diagram of ambisonics orders showing how channel count grows: FOA with 4 channels (W, X, Y, Z), then 9 channels at order 2, then 16 at order 3, with a head-rotation arrow showing the field rotates as a whole.]
Figure 3. Scene-based audio: more order means more channels and sharper spatial resolution. The whole field rotates with the listener's head.

Why VR and 360° video want scene-based audio

When you turn your head in a VR headset, the sound field must rotate relative to your head so that sources stay fixed in the world: the sound that was in front is now to your side. With channel-based audio that is hard; with object-based audio it is possible, but you must re-render every object. With scene-based audio it is a single, cheap mathematical rotation applied to the whole field at once. Rotate the four (or sixteen) ambisonic channels with one matrix operation and the world turns correctly. That is why head-tracked VR audio is almost always scene-based at its core.
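For first order and a horizontal head turn (yaw), that rotation is a 2×2 matrix applied to X and Y, while W and Z pass through untouched. The sketch below assumes X points forward, Y points left, and positive yaw means the listener turns to the left; the function name is illustrative, and pitch and roll are left out to keep it short.

```python
import math

def rotate_foa_for_head_yaw(w, x, y, z, head_yaw_deg):
    """Counter-rotate one FOA sample frame for a listener who turned their head.

    Assumptions: X forward, Y left, positive yaw = head turns left.
    Only X and Y change; W (omni) and Z (height) are unaffected by yaw.
    """
    t = math.radians(head_yaw_deg)
    x_rot = x * math.cos(t) + y * math.sin(t)
    y_rot = -x * math.sin(t) + y * math.cos(t)
    return w, x_rot, y_rot, z

# A source straight ahead: W=1, X=1, Y=0, Z=0.
front = (1.0, 1.0, 0.0, 0.0)

# The listener turns 90 degrees left, so the source should now be on the right
# (negative Y under the convention above).
print(rotate_foa_for_head_yaw(*front, head_yaw_deg=90))
```

The cost is the same whether the scene contains one sound or a thousand, because the rotation touches only the ambisonic channels, never the individual sources.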

YouTube's spatial-audio support for 360° and VR video is the mainstream example. YouTube accepts First Order Ambisonics as a 4-channel track (ordered W, Y, Z, X) at a 48 kHz sample rate, or FOA plus a "head-locked" stereo pair (a 6-channel track, W, Y, Z, X, L, R, at a minimum 768 kbps) for narration or music that should not rotate with the head. When a viewer drags the 360° view, YouTube rotates the ambisonic field in real time.

A note on exchange formats: ACN and SN3D

If your engineers work with ambisonics, two acronyms will come up. ACN (Ambisonic Channel Number) defines the order the channels are stored in — instead of the historical W, X, Y, Z names, each component gets a number. SN3D (Schmidt Semi-Normalisation) defines how the channels are scaled relative to each other; its practical benefit is that no component ever exceeds the level of the omnidirectional W channel, which helps prevent clipping. The widely used open exchange format AmbiX — introduced in 2011 by researchers at IEM Graz — combines ACN ordering with SN3D normalisation and stores the channels as plain PCM inside a WAV/CAF file. You do not need the math to make a product call, but knowing "we use AmbiX, ACN/SN3D" tells you a team is working in scene-based audio.
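ACN ordering follows a one-line rule: a component of spherical-harmonic degree l and index m (with m running from -l to l) gets channel number l² + l + m. The sketch below applies that rule to the four first-order components, which is where the W, Y, Z, X order comes from. The helper names are illustrative.

```python
def acn_index(degree, index):
    """Ambisonic Channel Number for spherical-harmonic degree l and index m."""
    if abs(index) > degree:
        raise ValueError("index must lie between -degree and +degree")
    return degree * degree + degree + index

# (degree, index) of each traditional first-order component.
FOA_COMPONENTS = {"W": (0, 0), "Y": (1, -1), "Z": (1, 0), "X": (1, 1)}

def reorder_wxyz_to_acn(w, x, y, z):
    """Take channels in the historical W, X, Y, Z order and return them in ACN order."""
    named = {"W": w, "X": x, "Y": y, "Z": z}
    ordered_names = sorted(named, key=lambda n: acn_index(*FOA_COMPONENTS[n]))
    return [named[n] for n in ordered_names]

print(reorder_wxyz_to_acn("W", "X", "Y", "Z"))
```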

The two ADM special cases: matrix and binaural

Two more ADM types round out the picture; both are best understood as derivatives of the main three.

Matrix-based audio combines channels with simple arithmetic to make other channels. The classic example is Mid/Side encoding used in FM radio: the "Mid" channel is left-plus-right, the "Side" channel is left-minus-right. If the weak Side signal is lost, you still recover mono from Mid. The Lt/Rt downmix that folds 5.1 into two channels is also matrix-based. Think of matrix as a reversible folding trick layered on top of channel-based audio.
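The Mid/Side arithmetic fits in a few lines. This sketch uses the plain sum and difference described above, with the halving done on decode; some implementations scale on encode instead, so treat the exact factors as an assumption.

```python
def ms_encode(left, right):
    """Left/right sample pair to (mid, side)."""
    return left + right, left - right

def ms_decode(mid, side):
    """(mid, side) back to the original left/right pair."""
    return (mid + side) / 2, (mid - side) / 2

left, right = 0.25, -0.5
mid, side = ms_encode(left, right)

print(ms_decode(mid, side))   # full recovery when both signals arrive
print(mid / 2)                # mono fallback when Side is lost
```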

Binaural audio is the two-channel signal designed specifically for headphones, engineered so your two ears perceive full 3D placement. It is usually the output of a renderer rather than a production format: a renderer takes object-based or scene-based audio and produces a binaural pair using a model of how your head and ears colour sound from each direction (a "head-related transfer function," covered in our ambisonics and HRTF article). Apple Spatial Audio is the consumer face of this — an Atmos (object-based) master rendered to head-tracked binaural for AirPods.

Putting the three side by side

Here is the comparison a product person actually needs.

Criterion	Channel-based	Object-based	Scene-based
What the file stores	Speaker feeds	Sounds + position metadata	Sound-field layers (W, X, Y, Z…)
Tied to a speaker layout?	Yes — one file per layout	No — renderer adapts	No — decode to any layout
Playback compute cost	Lowest (none)	Medium (renderer)	Medium (decoder/rotation)
Head rotation (VR)	Hard	Possible, re-render each object	One cheap rotation of the field
Precise moving sounds	Limited	Excellent	Good, improves with order
Diffuse ambience / "being there"	Good	OK	Excellent
Typical channel count	2 to 24	Up to 128 inputs in the studio (bed channels + objects)	4 (FOA) to 16+ (HOA)
Flagship products	Stereo, 5.1, 7.1, 22.2	Dolby Atmos, DTS:X	Ambisonics, YouTube/VR audio
Standard anchors	ITU-R BS.775	ETSI/Dolby; MPEG-H objects	ITU-R BS.2076 (HOA)

Table 1. The three approaches compared on the axes that drive a buy/build decision.

A common mistake worth flagging directly:

Pitfall — "Atmos is just more channels." It is not. Atmos is object-based: it stores sounds and positions, then renders to your speakers at playback. Treating it as "7.1.4 channels" and storing twelve fixed feeds throws away the adaptivity that is the entire point — and it will sound wrong on any layout that is not exactly 7.1.4. If a vendor describes Atmos as a fixed channel count, they have misunderstood the format.

MPEG-H: the standard that does all three at once

If the three approaches feel like rivals, MPEG-H 3D Audio is the standard that refuses to choose. Specified as ISO/IEC 23008-3 (MPEG-H Part 3), it carries channel-based, object-based, and scene-based (HOA) audio in a single bitstream, and lets a content creator mix them — a channel bed, plus moving objects, plus an ambisonic ambience layer, all together. It supports up to 64 loudspeaker output channels and 128 codec core channels.

The practical payoff for viewers is interactivity: because objects carry metadata, an MPEG-H decoder can let the user turn up the dialogue, choose a different commentary language, or adjust the balance — all from the same broadcast stream. That is why broadcasters adopted it. MPEG-H is the sole audio system of South Korea's UHD ATSC 3.0 ("Next Gen TV") service. In the United States, ATSC 3.0's audio standard (A/342) permits both MPEG-H and Dolby's AC-4, and US deployments have largely used AC-4. Brazil's next-generation system, branded DTV+ / "TV 3.0," was adopted in August 2025 and also builds on ATSC 3.0 technology. So the same three-way taxonomy underpins the broadcast standards now rolling out worldwide.
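The interactivity idea can be sketched without any MPEG-H machinery. The code below is not the MPEG-H API; it is a hypothetical mixer that assumes each element arrives as a separately labelled sample and that the viewer's preference is a gain offset in decibels per label. A real decoder would also limit how far each gain can move, based on metadata set by the broadcaster.

```python
def db_to_gain(db):
    """Convert a decibel offset to a linear amplitude factor."""
    return 10 ** (db / 20)

def mix_with_user_gains(elements, user_gain_db):
    """Sum labelled samples, applying the viewer's per-label gain offsets.

    elements:      {label: sample value}, e.g. {"bed": 0.2, "dialogue": 0.1}
    user_gain_db:  {label: offset in dB}; labels not listed stay at 0 dB.
    """
    return sum(
        sample * db_to_gain(user_gain_db.get(label, 0.0))
        for label, sample in elements.items()
    )

frame = {"bed": 0.2, "dialogue": 0.1}
print(mix_with_user_gains(frame, {}))                 # broadcaster's default balance
print(mix_with_user_gains(frame, {"dialogue": 6.0}))  # viewer boosts the dialogue
```

With channel-based delivery the dialogue is already baked into the speaker feeds, so there is nothing separate left to turn up; this kind of control only exists because the object survives to the decoder.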

How to choose, in one paragraph

For most streaming and OTT products today, the spine is object-based: author in Dolby Atmos (or MPEG-H where broadcast requires it), deliver via spatial coding so the bitrate stays sane, and let devices render down to whatever the viewer owns. Keep a channel-based stereo and 5.1 fallback for older devices — that is just good hygiene. Reach for scene-based ambisonics specifically when head tracking is in play: VR, AR, and 360° video, where the field must rotate with the viewer. Pure channel-based audio is now mostly a fallback and a capture format, not the thing you build your premium tier around.

[Figure: a top-down decision tree that starts at "is the viewer's head tracked?" branching to scene-based for VR/AR, then "do sounds need precise placement / device adaptivity?" branching to object-based, else channel-based fallback.]
Figure 4. A four-question decision path for picking an audio approach in 2026.
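The core of that decision path, reduced to the two questions spelled out in this section, can be written as a tiny function. It is a deliberately simplified sketch with hypothetical parameter names; a real scoping exercise would also weigh licensing, target devices, and bitrate budgets.

```python
def choose_audio_approach(head_tracked, needs_placement_or_adaptivity):
    """Pick a primary audio approach from two yes/no answers.

    head_tracked:                   VR, AR, or 360-degree video where the field must rotate.
    needs_placement_or_adaptivity:  precise moving sounds, or one master for many devices.
    """
    if head_tracked:
        return "scene-based"
    if needs_placement_or_adaptivity:
        return "object-based"
    return "channel-based"

print(choose_audio_approach(head_tracked=True, needs_placement_or_adaptivity=False))
print(choose_audio_approach(head_tracked=False, needs_placement_or_adaptivity=True))
print(choose_audio_approach(head_tracked=False, needs_placement_or_adaptivity=False))
```

Whatever the function returns, the advice above still applies: keep channel-based stereo and 5.1 renditions as a fallback.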

Where Fora Soft fits in

Fora Soft has built video products since 2005 across streaming, OTT/Internet TV, video conferencing, e-learning, telemedicine, surveillance, and AR/VR — and audio is the part of every one of those pipelines that users notice first when it breaks. In OTT and streaming work we wire up object-based delivery (Atmos via E-AC-3 JOC, or MPEG-H for broadcast targets) with channel-based stereo and 5.1 fallbacks so a title plays correctly from a phone to a home cinema. In AR/VR and 360° projects we work in scene-based ambisonics so the sound field tracks the headset. The point of knowing all three approaches is not academic: it is how you scope an audio feature that ships on the devices your users actually have.

Call to action

Talk to an audio engineer: book a 30-minute scoping call to talk through your channel- vs object- vs scene-based audio plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Channel vs Object vs Scene: Decision Cheat Sheet — One page: what each of the three audio approaches stores, where each wins, the 5.1 / 7.1.4 / FOA / HOA notation decoded, and a four-question path to pick an approach in 2026.

References

Recommendation ITU-R BS.2076-3 (02/2025), "Audio Definition Model." The controlling standard that formally defines channel-based, object-based, scene-based (HOA), matrix, and binaural audio types. Tier 1 (official standard). https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2076-3-202502-I!!PDF-E.pdf
EBU ADM Guidelines, "Types of Audio." Open, plain-language companion to ITU-R BS.2076; source for the five-type model, the DirectSpeakers naming, and the HOA component-count examples (FOA = 4 components; 3rd order = 16). Tier 3 (standards-body educational). https://adm.ebu.io/background/audio_types.html
ISO/IEC 23008-3 (MPEG-H Part 3), "3D Audio." Defines MPEG-H 3D Audio carrying channel, object, and HOA elements in one bitstream; up to 64 loudspeaker channels and 128 codec core channels. Tier 1 (official standard); catalogue/abstract read, full normative text paywalled. https://www.iso.org/standard/83525.html
Recommendation ITU-R BS.775 (5.1 reference layout). The 3/2 loudspeaker arrangement underlying channel-based 5.1. Tier 1. https://www.itu.int/rec/R-REC-BS.775
ATSC A/342, "Next Generation Audio" (ATSC 3.0). Specifies that ATSC 3.0 audio may use AC-4 or MPEG-H 3D Audio. Tier 1. https://www.atsc.org/atsc-documents/type/3-0-standards/
Dolby, "Dolby Atmos Renderer" product documentation. Source for the 128-input renderer limit (beds + objects) and the bed/object hybrid model. Tier 3 (first-party vendor). https://professional.dolby.com/product/dolby-atmos-content-creation/dolby-atmos-renderer/
Wikipedia, "Dolby Atmos." Used only for orientation on the spatial-coding cluster count (~12–16 elements) and the E-AC-3 JOC home-delivery path; the underlying claims trace to Dolby documentation. Tier 6 (navigational). https://en.wikipedia.org/wiki/Dolby_Atmos
Google / YouTube Help, "Use spatial audio in 360-degree and VR videos." First-party source for YouTube's First Order Ambisonics support (4-channel W,Y,Z,X at 48 kHz; FOA + head-locked stereo 6-channel, min 768 kbps). Tier 3 (first-party platform docs). https://support.google.com/youtube/answer/6395969
C. Nachbar, F. Zotter, E. Deleflie, and A. Sontacchi, "AmbiX - A Suggested Ambisonics Format" (Ambisonics Symposium 2011, IEM Graz). Source for the AmbiX exchange format using ACN ordering and SN3D normalisation. Tier 5 (peer-reviewed). https://ambisonics.iem.at/proceedings-of-the-ambisonics-symposium-2011/ambix-a-suggested-ambisonics-format

Note on source hierarchy (per our research standard): where a vendor description and the standard disagree, the article follows the standard. The three-type taxonomy here is taken from ITU-R BS.2076, not from any single vendor's framing; Atmos and DTS:X are then described as products built within that taxonomy.