Why this matters

If you run an over-the-top (OTT) video service, a music app, or a live event platform, "add Atmos" sounds like a single checkbox and is actually a pipeline decision. The codec you choose decides which devices light up, how much storage and content-delivery cost you add, and how your loudness behaves across phones, soundbars, and headphones. This article explains what the major services actually ship in 2026, why they made those choices, and where the real engineering cost sits. By the end, you will be able to scope an immersive-audio feature and talk to your encoding vendor without bluffing.

First, what "immersive audio" actually means

Stereo gives you two channels: left and right. Surround sound adds more speakers around you — the common layout written as "5.1" means five full-range speakers (front left, front right, center, rear left, rear right) plus one low-frequency effects channel, the ".1". Immersive audio, also called three-dimensional or object-based audio, adds a third dimension: height. Sound can now come from above you, not just around you.

There are two ways to describe a sound scene, and the difference drives everything that follows.

The older way is channel-based: the mix is a fixed set of speaker feeds. A 7.1.4 mix — seven ear-level speakers, one low-frequency channel, four height speakers — is twelve separate audio channels, each pre-assigned to a speaker. If your room has a different speaker count, something has to fold those twelve channels down.
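The layout notation can be checked with a few lines of Python. This is a minimal sketch; the function name `channel_count` is made up for illustration, and it assumes the string always lists ear-level speakers, then LFE channels, then (optionally) height speakers.

```python
def channel_count(layout: str) -> int:
    """Count discrete channels in a layout string such as "5.1" or "7.1.4".

    Fields are: ear-level speakers, LFE channels, optional height speakers.
    """
    parts = [int(p) for p in layout.split(".")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"unexpected layout: {layout!r}")
    return sum(parts)


for layout in ("5.1", "7.1.4", "9.1.6"):
    print(layout, "->", channel_count(layout), "channels")
```

A 7.1.4 string yields twelve, matching the count in the paragraph above.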

The newer way is object-based: instead of speaker feeds, the mix stores individual sounds (a helicopter, a line of dialogue) as "objects", each with metadata describing where it should be in the room and when. The metadata that travels with an object — its position, size, and level over time — is called Object Audio Metadata (OAMD). At playback, a piece of software called a renderer reads the objects and the metadata and places each sound using whatever speakers or headphones you actually have. The same mix plays correctly on a 5.1 soundbar, a 7.1.4 home theater, or two AirPods.
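To make the renderer idea concrete, here is a toy sketch in Python. It is not how the Dolby renderer works: it assumes a flat ring of speakers described only by azimuth, and it uses simple constant-power panning between the two speakers that bracket the object. The names `AudioObject` and `render_gains` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class AudioObject:
    name: str
    azimuth_deg: float  # assumed convention: 0 = straight ahead, negative = left
    gain: float = 1.0


def render_gains(obj, speaker_azimuths):
    """Return one gain per speaker for this object (toy constant-power panner)."""
    speakers = sorted(set(speaker_azimuths))
    if len(speakers) < 2:
        raise ValueError("need at least two speakers")
    # Clamp the object to the arc the speakers actually cover.
    az = min(max(obj.azimuth_deg, speakers[0]), speakers[-1])
    gains = {s: 0.0 for s in speakers}
    for left, right in zip(speakers, speakers[1:]):
        if left <= az <= right:
            frac = (az - left) / (right - left)
            gains[left] = obj.gain * math.cos(frac * math.pi / 2)
            gains[right] = obj.gain * math.sin(frac * math.pi / 2)
            break
    return gains


helicopter = AudioObject("helicopter", azimuth_deg=20.0)
print(render_gains(helicopter, [-30, 30]))                 # two-speaker setup
print(render_gains(helicopter, [-110, -30, 0, 30, 110]))   # five-speaker ring
```

The object and its metadata are identical in both calls; only the speaker list changes, which is the property the paragraph above describes.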

Dolby Atmos is the best-known object-based system. It uses a hybrid model: a bed of channel-based audio (think of the bed as the background layer: ambience, music, room tone) plus up to 118 objects layered on top. MPEG-H 3D Audio, the standard behind Sony's 360 Reality Audio, can combine channels, objects, and scene-based audio (Higher-Order Ambisonics, HOA) in one stream. Both are "immersive"; they differ in how the scene is described and who licenses the technology.

Remember this distinction, because it explains the whole article: a streaming service stores one object-based master and lets each device render it. That is why "Atmos on a phone" and "Atmos on a 7.1.4 home theater" can both come from the same file.

[Figure: channel-based audio sends fixed speaker feeds; object-based audio sends sounds plus position metadata that a renderer places on whatever speakers the viewer has.]

Figure 1. Channel-based vs object-based audio: the channel mix is locked to a speaker map, the object mix is rebuilt for the listener's actual setup.

The bandwidth problem, and the trick that solves it

Here is the obvious tension. A full Atmos master is large. The studio works with the audio as a 48 kHz, 24-bit file with a bed plus dozens of objects — easily tens of megabits per second. You cannot push that down a home internet connection alongside 4K video and expect it to stay smooth. Netflix recommends a connection of only about 3 Mbps for Atmos titles, and the video already eats most of that.
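The "tens of megabits" figure follows from uncompressed PCM arithmetic. The sketch below assumes the 48 kHz, 24-bit format named above; the channel counts (2, 12, and 40) are illustrative choices, not figures from any particular master.

```python
def pcm_bitrate_bps(channels: int, sample_rate_hz: int = 48_000, bit_depth: int = 24) -> int:
    """Uncompressed PCM bitrate: one sample per channel per tick."""
    return channels * sample_rate_hz * bit_depth


# Illustrative channel counts: stereo, a 7.1.4 channel mix, a bed plus dozens of objects.
for channels in (2, 12, 40):
    print(f"{channels:>3} channels: {pcm_bitrate_bps(channels) / 1e6:.3f} Mbps")
```

Each channel costs 1.152 Mbps on its own, so a master with a few dozen bed and object tracks is far beyond what a streaming audio budget allows.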

The trick that makes immersive audio shippable is called Joint Object Coding (JOC). The idea is to not transmit every object as its own audio stream. Instead, the encoder renders the immersive scene down to a normal surround bed — a 5.1 or 7.1 channel mix — and compresses that bed with a conventional codec. Alongside the bed it sends a small set of parameters: instructions that tell the decoder how to pull the original objects back out of that downmix. The audio data stays roughly the size of a surround stream; the "3D-ness" rides along as lightweight side data.

A kitchen analogy: rather than shipping every ingredient in a separate box, you ship the finished soup plus a recipe card explaining how to separate it back into ingredients. The soup is the surround bed; the recipe card is the JOC metadata.
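The soup-and-recipe idea can be shown with a deliberately simplified Python toy. This is not the JOC algorithm: it assumes a single downmix channel and one gain per object per block of samples, whereas a production codec uses a multichannel bed and much finer-grained parameters. The function names `encode` and `decode` and the sample values are invented for illustration.

```python
def encode(objects, block):
    """Downmix objects to one channel and derive one gain per object per block."""
    length = len(objects[0])
    mix = [sum(obj[i] for obj in objects) for i in range(length)]
    side = []  # side[b][k] is the gain for object k in block b
    for start in range(0, length, block):
        seg = mix[start:start + block]
        energy = sum(x * x for x in seg)
        gains = []
        for obj in objects:
            # Least-squares gain that best maps the downmix block onto this object.
            corr = sum(o * m for o, m in zip(obj[start:start + block], seg))
            gains.append(corr / energy if energy else 0.0)
        side.append(gains)
    return mix, side


def decode(mix, side, block):
    """Rebuild an estimate of each object by scaling the downmix block by block."""
    rebuilt = [[] for _ in side[0]]
    for b, gains in enumerate(side):
        seg = mix[b * block:(b + 1) * block]
        for k, gain in enumerate(gains):
            rebuilt[k].extend(gain * x for x in seg)
    return rebuilt


helicopter = [0.5, -0.5, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0]
dialogue = [0.0, 0.0, 0.0, 0.0, 0.2, 0.4, -0.2, -0.4]
mix, side = encode([helicopter, dialogue], block=4)
rebuilt = decode(mix, side, block=4)
print("side data:", side)
print("helicopter estimate:", rebuilt[0])
```

The transmitted audio is the single `mix` list; the side data is four numbers. Recovery is exact here only because the two sounds never overlap in time. When objects overlap, this toy returns an approximation, which is the trade a parametric scheme makes for its small size.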

On the video side, the codec carrying that bed is Dolby Digital Plus, also written E-AC-3 (Enhanced AC-3). It is standardized as ETSI TS 102 366; the JOC extension lives in Annex E of that standard. E-AC-3 can carry up to fifteen channels and run as high as 6.144 Mbps, but for streaming it is configured far lower. The full name of the streaming format is "Dolby Digital Plus with Joint Object Coding", commonly shortened to DD+ JOC or E-AC-3 JOC.

One property of DD+ JOC matters more than any other for a service operator: it is backward compatible. A decoder that does not understand Atmos simply reads the 5.1 channel core and ignores the JOC side data. A decoder that does understand Atmos reads both and rebuilds the height. This means a service ships one stream, not two, and it plays correctly on a brand-new height-capable receiver and a ten-year-old 5.1 receiver alike. That single fact is the main reason streaming standardized on JOC rather than a separate immersive-only stream.
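The backward-compatibility behavior amounts to a single branch in the decoder. The sketch below models it with hypothetical names (`DdpFrame`, `decoded_output`); a real bitstream parser is far more involved, and the byte payload here is a stand-in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DdpFrame:
    core_channels: int               # e.g. 6 for a 5.1 core
    joc_side_data: Optional[bytes]   # None when the stream carries no JOC payload


def decoded_output(frame: DdpFrame, decoder_supports_joc: bool) -> str:
    """Describe what a decoder produces from one frame (illustrative only)."""
    if decoder_supports_joc and frame.joc_side_data is not None:
        return "immersive (core + JOC reconstruction)"
    # A legacy decoder never looks at the side data, so it plays the core as-is.
    return f"{frame.core_channels}-channel core only"


frame = DdpFrame(core_channels=6, joc_side_data=b"\x01")
print(decoded_output(frame, decoder_supports_joc=True))
print(decoded_output(frame, decoder_supports_joc=False))
```

Both calls consume the same frame, which is why the service needs only one stream.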

[Figure: a studio Atmos master is rendered to a surround bed, compressed as Dolby Digital Plus, with JOC side data carrying the height information; the decoder rebuilds the immersive scene.]

Figure 2. The JOC pipeline: how a multi-megabit Atmos master becomes a sub-megabit stream that still reconstructs height.

Doing the bitrate arithmetic

It helps to see why the trick is necessary in numbers. Take a 90-minute feature film, which is 5,400 seconds long.

A stereo dub at 128 kbps costs:

128,000 bits/s × 5,400 s ÷ 8 bits/byte = 86,400,000 bytes ≈ 86 MB

A DD+ JOC Atmos track at Netflix's 768 kbps ceiling costs:

768,000 bits/s × 5,400 s ÷ 8 bits/byte = 518,400,000 bytes ≈ 518 MB

So an Atmos track is about six times the size of a stereo language track — meaningful, but small next to the video. The same 90-minute film at a 5 Mbps 1080p video bitrate costs:

5,000,000 bits/s × 5,400 s ÷ 8 bits/byte = 3,375,000,000 bytes ≈ 3.4 GB

The Atmos audio is roughly 15% of the video size. Audio is cheap relative to video; the cost of immersive audio is not the bytes. We return to where the cost actually sits at the end.
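The same arithmetic in Python, for anyone who wants to plug in their own bitrates and durations. The helper name `track_size_bytes` is illustrative; the bitrates are the ones used above.

```python
def track_size_bytes(bitrate_bps: int, duration_s: int) -> int:
    """Size of a constant-bitrate track in bytes."""
    return bitrate_bps * duration_s // 8


DURATION_S = 90 * 60  # 90-minute feature

stereo = track_size_bytes(128_000, DURATION_S)
atmos = track_size_bytes(768_000, DURATION_S)
video = track_size_bytes(5_000_000, DURATION_S)

print(f"stereo: {stereo / 1e6:.1f} MB")
print(f"atmos:  {atmos / 1e6:.1f} MB ({atmos / stereo:.0f}x stereo)")
print(f"video:  {video / 1e9:.2f} GB (audio is {atmos / video:.0%} of video)")
```

These are decimal megabytes and gigabytes, and the calculation ignores container overhead, so treat the results as planning estimates.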

What each service actually ships in 2026

The five services in this article split into two camps: video apps that ship DD+ JOC to living-room devices, and music apps that target headphones with binaural rendering (AC-4 immersive stereo on Tidal, Apple's own renderer on Apple Music). Understanding the split saves you from a common planning mistake: assuming "Atmos" means one thing.

Netflix and Disney+ — DD+ JOC for the living room

Netflix delivers Dolby Atmos as DD+ JOC. The bitrate ceiling is 768 kbps for Premium subscribers, raised from 448 kbps in May 2019. Netflix describes 768 kbps as the point above which "additional quality is imperceivable" — its perceptually transparent ceiling, which is explicitly not lossless. To get Atmos, a viewer needs an Ultra-HD-capable Netflix plan, an Atmos-capable device, an Atmos-capable audio system, and streaming quality set to High or Auto.

Disney+ uses the same DD+ JOC approach for living-room playback. Both services pick E-AC-3 JOC for the same reasons: it is widely supported across TVs, soundbars, receivers, game consoles, and streaming boxes, and the single backward-compatible stream simplifies their delivery.

Apple TV+ — DD+ JOC plus headphone "Spatial Audio"

Apple TV+ delivers immersive film and TV soundtracks as Dolby Atmos over DD+ JOC to a home theater, the same as Netflix. The part that confuses people is Apple Spatial Audio, which is a different thing layered on top.

Apple Spatial Audio is not a codec or a delivery format. It is a playback-side processing system. It takes a 5.1, 7.1, or Atmos source and renders it into a two-channel headphone mix, modeling how that sound would reach your ears in a virtual room. When you use AirPods, sensors inside them (a gyroscope and accelerometer) track your head so the sound field stays anchored in space as you turn — this is "head tracking". The codec carrying the audio to the device is still DD+ JOC or AAC; the spatial effect is computed locally on the device.
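The core of head tracking is a small piece of geometry: the renderer subtracts the head's rotation from each source direction so the source stays fixed in the room. The sketch below covers yaw only and assumes angles in degrees with positive values to the listener's right; it says nothing about how Apple implements this, and `relative_azimuth` is a hypothetical name.

```python
def relative_azimuth(source_deg: float, head_yaw_deg: float) -> float:
    """Direction of a room-anchored source relative to where the head points.

    The result is wrapped into the range [-180, 180).
    """
    return (source_deg - head_yaw_deg + 180.0) % 360.0 - 180.0


# Dialogue anchored straight ahead (0 degrees) while the listener turns right.
for yaw in (0.0, 30.0, 90.0):
    print(f"head yaw {yaw:>5.1f} -> dialogue at {relative_azimuth(0.0, yaw):>6.1f}")
```

As the head turns right, the dialogue moves left in the headphone mix by the same angle, which is what keeps it pinned to the screen.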

So on Apple TV+, the same Atmos master serves three outputs: a full immersive mix on a home-theater system, a head-tracked binaural mix on AirPods, and a plain stereo or 5.1 fold-down on anything else.

Tidal and Apple Music — AC-4 and binaural rendering for headphones

Music services optimize for headphones, not living rooms, so they made a different codec choice.

Tidal delivers Dolby Atmos Music using Dolby AC-4, the codec that succeeds E-AC-3. AC-4 was first standardized by ETSI as TS 103 190 in 2014; its immersive and personalized features are defined in ETSI TS 103 190-2, whose latest revision (V1.3.1) was published in July 2025. For music on headphones, AC-4 runs in an "immersive stereo" mode, in which the binaural Atmos experience is pre-rendered for two-channel headphone playback. Tidal added this on its HiFi tier.

Apple Music takes a related path. It streams Atmos at 768 kbps lossy, then uses Apple's own renderer — built on head-related transfer functions, the mathematical model of how a sound from a given direction reaches each ear — to derive a binaural headphone mix from a virtual 7.1.4 room. A practical caveat worth knowing: Atmos on Apple Music is lossy at 768 kbps, while the same track's stereo version can be offered as lossless. A listener who picks Atmos is trading some stereo fidelity for spatial effect.

Common mistake. Treating "Atmos" as a single deliverable. The Atmos a viewer hears on a Disney+ soundbar (DD+ JOC, channel-bed reconstruction) and the Atmos a listener hears on Tidal through earbuds (AC-4 immersive stereo, binaural render) are produced by different codecs and rendered by different software. If your spec just says "support Atmos", your encoding vendor cannot scope it. Specify the codec, the target devices, and whether the output is speakers or headphones.

[Figure: Netflix, Disney+, and Apple TV+ ship DD+ JOC to living rooms; Tidal and Apple Music ship AC-4 or lossy Atmos to headphones with on-device binaural rendering.]

Figure 3. Who ships what: the codec and rendering path differ between video apps and music apps.

How a player knows a track is immersive

A streaming player does not guess. The manifest — the text file that lists every audio and video rendition, written as an HLS playlist or a DASH Media Presentation Description — declares each track's codec and channel configuration, and the player picks based on the device's capabilities. Two attributes do the work, and the exact values come from Dolby's own delivery specifications.

For DD+ JOC in HLS, the CODECS attribute carries the value ec-3, and the #EXT-X-MEDIA line carries a CHANNELS attribute. The CHANNELS value is where the immersive signaling lives: 12/JOC declares a 5.1.4 or 7.1.4 Atmos mix, and 16/JOC declares a 9.1.6 mix. The number is the count of decoded channels; the JOC tag tells the player that Joint Object Coding side data is present, so an Atmos-capable decoder knows to rebuild the height.

For AC-4 immersive stereo — the music-and-headphone path — the recommended CHANNELS value is 2/IMSA,ATMOS. The 2 is the two-channel carrier, IMSA marks immersive stereo for headphone rendering, and ATMOS flags the Atmos experience. For MPEG-H 3D Audio (Sony 360 Reality Audio, ATSC 3.0 broadcast), the codec string is mhm1.
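A player or QC tool can read these CHANNELS values with a few lines of string handling. The sketch below covers only the values quoted above, not every form the HLS specification allows; `parse_channels` and `describe` are hypothetical helper names.

```python
def parse_channels(value: str):
    """Split an HLS CHANNELS value into (channel_count, tags)."""
    count, _, rest = value.partition("/")
    tags = [tag for tag in rest.replace("/", ",").split(",") if tag]
    return int(count), tags


def describe(value: str) -> str:
    """Classify a CHANNELS value using the tags discussed in this article."""
    count, tags = parse_channels(value)
    if "JOC" in tags:
        return f"DD+ JOC Atmos, {count} channels signaled"
    if "IMSA" in tags:
        return "AC-4 immersive stereo"
    return f"{count} channels, no immersive tag"


for value in ("6", "12/JOC", "16/JOC", "2/IMSA,ATMOS"):
    print(f"{value:<14} {describe(value)}")
```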

The practical lesson: the bytes alone do not make a track immersive to the player; the signaling does. If the manifest declares ec-3 with CHANNELS="6" instead of 12/JOC, an Atmos-capable client will play the 5.1 core and never reconstruct the height — the most common reason "the Atmos isn't working" turns out to be a manifest bug, not an encode bug.
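A check for exactly this bug is easy to automate. The sketch below scans a multivariant playlist for audio renditions that should be Atmos but lack the JOC tag. The playlist text, the group names, and the function names are all invented for illustration, and the check assumes the packaging job can tell it which GROUP-ID values are meant to carry Atmos.

```python
import re

# Matches KEY=value pairs where the value is quoted or runs to the next comma.
ATTRIBUTE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')

# Hypothetical playlist excerpt; the Atmos rendition is mis-signaled as plain 5.1.
PLAYLIST = """#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="atmos",NAME="English (Atmos)",CHANNELS="6",URI="audio/atmos_en.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="surround",NAME="English 5.1",CHANNELS="6",URI="audio/surround_en.m3u8"
"""


def media_attributes(line: str) -> dict:
    """Parse the attribute list of one #EXT-X-MEDIA line into a dict."""
    body = line.split(":", 1)[1]
    return {key: value.strip('"') for key, value in ATTRIBUTE.findall(body)}


def lint_atmos_signaling(playlist: str, atmos_groups: set) -> list:
    """Return one message per audio rendition that should signal JOC but does not."""
    problems = []
    for line in playlist.splitlines():
        if not line.startswith("#EXT-X-MEDIA:"):
            continue
        attrs = media_attributes(line)
        if attrs.get("TYPE") != "AUDIO" or attrs.get("GROUP-ID") not in atmos_groups:
            continue
        channels = attrs.get("CHANNELS", "")
        if "JOC" not in channels.split("/")[1:]:
            problems.append(f'{attrs.get("NAME")}: CHANNELS="{channels}" lacks the JOC tag')
    return problems


for problem in lint_atmos_signaling(PLAYLIST, {"atmos"}):
    print(problem)
```

Running a check like this in the packaging pipeline turns a customer-facing defect into a failed build.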

Comparing the immersive codecs

| Criterion | DD+ JOC (E-AC-3 JOC) | AC-4 (immersive stereo) | MPEG-H 3D Audio |
| --- | --- | --- | --- |
| Standard | ETSI TS 102 366, Annex E | ETSI TS 103 190-1 / -2 | ISO/IEC 23008-3 |
| Typical streaming use | OTT video (Netflix, Disney+, Apple TV+) | Music headphone Atmos (Tidal, Apple Music) | ATSC 3.0 broadcast, Sony 360RA |
| Scene model | Bed + objects via JOC downmix | Bed + objects, channel/object/immersive | Channel + object + scene (HOA) |
| Typical streaming bitrate | 384–768 kbps | ~256–768 kbps | 256–768 kbps |
| Backward compatible to surround | Yes (5.1 core always present) | Yes (within AC-4) | Yes (within MPEG-H) |
| HLS/DASH codec string | ec-3 | ac-4 | mhm1 |
| Immersive channel signal | CHANNELS="12/JOC", "16/JOC" | CHANNELS="2/IMSA,ATMOS" | profile/level in codec string |
| Licensing | Dolby (proprietary) | Dolby (proprietary) | Fraunhofer/MPEG (proprietary, marketed open) |

The winner for OTT video today is DD+ JOC purely on reach: it plays on the largest set of living-room devices. AC-4 is the better long-term codec — more efficient, more flexible for personalization and dialogue enhancement — and dominates the music-on-headphones use case. MPEG-H leads in broadcast (it is the audio for ATSC 3.0 "NextGen TV") and in object-precise music, but has a smaller OTT footprint.

The production pipeline — where the cost actually sits

The recurring theme of this section: immersive audio is cheap to deliver and expensive to make. Here is the chain.

A mix engineer works in a digital audio workstation with the Dolby Atmos Renderer, which records the mix (bed channels plus object positions over time) into an Atmos master. The universal master deliverable is an ADM BWF file (Audio Definition Model Broadcast Wave Format) at 48 kHz / 24-bit. ADM is the metadata model that describes the objects and their positions; BWF is the wave-file container that carries the audio plus that metadata.

From the ADM BWF master, an encoder produces the streaming deliverable: Dolby Digital Plus with Atmos content (a .ec3 elementary stream) for video, or an AC-4 stream for music. That audio is then multiplexed with the compressed video (H.264 or H.265) into an MP4 and segmented for delivery — the same packaging path any streaming audio track follows. Cloud encoders such as Dolby's Hybrik automate the render-to-DD+-JOC step at scale.

The bottlenecks are human and computational, not network:

The mastering step needs trained engineers and calibrated rooms. An Atmos mix is authored, not auto-generated; a stereo-to-Atmos upmix exists but rarely satisfies a discerning service.

The encode farm must run object-aware encoders, which are heavier than a stereo AAC encode and licensed per-stream or per-title.

Quality control is the quiet cost. Someone has to verify that the height objects survived encoding, that the 5.1 fold-down still sounds right, that loudness is in-target, and that the manifest declares the correct channel signal. Catching a CHANNELS typo in QC is cheap; catching it from customer complaints is not.

Loudness is part of this. Immersive masters still answer to loudness normalization: the practice of delivering every title at a consistent perceived level, measured in LUFS (Loudness Units relative to Full Scale, the perceptual loudness scale where program material reads as a negative number). LKFS is the same unit under a different name. Some streaming delivery specs target around −24 LKFS and apply dialnorm metadata so quiet and loud titles match. Getting Atmos loudness wrong is one of the most-noticed defects, because viewers reach for the remote.
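The normalization step itself is a subtraction. The sketch below assumes the loudness has already been measured by a separate tool and uses the −24 figure above as a default target. Your delivery spec may name a different target, and the function names are illustrative.

```python
def normalization_gain_db(measured_lkfs: float, target_lkfs: float = -24.0) -> float:
    """Gain in dB that moves a title from its measured loudness to the target."""
    return target_lkfs - measured_lkfs


def db_to_linear(gain_db: float) -> float:
    """Convert a dB gain to the linear factor applied to sample amplitudes."""
    return 10 ** (gain_db / 20)


for measured in (-18.0, -24.0, -27.0):
    gain = normalization_gain_db(measured)
    print(f"measured {measured:6.1f} LKFS -> gain {gain:+5.1f} dB (x{db_to_linear(gain):.3f})")
```

A title mastered hot gets turned down and a quiet one gets turned up, which is how consecutive titles end up at a matching level.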

Where Fora Soft fits in

We build video and audio software across OTT and internet TV, video conferencing, e-learning, telemedicine, surveillance, and AR/VR. When a client wants immersive audio in an OTT product, the work is rarely "turn on Atmos" — it is wiring the encoding pipeline so the right codec reaches the right device, making the player declare and select immersive renditions correctly in HLS and DASH, and building the QC checks that catch a mis-signaled manifest before a subscriber does. We have shipped streaming players and packaging pipelines where audio track selection, fallback to surround, and loudness behavior all had to be correct across phones, smart TVs, and headphones. The same discipline applies whether the target is a film service, a live event platform, or a music app.

References

  1. Dolby Laboratories, "What is Dolby Digital Plus JOC (Joint Object Coding)?", Dolby Professional Support — describes the bed-plus-JOC-side-data model and OAMD. Accessed 2026-06-06. https://professionalsupport.dolby.com/s/article/What-is-Dolby-Digital-Plus-JOC-Joint-Object-Coding
  2. ETSI TS 102 366 (Digital Audio Compression / AC-3 and Enhanced AC-3 Standard), Annex E — normative definition of Joint Object Coding carried in E-AC-3. Standards source (tier 1). Cited for the JOC carriage and the E-AC-3 channel/bitrate ceilings. https://www.etsi.org/standards
  3. ETSI TS 103 190-2 V1.3.1 (2025-07), "Digital Audio Compression (AC-4) Standard; Part 2: Immersive and personalized audio". Standards source (tier 1). Cited for AC-4 immersive/personalized modes and the July 2025 revision date. https://www.etsi.org/deliver/etsi_ts/103100_103199/10319002/01.03.01_60/ts_10319002v010301p.pdf
  4. ETSI TS 103 190-1 (2014), "Digital Audio Compression (AC-4) Standard; Part 1" — original AC-4 standardization. Standards source (tier 1). https://www.etsi.org/newsroom/news/783-2014-04-etsi-releases-ac-4-the-new-generation-audio-codec-standard/
  5. Dolby, "Codec type indication for Dolby Atmos content" (Dolby Digital Plus Online Delivery Kit) — HLS ec-3 and CHANNELS="12/JOC" / "16/JOC" signaling. Accessed 2026-06-06. https://ott.dolby.com/OnDelKits/DDP/Dolby_Digital_Plus_Online_Delivery_Kit_v1.4.1/Documentation/Content_Creation/SDM/help_files/topics/c_hls_signal_atmos_ddp.html
  6. Dolby, "Codec type indication for immersive stereo content" (AC-4 Online Delivery Kit) — HLS CHANNELS="2/IMSA,ATMOS" for AC-4 immersive stereo. Accessed 2026-06-06. https://ott.dolby.com/OnDelKits/AC-4/Dolby_AC-4_Online_Delivery_Kit_1.5/Documentation/Specs/AC4_HLS/help_files/topics/hls_playlist_c_codec_indication_ims.html
  7. Netflix Help Center, "Dolby Atmos on Netflix" — plan, device, audio-system, and 3 Mbps connection requirements. Accessed 2026-06-06. https://help.netflix.com/en/node/64066
  8. Netflix Technology / press coverage — DD+ JOC bitrate raised from 448 kbps to 768 kbps (May 2019) and the "additional quality is imperceivable" ceiling. (Tier 4 vendor/secondary; corroborates the bitrate figure cited against Dolby's JOC model in references 1–2.) https://cinemaconfig.com/reference/dd-plus-joc
  9. Dolby Hybrik Documentation, "Dolby Atmos" — render-to-7.1/5.1, DD+ JOC .ec3 encode, mux with H.264/H.265 into MP4, ADM BWF master at 48 kHz/24-bit. Accessed 2026-06-06. https://docs.hybrik.com/tutorials/dolby_atmos/
  10. Dolby Professional, "TIDAL with Dolby AC-4 & Dolby Atmos" — AC-4 delivery of Atmos Music with binaural headphone rendering on the HiFi tier. Accessed 2026-06-06. https://professional.dolby.com/music/tidal-ac4-atmos/
  11. Apple Support, "About Spatial Audio with Dolby Atmos in Apple Music" and "Experience Spatial Audio on Apple TV 4K" — Spatial Audio as on-device HRTF rendering with AirPods head tracking, not a delivery codec. Accessed 2026-06-06. https://support.apple.com/en-us/109354
  12. Fraunhofer IIS / ISO-IEC 23008-3 (MPEG-H 3D Audio) and Sony 360 Reality Audio — object/scene-based immersive, mhm1 codec string, ATSC 3.0 audio. Standards source (tier 1) for the ISO reference. https://www.audioholics.com/news/sony-360-reality-audio-vs.-dolby-atmos-music-the-new-format-wars

Note on source tiers: where a vendor or secondary source (reference 8, and the secondary article linked in reference 12) reported a number that touches a protocol or format fact, it was checked against the controlling standards document (ETSI TS 102 366 / TS 103 190 / ISO-IEC 23008-3) and Dolby's own delivery kits before inclusion. Internet sources for bitrate figures are corroborated by Dolby's published JOC model; the 768 kbps Netflix ceiling is a service-specific operating point, not a standard.