MPEG-H 3D Audio: The ISO Open Immersive Standard

Why this matters

If you build an OTT service, an Internet-TV product, an immersive-music app, or anything that sends sound to a modern smart TV or a pair of headphones, MPEG-H is one of two audio systems the new broadcast era runs on, and a product decision about it is a decision about your whole delivery pipeline. This article is the deep dive, written for a product manager, founder, or operations lead who already knows roughly what MPEG-H does — covered in MPEG-H 3D Audio explained — and now needs to understand how it actually works well enough to scope an integration, brief an engineering team, or evaluate a vendor's claim. A senior engineer will read it too and must find it accurate; every technical number here traces to the controlling standard — ISO/IEC 23008-3 and ATSC A/342 Part 3 — not to a secondhand summary.

A quick reset: what "3D audio system" means here

Before the machinery, one definition to hold onto, because the whole design follows from it. A traditional audio format is a recording: a fixed set of channels, frozen into one mix, that the player just decodes and plays. MPEG-H is not a recording. It is a scene — a description of the sounds in a program plus instructions for how to assemble them — that the player renders fresh for whatever speakers or headphones the listener actually owns.

That word "renders" is the heart of it. A renderer, in this context, is the part of the decoder that takes the described sounds and decides where they go for your specific setup. The same scene becomes a 7.1.4 home-cinema mix on one device, a stereo headphone mix with simulated height on another, and a soundbar mix on a third — without re-encoding, because the assembly happens at playback. Compare this to older codecs covered in a short history of audio codecs, where the mix was baked in at the studio and the player had no say.

MPEG-H is published by the ISO/IEC Moving Picture Experts Group as ISO/IEC 23008-3, "High efficiency coding and media delivery — Part 3: 3D audio." The edition normatively referenced by the current US broadcast profile is ISO/IEC 23008-3:2022 (ATSC A/342-3:2025-10, §2.1). It carries three kinds of sound at once — fixed channels, movable objects, and a recorded sound field — and lets the broadcaster expose chosen controls to the viewer. That much is the MPEG-H 3D Audio explained story. Everything below is the layer beneath it.

The audio scene: how MPEG-H describes sound

The single most important structure in MPEG-H is the audio scene — the formal description of everything a program contains and everything the listener may do with it. It is carried in a block of static data the standard calls Metadata Audio Elements, abbreviated MAE (ISO/IEC 23008-3, Clause 15; ATSC A/342-3:2025-10, §4.2.1). MAE is the difference between a codec and a system. A codec compresses sound; MAE explains it.

MAE is organized as a small hierarchy, and the names are worth learning because every interactive feature traces back to them. At the top sits the AudioSceneInfo — the root that says "this is the whole program." Beneath it are three kinds of structure (ATSC A/342-3:2025-10, §4.2.1.1–4.2.1.3):

A Group collects element signals that should be treated as one unit. A stereo recording whose two channels must always move together is one group; a sub-mix of crowd microphones is another. Grouping is how an engineer says "these belong together, handle them as a single object."

A Switch Group collects elements that are mutually exclusive — exactly one is active at a time. This is the mechanism behind language selection: three commentary tracks (say, two English variants and one foreign-language feed) live in one switch group, and the decoder guarantees you hear precisely one. You cannot accidentally stack two languages on top of each other, because the structure forbids it.

A Preset is a saved, named combination of groups and objects with their own gains and positions — a one-tap experience. The standard's own examples are telling: a "Dialogue Enhancement" preset that raises the dialogue and lowers the background, and a "Live Mix" preset for sports with boosted ambience, an extra crowd object, and the commentary muted (ATSC A/342-3:2025-10, §A.4.1). Presets are why a viewer can pick "stadium only" without understanding any of the underlying objects.

[Figure: diagram of the MPEG-H audio scene hierarchy. An AudioSceneInfo root branches into Groups (for example music bed, sound effects, audio description), a Switch Group holding three mutually exclusive commentary languages, and two Presets named Default and Dialogue Enhancement that select and re-gain those elements.]

Figure 1. The MAE hierarchy. AudioSceneInfo at the root; Groups bundle related sounds; a Switch Group enforces "pick exactly one language"; Presets are the named one-tap experiences the viewer actually sees.

The crucial property: MAE describes not just what is in the scene but what the viewer may change and by how much. Each element carries flags — may the user adjust its gain, may they move its position — and the limits on those adjustments (ATSC A/342-3:2025-10, §A.4.1). The broadcaster authors the freedom; the decoder enforces it. This is why personalization never breaks the artistic intent: the mixer decides the commentary can rise by at most 12 dB, and no viewer slider can exceed that.
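
The hierarchy and the "broadcaster authors the freedom, decoder enforces it" rule can be sketched as a small data model. This is an illustration only, not the bitstream syntax of ISO/IEC 23008-3: the class and field names, the −6 dB lower limit, and the preset gains are assumptions chosen for readability; only the +12 dB ceiling comes from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A set of element signals handled as one unit (hypothetical model)."""
    group_id: int
    label: str
    allow_gain: bool = False   # may the viewer change this group's gain at all?
    min_gain_db: float = 0.0   # lower limit authored by the broadcaster
    max_gain_db: float = 0.0   # upper limit authored by the broadcaster


@dataclass
class SwitchGroup:
    """Mutually exclusive groups: exactly one member is active."""
    label: str
    member_ids: list[int]
    active_id: int

    def select(self, group_id: int) -> None:
        if group_id not in self.member_ids:
            raise ValueError(f"group {group_id} is not in {self.label!r}")
        # Overwriting the single active id is what keeps exactly one member on.
        self.active_id = group_id


@dataclass
class Preset:
    """A named one-tap combination: group id -> gain offset in dB."""
    label: str
    gains_db: dict[int, float]


@dataclass
class AudioSceneInfo:
    groups: dict[int, Group]
    switch_groups: list[SwitchGroup] = field(default_factory=list)
    presets: list[Preset] = field(default_factory=list)

    def clamp_user_gain(self, group_id: int, requested_db: float) -> float:
        """Return the gain a decoder would apply after enforcing the limits."""
        g = self.groups[group_id]
        if not g.allow_gain:
            return 0.0  # interactivity not granted, so the request is ignored
        return max(g.min_gain_db, min(g.max_gain_db, requested_db))


scene = AudioSceneInfo(
    groups={
        1: Group(1, "music and effects bed"),
        2: Group(2, "commentary EN", allow_gain=True, min_gain_db=-6.0, max_gain_db=12.0),
        3: Group(3, "commentary ES", allow_gain=True, min_gain_db=-6.0, max_gain_db=12.0),
    },
    switch_groups=[SwitchGroup("commentary language", [2, 3], active_id=2)],
    presets=[Preset("Dialogue Enhancement", {1: -6.0, 2: 6.0})],
)

applied = scene.clamp_user_gain(2, 20.0)  # viewer asks for +20 dB on commentary
locked = scene.clamp_user_gain(1, 5.0)    # the bed has no gain interactivity
scene.switch_groups[0].select(3)          # change commentary language
```

In this sketch the +20 dB request is clamped to the authored +12 dB ceiling, and the request on the bed is ignored because the broadcaster never granted gain interactivity for it.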

MHAS: the packet format that carries it all

A scene description and three kinds of sound need a container that can be cut, switched, and merged on the fly — at a broadcast splice point, at a channel change, mid-stream when the program layout changes. That container is the MPEG-H Audio Stream, abbreviated MHAS (ISO/IEC 23008-3, Clause 14; ATSC A/342-3:2025-10, §5.2.1). Think of MHAS as a train of labelled boxcars: each boxcar is a packet with a type, and the type tells any device along the line what is inside without having to decode the audio.

A handful of packet types carry the whole system. The configuration packet, PACTYP_MPEGH3DACFG, holds the decoder setup — channel layout, object count, core-codec parameters. The scene packet, PACTYP_AUDIOSCENEINFO, holds the MAE described above, and the spec requires it to sit directly after the config packet at every entry point (ATSC A/342-3:2025-10, §5.2.2.2). The audio itself rides in PACTYP_MPEGH3DAFRAME packets. When a viewer turns a knob, the result travels back into the stream as a PACTYP_USERINTERACTION packet that the decoder reads before rendering the next frame (ATSC A/342-3:2025-10, §A.4.3).

Two design choices make MHAS unusually flexible. First, every packet payload is byte-aligned, so a splicing device can find a boundary and cut without decoding the audio — essential for inserting a commercial at an exact video frame (ATSC A/342-3:2025-10, §4.2.3). Second, a special truncation packet, PACTYP_AUDIOTRUNCATION, lets the system shorten the last audio frame before a splice so the audio ends precisely where the video does, even though audio and video almost never use the same frame rate.
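
The mismatch between audio and video frame grids is easy to see with numbers. The sketch below computes how many samples of the last audio frame would have to be dropped so that audio ends on a video frame boundary. The 48 kHz sample rate, the 1024-sample audio frame, and the 29.97 fps video rate are assumptions for illustration, and the function says nothing about the actual field layout of the truncation packet.

```python
from fractions import Fraction

SAMPLE_RATE = 48_000                 # Hz (assumed)
FRAME_LEN = 1024                     # samples per audio frame (assumed)
VIDEO_FPS = Fraction(30_000, 1_001)  # 29.97 fps video (assumed)


def truncation_at_splice(video_frames: int) -> tuple[int, int, int]:
    """Return (whole_frames, kept_samples, dropped_samples) for a splice
    placed after `video_frames` complete video frames."""
    # Exact rational arithmetic avoids float drift at 29.97 fps.
    splice_samples = round(video_frames / VIDEO_FPS * SAMPLE_RATE)
    whole_frames, kept = divmod(splice_samples, FRAME_LEN)
    # Zero when the splice happens to land exactly on an audio frame edge.
    dropped = (FRAME_LEN - kept) % FRAME_LEN
    return whole_frames, kept, dropped


result = truncation_at_splice(300)   # splice after 300 video frames (10.01 s)
print(result)                        # (469, 224, 800)
```

Under these assumptions a splice after 300 video frames falls 224 samples into the 470th audio frame, so the remaining 800 samples of that frame are the part that would be discarded.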

[Figure: diagram of an MHAS packet stream shown as a left-to-right train of labelled packets. A sync-sample group contains PACTYP_MPEGH3DACFG, then PACTYP_AUDIOSCENEINFO, then PACTYP_BUFFERINFO, then PACTYP_MPEGH3DAFRAME, followed by further audio-frame packets, with a PACTYP_USERINTERACTION packet entering from the viewer interface and a PACTYP_AUDIOTRUNCATION packet marking a splice point.]

Figure 2. An MHAS stream as a train of typed packets. The sync sample carries config and scene info so a decoder can tune in cold; user interaction and truncation packets are inserted live without re-encoding the audio.

Tuning in and switching without a glitch

A broadcast viewer changes channels constantly, and a streaming player switches bitrate whenever the network dips. Both demand that the decoder be able to start, or restart, cleanly mid-stream. MPEG-H handles this with the Random Access Point, or RAP — a "sync sample" that contains everything needed to cold-start the decoder: the config packet, the scene packet, a buffer-timing packet, and an audio frame, in that exact order (ATSC A/342-3:2025-10, §5.2.2.2). Land on a RAP and you can begin decoding from nothing. The standard even recommends a 100-millisecond fade-in on the first output buffer after a tune-in, so the sound arrives smoothly rather than with a click (ATSC A/342-3:2025-10, §A.3.1).
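
The ordering rule for a sync sample can be expressed as a simple check. This is a sketch under simplifying assumptions: packets are represented by their type names as plain strings rather than parsed from a real MHAS stream, and any other packet types a real stream might carry are ignored. The function names are hypothetical.

```python
from typing import Optional

# Required packet order at a random access point, as described above.
RAP_ORDER = (
    "PACTYP_MPEGH3DACFG",
    "PACTYP_AUDIOSCENEINFO",
    "PACTYP_BUFFERINFO",
    "PACTYP_MPEGH3DAFRAME",
)


def is_random_access_point(packet_types: list[str]) -> bool:
    """True if the sample starts with config, scene, buffer info, frame."""
    return tuple(packet_types[: len(RAP_ORDER)]) == RAP_ORDER


def first_rap_index(samples: list[list[str]]) -> Optional[int]:
    """Index of the first sample a decoder could cold-start from, if any."""
    for index, packet_types in enumerate(samples):
        if is_random_access_point(packet_types):
            return index
    return None


stream = [
    ["PACTYP_MPEGH3DAFRAME"],   # mid-stream frame: cannot start here
    ["PACTYP_MPEGH3DAFRAME"],
    list(RAP_ORDER),            # sync sample: decoding can begin here
    ["PACTYP_MPEGH3DAFRAME"],
]
tune_in_at = first_rap_index(stream)
```

A player that joins this stream at the first sample would skip ahead to index 2 before producing any output.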

For seamless bitrate adaptation (the audio counterpart of video rendition switching, covered in audio in HLS, DASH, CMAF), MPEG-H uses Immediate Play-out Frames, or IPFs. An IPF carries enough information about the previous frames that the decoder can jump into a different bitrate representation at a segment boundary with no audible seam (ATSC A/342-3:2025-10, §4.2.4). Placing IPFs at segment edges is what makes glitch-free adaptive streaming of MPEG-H possible.

Two streams, one program: hybrid broadcast-plus-broadband

Here is a capability no legacy broadcast codec has, and it is the clearest argument for why next-generation audio is a system and not just a better compressor. A single MPEG-H program can be split across two or more independent streams — one over the air, one or more over the internet — and the decoder merges them into one scene (ATSC A/342-3:2025-10, §4.2.2, §5.2.2.4).

The mechanism is the mhm2 sample entry (the multi-stream variant of the normal mhm1 entry). The main stream carries at least the default presentation and is flagged with mae_isMainStream = 1; auxiliary streams carry extra pieces — additional languages, an audio-description track — and are flagged 0. As long as the streams are time-aligned and their random access points line up, the decoder stitches them into a single audio scene at playback (ATSC A/342-3:2025-10, §5.2.2.4).
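
The merge conditions can be sketched as a validation step followed by a union of the scene elements. Everything here is a simplified model: the `Stream` class, its fields, and the rule that "random access points line up" means identical RAP timestamps are assumptions for illustration. Only the idea of exactly one main stream, flagged like mae_isMainStream, comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    name: str
    is_main: bool             # mirrors the mae_isMainStream flag
    group_labels: list[str]   # scene elements this stream contributes
    rap_times_ms: list[int]   # presentation times of its random access points


def merge_streams(streams: list[Stream]) -> list[str]:
    """Return the merged scene's group labels, main stream first."""
    mains = [s for s in streams if s.is_main]
    if len(mains) != 1:
        raise ValueError("exactly one stream must be flagged as main")
    main = mains[0]
    merged = list(main.group_labels)
    for s in streams:
        if s.is_main:
            continue
        # Simplified alignment rule (assumption): identical RAP timestamps.
        if s.rap_times_ms != main.rap_times_ms:
            raise ValueError(f"{s.name}: random access points do not line up")
        merged.extend(s.group_labels)
    return merged


broadcast = Stream("over the air", True, ["bed", "commentary EN", "commentary ES"], [0, 1000, 2000])
broadband = Stream("broadband", False, ["audio description"], [0, 1000, 2000])
merged_scene = merge_streams([broadcast, broadband])
```

The point of the check is that an auxiliary stream that cannot be aligned is rejected outright, leaving the main stream's default presentation playable on its own.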

The practical payoff is large. A broadcaster transmits the popular content — the program and its two main languages — over the scarce, expensive broadcast spectrum, and delivers the long tail — a rare language, a descriptive-audio track, a director's commentary — over broadband, where bandwidth is cheap. The viewer experiences one seamless program; the operator pays broadcast rates only for what most people watch. This is the kind of multi-track economics we explore in multi-language audio storage and CDN math, now solved at the codec layer.

The core coder: where the bits are actually saved

Underneath the scene model and the packet transport sits an ordinary job: compress the audio waveforms efficiently. MPEG-H's compression engine is built on the same family of math as AAC and the Unified Speech and Audio Coding system, USAC, using an enhanced Modified Discrete Cosine Transform — the MDCT — to turn sound into frequency ingredients and spend bits only where the ear notices (ISO/IEC 23008-3; the four ideas behind it are in how audio compression works). The core can handle up to 128 codec channels, which is what lets one decoder process a large bed plus many objects plus ambisonic components together.

When the scene includes a recorded sound field — Higher Order Ambisonics, or HOA, the natural fit for the head-turning audio of VR explored in Ambisonics, HRTF, and binaural rendering — the encoder decomposes that field into a set of predominant directional signals plus an ambience signal, and codes each with the USAC-derived core (ISO/IEC 23008-3). The renderer reverses the process for the listener's layout, including rotating the whole field to follow a head-tracked headphone.

The point worth keeping: the core coder is competent, but it is not the differentiator. AAC and AC-4 have competent cores too. MPEG-H's advantage is everything stacked on top of the core — the scene, the metadata, the interactivity — which is why the article spends most of its words there.

Profiles and levels: what a real TV chip implements

A flexible system needs guard rails, or a cheap television and a premium receiver could not both honestly claim to support "MPEG-H." Those guard rails are profiles (which tools must be implemented) and levels (how much sound a decoder must handle). The full toolkit is the Main profile with five levels, topping out at 64 loudspeaker channels and 128 core channels — enough for a 22.2 theatre and far past anything a home needs (ISO/IEC 23008-3). But almost nobody implements Main in a consumer device.

For broadcast and streaming, two cut-down profiles matter. The Low Complexity (LC) profile, added in Amendment 3 (late 2016), keeps the channel, object, and HOA tools while cutting decoder complexity by roughly half versus the fuller toolkit — cheap enough to put in a TV system-on-chip (Fraunhofer IIS; ISO/IEC 23008-3). The Baseline (BL) profile, promoted to final status in 2020, is a subset of LC tuned for broadcast, streaming, and immersive music; it carries channels and objects, simplifies metadata handling, and drops the advanced HOA processing (Fraunhofer IIS; audioXpress, 2020).

What ships in ATSC 3.0 today is concrete: a 2025 revision of the US broadcast standard added the Baseline profile alongside LC, and both are restricted to Levels 1, 2, or 3 only; Main's higher levels are not used on the air (ATSC A/342-3:2025-10, §5.1). Two real ceilings follow from that choice. LC Profile Level 3 limits the decoder's loudspeaker output to 12 channels, which is exactly enough for 7.1.4 (twelve speakers), the practical immersive home target (ATSC A/342-3:2025-10, §A.2.1). And the broadcast maximum bitrate is capped at 1,200 kbps, rising to 1,540 or 2,400 kbps only in specific higher-channel-count configurations (ATSC A/342-3:2025-10, §5.2.1).
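
The 12-channel ceiling is easy to check against a target layout. The sketch below counts loudspeakers from the usual "ear-level.LFE.height" layout name and compares the total with the LC Level 3 output limit quoted above. The helper names are hypothetical, and the check covers only the loudspeaker count, not any other level constraint.

```python
LC_LEVEL3_MAX_SPEAKERS = 12  # loudspeaker output limit quoted in the text


def speaker_count(layout: str) -> int:
    """Count loudspeakers in a layout name such as '5.1' or '7.1.4'."""
    return sum(int(part) for part in layout.split("."))


def fits_lc_level3(layout: str) -> bool:
    """True if the layout's speaker count is within the Level 3 limit."""
    return speaker_count(layout) <= LC_LEVEL3_MAX_SPEAKERS


checks = {layout: fits_lc_level3(layout) for layout in ("2.0", "5.1", "7.1.4", "22.2")}
print(checks)  # {'2.0': True, '5.1': True, '7.1.4': True, '22.2': False}
```

A 7.1.4 layout sums to exactly twelve, while 22.2 sums to twenty-four and falls outside what a Level 3 decoder is required to drive.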

Profile	Where it ships	Tools carried	Used in ATSC 3.0
Main (5 levels)	Reference / studio	Channels + objects + full HOA, up to 64 speakers	No
Low Complexity (LC)	Broadcast TV chips	Channels + objects + HOA, ~50% lower complexity	Yes (Levels 1–3)
Baseline (BL)	TVs, music devices	Channels + objects, simplified metadata, no advanced HOA	Yes (Levels 1–3)

Table 1. The three MPEG-H profiles that matter in practice. Consumer broadcast and music run LC or Baseline at Levels 1–3; the full Main profile is a studio and reference target. Source: ISO/IEC 23008-3; ATSC A/342-3:2025-10, §5.1; Fraunhofer IIS.

A worked example: why an object beats a second mix

Personalization has a bandwidth cost, and the number is easy to compute — which matters when you are budgeting a broadcast multiplex. Suppose a sports program wants a 5.1 stadium bed plus two commentary languages. The legacy approach ships two complete 5.1 mixes and lets the player pick one:

two full 5.1 mixes = 2 × 192 kbps = 384 kbps

The MPEG-H approach ships one 5.1 bed and adds each language as a mono commentary object inside a switch group:

one 5.1 bed + two mono objects = 192 kbps + (2 × 48 kbps) = 288 kbps

That is 96 kbps saved, a 25% reduction, and the gap widens with every extra language, because each new option is one small object rather than a whole surround mix. Five languages the brute-force way would be 5 × 192 = 960 kbps; the object way is 192 + (5 × 48) = 432 kbps, a 55% saving. The exact figures depend on the encoder, but the structure is the lesson: both approaches grow linearly with the number of languages, but objects add about 48 kbps per language where duplicated mixes add 192 kbps.
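
The same arithmetic generalizes to any number of languages. The sketch below uses the illustrative rates from the example (192 kbps for a 5.1 bed, 48 kbps for a mono object); these are the article's working figures, not limits from the standard, and the function names are hypothetical.

```python
BED_KBPS = 192     # one 5.1 mix or bed (illustrative figure from the example)
OBJECT_KBPS = 48   # one mono commentary object (illustrative figure)


def duplicated_mixes_kbps(languages: int) -> int:
    """Legacy approach: one complete 5.1 mix per language."""
    return languages * BED_KBPS


def bed_plus_objects_kbps(languages: int) -> int:
    """Object approach: one shared bed plus one mono object per language."""
    return BED_KBPS + languages * OBJECT_KBPS


def saving_percent(languages: int) -> float:
    legacy = duplicated_mixes_kbps(languages)
    return 100 * (legacy - bed_plus_objects_kbps(languages)) / legacy


for n in (2, 3, 5):
    print(n, duplicated_mixes_kbps(n), bed_plus_objects_kbps(n), saving_percent(n))
```

With one language the object approach is actually the larger of the two (240 kbps against 192 kbps), so the saving only appears once a second option is offered.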

Loudness and dynamic range: the part broadcasters insist on

A system that lets viewers turn knobs has a regulatory problem: if you raise the dialogue, the program gets louder, and loudness rules — the CALM Act in the US, the equivalents elsewhere, all built on the measurement explained in loudness, peak, RMS, LUFS — do not allow that. MPEG-H solves it by baking loudness control into the format rather than leaving it to the broadcaster's playout chain.

MPEG-H inherits its loudness and dynamic-range machinery from MPEG-D DRC, formally ISO/IEC 23003-4:2025 (ATSC A/342-3:2025-10, §4.2.5, §2.1). The stream carries loudness metadata measured per ITU-R BS.1770 — the same algorithm behind loudness normalization with EBU R128 and ATSC A/85 — and a set of dynamic-range-control profiles the device can pick from based on whether it is an AV receiver, a TV speaker, or a phone (ATSC A/342-3:2025-10, §A.5). The decoder's loudness normalization is expected to be on at all times.

The clever part is loudness compensation. Whenever the broadcaster allows gain interactivity on an element, the stream must also carry compensation data, and the decoder applies it automatically: as the viewer raises the dialogue object, the system quietly trims the overall level so the program's measured loudness stays put (ATSC A/342-3:2025-10, §4.2.6, §5.3). The viewer gets clearer dialogue; the regulator gets a compliant program. The standard pins the target-loudness range that DRC sets must cover at −31 dB to 0 dB, continuously (ATSC A/342-3:2025-10, §5.3). DRC also gives broadcasters time-varying ducking — automatically lowering selected elements under a voice-over or an audio-description narration, a feature directly relevant to the accessibility stack.
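
The idea behind loudness compensation can be shown with a toy calculation. This is a deliberate simplification: it treats the program level as the power sum of uncorrelated elements and ignores the BS.1770 weighting and gating, and it computes the trim on the fly, whereas the text above describes compensation data carried in the stream. The element levels and function names are assumptions.

```python
import math


def total_level_db(levels_db: dict[str, float]) -> float:
    """Power-sum of element levels in dB (toy model, no BS.1770 weighting)."""
    return 10 * math.log10(sum(10 ** (level / 10) for level in levels_db.values()))


def compensation_db(levels_db: dict[str, float], user_gains_db: dict[str, float]) -> float:
    """Global trim that restores the original total after user gain changes."""
    adjusted = {k: v + user_gains_db.get(k, 0.0) for k, v in levels_db.items()}
    return total_level_db(levels_db) - total_level_db(adjusted)


elements = {"dialogue": -27.0, "bed": -27.0}   # assumed element levels
user = {"dialogue": 6.0}                       # viewer raises the dialogue by 6 dB

trim = compensation_db(elements, user)
after = {k: v + user.get(k, 0.0) + trim for k, v in elements.items()}

print(round(trim, 2))                          # negative: overall level is pulled down
print(round(total_level_db(elements), 2), round(total_level_db(after), 2))
```

In this model the dialogue still ends up 6 dB above the bed, but the combined level is unchanged, which is the behavior the compensation mechanism is meant to guarantee.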

Common mistake: confusing the scene with a render

The single most common error product teams make with MPEG-H is treating a binaural headphone output as if it were the master. It is not. The MPEG-H decoder can render the scene to headphones using head-related transfer functions — the HRTF data and binaural impulse responses fed in per ISO/IEC 23008-3, Clause 13 — and the standard scales that rendering by level: at most 2 binaural response pairs at Level 1, 6 at Level 2, 11 at Level 3 (ATSC A/342-3:2025-10, §A.2.2). That binaural mix is a derived render, computed on the device for one pair of ears. The deliverable, the thing you author and archive, is the scene: channels, objects, HOA, and MAE. Confusing the two leads teams to "master in binaural," which throws away the very flexibility they paid for. Author the scene; let the renderer make the binaural.

Where MPEG-H actually lives in 2026

MPEG-H is a broadcast-and-music system, and in 2026 the map is clearer than it has ever been. The anchor remains South Korea, where the TTA standard published in June 2016 named MPEG-H as the sole audio codec for terrestrial UHD, and broadcasters SBS, MBC, and KBS launched full-time ATSC 3.0 services on 31 May 2017 — the first next-generation audio codec on air 24/7 anywhere (TTA, 2016; Fraunhofer IIS; audioXpress, 2017). It carried the 2018 PyeongChang Winter Olympics in immersive sound and has run continuously since.

The bigger 2026 story is two new national commitments. In the United States, ATSC 3.0 ("NextGen TV") is voluntary but advancing; a revised national-deployment timeline now points to 2027, and the audio standard itself was revised again — A/342-3:2026-04 (April 2026) is the current version, after the October 2025 revision that added the Baseline profile (ATSC; ATSC A/342-3:2025-10). And Brazil formally adopted an ATSC-3.0-based system, branded DTV+ (TV 3.0), with MPEG-H Audio as a required component, targeting commercial readiness for the 2026 FIFA World Cup (ATSC; V-Nova; TV Tech, 2025). Brazil is the first national system to mandate MPEG-H audio alongside VVC video and LCEVC.

On the music side, Sony's 360 Reality Audio, launched 8 January 2019, is built on MPEG-H's object model and ships through streaming partners; Sony licensed an MPEG-H decoder for both ATSC 3.0 and DVB devices (Sony / Fraunhofer IIS, 2021). The takeaway for a product team: if you ship to Korean or Brazilian broadcast, to ATSC 3.0 receivers, to immersive-music apps, or to the newest smart TVs, MPEG-H is a format you will meet — and now you know what is moving underneath the surface.

Where Fora Soft fits in

We build OTT and Internet-TV platforms, video streaming back-ends, and conferencing and telemedicine systems, and the audio layer is where products quietly succeed or fail. For clients eyeing next-generation broadcast or immersive-music delivery, the work is rarely writing a codec — it is the integration around it: packaging MPEG-H into the right container, wiring the personalization controls into a player UI, getting loudness compensation honored end to end, and handling the hybrid broadcast-plus-broadband merge so a rare-language track arrives cleanly. We have shipped streaming and conferencing audio pipelines since 2005, and the lessons from object-based, metadata-driven audio carry directly into the spatial-conferencing and accessibility features clients increasingly ask for.

Call to action

Talk to an audio engineer: book a 30-minute scoping call to talk through your MPEG-H 3D Audio plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the MPEG-H 3D Audio: System Cheat Sheet — One page: the MAE scene model (AudioSceneInfo, Group, Switch Group, Preset), the five key MHAS packet types, the three profiles with their ATSC 3.0 level limits, and six facts including the 2026 deployment map (Korea, Brazil DTV+, Sony….

References

ISO/IEC 23008-3:2022 — Information technology — High efficiency coding and media delivery in heterogeneous environments — Part 3: 3D audio. International Organization for Standardization / IEC. The controlling specification for MPEG-H 3D Audio: scene model (Clause 15), MHAS transport (Clause 14), binaural rendering (Clause 13), profiles and levels (Clause 4.8.2). Edition normatively referenced by ATSC A/342-3:2025-10. (Tier 1.)
ATSC A/342-3:2025-10 — ATSC Standard: A/342 Part 3, "MPEG-H System." Advanced Television Systems Committee, 17 October 2025. The revision that added the Baseline profile to ATSC 3.0; source for MAE hierarchy (§4.2.1), MHAS packet constraints (§5.2), RAP/IPF behavior (§5.2.2.2, §4.2.4), multi-stream mhm2 delivery (§5.2.2.4), loudness/DRC signaling (§5.3), profile/level restrictions (§5.1), binaural levels and tune-in fade-in (Annex A). Superseded by A/342-3:2026-04 (April 2026); both cited. (Tier 1.)
ISO/IEC 23003-4:2025 — MPEG audio technologies — Part 4: Dynamic Range Control. ISO/IEC. The MPEG-D DRC framework MPEG-H inherits for loudness metadata and dynamic-range control; referenced normatively by ATSC A/342-3:2025-10 §2.1. (Tier 1.)
ITU-R BS.1770-5 — Algorithms to measure audio programme loudness and true-peak audio level. ITU-R, 2023. The loudness-measurement algorithm whose metadata MPEG-H carries; the standard against which broadcast loudness is judged. (Tier 1.)
ATSC A/342-1:2025-07 — ATSC Standard: A/342 Part 1, "Audio Common Elements." ATSC, 17 July 2025. Defines the Audio Program / Presentation / Component model and Audio Description that MAE maps onto. (Tier 1.)
Bleidt, R. et al., Development of the MPEG-H TV Audio System for ATSC 3.0. IEEE Transactions on Broadcasting, 2017. Primary engineering account of the system from its Fraunhofer architects; basis for the Korea deployment and core-coder details. (Tier 5.)
Fraunhofer IIS — MPEG-H Audio technology pages and Audio Blog (Baseline profile licensing; "World's 1st Terrestrial UHD TV Service With MPEG-H Audio," Korea). Maker of the technology; source for profile complexity figures and deployment dates. (Tier 3.)
audioXpress — "MPEG-H TV Audio On The Air as South Korea Launches UHD TV Services Based on ATSC 3.0" (2017) and "Fraunhofer IIS Announces Licensing Efforts for New MPEG-H 3D Audio Baseline Profile" (2020). Deployment and profile-status reporting. (Tier 4.)
ATSC — "Brazil Officially Adopts ATSC 3.0 Technologies For Its Next-Generation Television System"; TV Tech / V-Nova reporting on TV 3.0 / DTV+. Source for Brazil's MPEG-H mandate and the 2026 FIFA World Cup target. (Tier 4.)
Sony / Fraunhofer IIS — "Sony licenses MPEG-H 3D audio decoder software for DVB and ATSC 3.0 broadcasts" (2021); 360 Reality Audio launch (8 January 2019). Source for the music deployment. (Tier 4.)

Note on conflicts (per the spec hierarchy): where popular write-ups describe MPEG-H purely as "a codec," this article follows ISO/IEC 23008-3 and ATSC A/342-3 in treating it as a system (codec + MAE + MHAS + DRC), since the normative text spends most of its clauses on the non-codec layers. Where secondary sources gave a single "max bitrate," the article uses the specific tiered caps in ATSC A/342-3:2025-10 §5.2.1 (1,200 / 1,540 / 2,400 kbps by core-channel count).

Related glossary terms

Loudness

MPEG-H 3D Audio

Channel

Bitrate

Binaural rendering

ITU-R BS.1770

Dynamic range

Audio frame