Why this matters
If you ship a streaming service, an OTT app, a conferencing tool, or any product that plays media, the container is where your audio plumbing lives — and it is the layer most teams never look at until something breaks. The wrong audio track plays, a language is missing, the iOS app picks 5.1 when the TV only wants stereo, or the file plays perfectly in VLC and silently fails in Safari. Almost every one of those bugs is a container decision, not a codec decision. This article gives a product manager or developer the vocabulary to reason about audio tracks, segments, and PIDs well enough to read a manifest, talk to an engineer, and choose the right container before the bug report arrives.
First, the one distinction that explains everything: codec vs container
Almost every container confusion traces back to one mix-up, so we fix it first. The codec is the method that compressed the audio — it decides how the sound becomes a small stream of bytes. Opus, AAC, AC-3, and FLAC are codecs. The container is the file format that wraps those compressed bytes together with the instructions a player needs: which tracks exist, what each one is, how long they run, and how to keep them in sync. MP4, Matroska, and MPEG-TS are containers.
The everyday analogy is a shipping box. The codec is how the contents were packed — vacuum-sealed, shrink-wrapped, loose. The container is the cardboard box with the shipping label on the outside: what's inside, where it goes, how heavy it is. You can put the same vacuum-sealed item in a small box or a big box; you can put many different items in one box. The label, not the packing method, tells the courier how to handle it.
This is why "MP4 audio" is a slightly wrong phrase that you will hear constantly. MP4 does not have a sound of its own. An MP4 file carries an audio track, and that track is almost always AAC, but it could be Opus, AC-3, or others. When someone says "the audio is MP4", they mean "the AAC (or other) audio track is wrapped in an MP4 container". Holding codec and container apart is the single most useful habit in this whole topic.
Two ideas to keep from this section. First, the codec compresses the sound; the container labels and bundles it. Second, the same codec can live in several containers, and the same container can hold several codecs — they vary independently.
Figure 1. The codec packs the sound; the container is the labelled box that tells the player what's inside and how to play it.
What a "track" is, and why it is the key idea for audio
Inside any modern container, media is organised into tracks. A track is one continuous stream of one kind of media — one video picture stream, or one audio stream, or one subtitle stream — described by its own block of metadata. The metadata on an audio track is small but decisive: the codec it uses, the sample rate (how many times per second the sound was measured, usually 48,000 for video — see sample rate), the channel layout (mono, stereo, or 5.1 surround), and a language tag.
Because each audio track is self-describing, one file can carry many of them. A single movie file routinely holds an English stereo track, an English 5.1 track, a Spanish dub, a French dub, and a director's commentary — five audio tracks, each tagged with its language and layout, all sharing one video track. The player reads the track list, shows you the menu, and switches between them without touching the picture. That switchable-track model is the same in MP4, Matroska, and MPEG-TS; only the on-disk machinery differs. Everything else in this article is just four answers to the question "how does this container store and label its audio tracks?"
MP4: the universal file, and how it stores audio in boxes
MP4 is the format almost every device on earth can play, and it is built from a simple, repeating unit called a box (the specification calls it a box; older tools call it an "atom" — same thing). A box is a labelled container-within-the-container: it has a size, a four-letter type code, and a payload that is either data or more boxes nested inside it. The whole file is a tree of these boxes. MP4 is one profile of a broader standard, the ISO Base Media File Format (ISOBMFF), defined in ISO/IEC 14496-12; the MP4-specific rules live in ISO/IEC 14496-14.
Three top-level boxes do the heavy lifting. The ftyp box at the very front is the label on the box — it declares which format rules the file follows, so a player knows immediately whether it can read the file. The moov box is the table of contents — it holds all the metadata for every track but none of the actual audio or video. The mdat box is the cargo hold — it holds the raw compressed samples with no labels of their own; the moov is what points into it.
For an audio track, the metadata sits in a precise nest of boxes inside moov. The path runs moov → trak → mdia → minf → stbl → stsd, and the audio-specific marker along the way is the smhd (sound media header) box, which is how a player knows this track is audio rather than video. The destination, the stsd (sample description) box, names the codec and carries its setup data. For an AAC track, the codec entry is mp4a, and tucked inside it is a small box called esds that holds the few bytes of AAC configuration (the AudioSpecificConfig) a decoder needs before it can play a single frame. Miss those bytes and the track is silent — a real and common packaging bug.
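The nesting described above can be walked in a few lines. The following is a minimal sketch, not a full parser: it assumes the whole file is already in memory, it reads only the size and type of each box, and the helper names (iter_boxes, find_path, audio_codecs, make_box) are illustrative rather than taken from any library. The demo bytes are synthetic placeholders, not a playable file.

```python
import struct

def iter_boxes(data, start=0, end=None):
    """Yield (type, payload_start, payload_end) for each box in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:  # a 64-bit size follows the type code
            if pos + 16 > end:
                break
            size = struct.unpack_from(">Q", data, pos + 8)[0]
            header = 16
        elif size == 0:  # box runs to the end of the enclosing span
            size = end - pos
        if size < header or pos + size > end:
            break  # malformed or truncated box; stop rather than guess
        yield box_type, pos + header, pos + size
        pos += size

def find_path(data, path, start=0, end=None):
    """Return the payload span of the first box matching the nested path, else None."""
    for box_type, p_start, p_end in iter_boxes(data, start, end):
        if box_type == path[0]:
            if len(path) == 1:
                return p_start, p_end
            found = find_path(data, path[1:], p_start, p_end)
            if found is not None:
                return found
    return None

def audio_codecs(data):
    """Return the sample-entry code (e.g. b'mp4a') of every audio track."""
    codecs = []
    moov = find_path(data, [b"moov"])
    if moov is None:
        return codecs
    for box_type, t_start, t_end in iter_boxes(data, *moov):
        if box_type != b"trak":
            continue
        minf = find_path(data, [b"mdia", b"minf"], t_start, t_end)
        if minf is None or find_path(data, [b"smhd"], *minf) is None:
            continue  # no sound media header, so this is not an audio track
        stsd = find_path(data, [b"stbl", b"stsd"], *minf)
        # stsd payload: 4 bytes version/flags, 4 bytes entry count,
        # then the first entry's 4-byte size and 4-byte codec code.
        if stsd is not None and stsd[1] - stsd[0] >= 16:
            codecs.append(data[stsd[0] + 12 : stsd[0] + 16])
    return codecs

def make_box(box_type, payload=b""):
    """Demo scaffolding: wrap a payload in a box header."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

# Synthetic file with one audio track; the mp4a body is a placeholder.
entry = make_box(b"mp4a", b"\x00" * 28)
stsd = make_box(b"stsd", b"\x00" * 4 + struct.pack(">I", 1) + entry)
minf = make_box(b"minf", make_box(b"smhd", b"\x00" * 8) + make_box(b"stbl", stsd))
trak = make_box(b"trak", make_box(b"mdia", minf))
demo_file = (make_box(b"ftyp", b"isom" + b"\x00" * 4)
             + make_box(b"moov", trak)
             + make_box(b"mdat", b"\x00" * 16))
print(audio_codecs(demo_file))  # [b'mp4a']
```

A real inspection tool would go one level further and check that the mp4a entry actually contains an esds box, since that is the piece whose absence produces the silent track.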
The order of the two big boxes matters for streaming. In a plain download file the order is often [ftyp][mdat][moov] — the table of contents sits at the end. That is fine for a file you have fully downloaded, but a browser trying to play while downloading can't start until it has the moov, so it would have to fetch the whole file first. The fix is fast-start (also called "web optimisation" or "moov atom relocation"): the moov box is moved to the front, giving [ftyp][moov][mdat], so the player gets the table of contents immediately and can begin playback after the first few seconds arrive. If your progressive-download videos "buffer forever before starting", an un-relocated moov is the first thing to check.
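Checking for an un-relocated moov only requires reading the top-level box headers in order. The sketch below assumes the bytes are already in memory (a production tool would seek through the file instead of loading it); the function names are illustrative and the demo inputs are synthetic boxes filled with zeros.

```python
import struct

def top_level_box_types(data):
    """Return the four-letter type codes of the top-level boxes, in file order."""
    types = []
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:  # a 64-bit size follows the type code
            if pos + 16 > len(data):
                break
            size = struct.unpack_from(">Q", data, pos + 8)[0]
            header = 16
        elif size == 0:  # box runs to the end of the file
            size = len(data) - pos
        if size < header:
            break  # malformed size; stop rather than loop forever
        types.append(box_type.decode("latin-1"))
        pos += size
    return types

def is_fast_start(data):
    """True when the moov box appears before the first mdat box."""
    types = top_level_box_types(data)
    if "moov" not in types or "mdat" not in types:
        return False
    return types.index("moov") < types.index("mdat")

def make_box(box_type, payload_len):
    """Demo scaffolding: a box of the given type with a zero-filled payload."""
    return struct.pack(">I4s", 8 + payload_len, box_type) + b"\x00" * payload_len

download_order = make_box(b"ftyp", 8) + make_box(b"mdat", 32) + make_box(b"moov", 16)
fast_start = make_box(b"ftyp", 8) + make_box(b"moov", 16) + make_box(b"mdat", 32)
print(is_fast_start(download_order))  # False
print(is_fast_start(fast_start))      # True
```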
Figure 2. The MP4 box tree for an audio track: moov is the table of contents, mdat is the cargo hold, and the codec setup lives in stsd → mp4a → esds.
Multiple audio tracks in one MP4
Yes, an MP4 can hold many audio tracks — this is the answer to one of the most-asked questions about the format. Each language or mix is simply its own trak box inside moov, each with its own codec entry, channel layout, and language tag. The classic mistake is not whether MP4 can hold them, but whether the player exposes them: some progressive-download players and some browsers only present the first audio track, so a French viewer never sees the French option even though it is in the file. For reliable multi-track audio in a browser, the answer is usually streaming with a manifest (covered next), not a single progressive MP4.
fMP4 and CMAF: the same boxes, cut into segments for streaming
A plain MP4 is one long file, which is wrong for streaming, where a player wants to fetch a few seconds at a time and switch quality on the fly. Fragmented MP4 (fMP4) solves this by taking the exact same box format and splitting it into pieces. Instead of one giant moov followed by one giant mdat, an fMP4 stream begins with a small initialisation segment (the ftyp plus a moov that describes the tracks but contains no media), followed by a series of media segments, each a moof (movie fragment) box paired with its own small mdat. The moof is a mini table of contents for just that segment's samples; the mdat holds just that segment's audio and video.
This segmentation is what makes adaptive streaming work, and audio rides along inside it. The audio for each few-second segment is one moof+mdat pair, addressable on its own, so a player can request audio segments independently of video segments and switch audio tracks at a segment boundary. CMAF (the Common Media Application Format, ISO/IEC 23000-19) standardises this single fMP4 segment shape so the same audio and video segments can serve both HLS and DASH players. Before CMAF, a service often stored MPEG-TS segments for Apple's HLS and separate MP4 segments for DASH: two copies of everything. CMAF lets one set of fMP4 segments feed both, which roughly halves packaging work and segment storage because the second copy is no longer needed. How a player actually chooses among the audio segments it's offered is a streaming-layer question we cover in audio in HLS, DASH, CMAF.
Matroska and WebM: the flexible open container
Matroska (the .mkv file) is an open container built to hold an essentially unlimited number of tracks of any type, with rich metadata, chapters, and attachments. Its byte format is EBML (Extensible Binary Meta Language), a binary cousin of XML — a tree of named, length-prefixed elements, much like MP4's boxes but more general. As of 2024 Matroska is specified in an IETF document, RFC 9559, which moved the format from a community spec to a published standard.
Audio in Matroska lives in a TrackEntry element, and the field that names the codec is the CodecID, a text string. The mappings are readable on sight: A_OPUS for Opus, A_VORBIS for Vorbis, A_AAC for AAC, A_FLAC for FLAC, A_AC3 for Dolby Digital. The audio properties — sample rate, channel count, bit depth — sit in an Audio sub-element on the same track. One subtlety worth knowing: AAC stored in Matroska is stripped of its ADTS framing headers and muxed as raw frames, because Matroska supplies its own framing — a detail that bites you when a tool extracts the AAC and gets a stream a naive decoder can't parse without re-adding headers.
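To make the "re-adding headers" step concrete, here is a hedged sketch of building the 7-byte ADTS header (the variant without a CRC) for one raw AAC frame. The function name and defaults are illustrative. In a real tool the sample rate, channel configuration, and audio object type would come from the Matroska track's metadata rather than from defaults, and the sketch assumes plain AAC-LC-style streams that ADTS can describe.

```python
# Sampling-frequency index values used by ADTS (common rates only).
SAMPLE_RATE_INDEX = {
    96000: 0, 88200: 1, 64000: 2, 48000: 3, 44100: 4, 32000: 5,
    24000: 6, 22050: 7, 16000: 8, 12000: 9, 11025: 10, 8000: 11,
}

def adts_header(payload_len, sample_rate=48000, channel_config=2, audio_object_type=2):
    """Build a 7-byte ADTS header for one raw AAC frame of payload_len bytes.

    channel_config: 1 = mono, 2 = stereo, 6 = 5.1.
    audio_object_type: 2 = AAC-LC (ADTS stores this value minus one).
    """
    if sample_rate not in SAMPLE_RATE_INDEX:
        raise ValueError("unsupported sample rate")
    if not 1 <= channel_config <= 7:
        raise ValueError("channel_config must be 1..7")
    if not 1 <= audio_object_type <= 4:
        raise ValueError("ADTS can only signal object types 1..4")
    frame_len = payload_len + 7  # the length field counts the header too
    if frame_len >= 1 << 13:
        raise ValueError("frame too large for the 13-bit length field")
    freq_idx = SAMPLE_RATE_INDEX[sample_rate]
    profile = audio_object_type - 1
    return bytes([
        0xFF,                                         # syncword, high 8 bits
        0xF1,                                         # syncword low 4 bits, MPEG-4, layer 0, no CRC
        (profile << 6) | (freq_idx << 2) | (channel_config >> 2),
        ((channel_config & 0x3) << 6) | (frame_len >> 11),
        (frame_len >> 3) & 0xFF,
        ((frame_len & 0x7) << 5) | 0x1F,              # buffer fullness 0x7FF, high 5 bits
        0xFC,                                         # buffer fullness low 6 bits, one raw data block
    ])

raw_frame = b"\x00" * 100  # placeholder bytes, not real AAC data
adts_frame = adts_header(len(raw_frame)) + raw_frame
print(adts_frame[:7].hex())  # fff14c800d7ffc
```

Each extracted frame gets its own header, because the length field differs from frame to frame.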
WebM is a strict subset of Matroska designed for the open web. It deliberately allows only royalty-free codecs: VP8/VP9/AV1 for video, and for audio the spec says the CodecID SHOULD be A_VORBIS or A_OPUS. In practice WebM audio in 2026 is almost always Opus (see the Opus codec). If you serve <video> in HTML with a WebM file, your audio is Opus inside a Matroska-subset container — three independent facts that people routinely collapse into "it's a WebM".
MPEG-TS: the broadcast container that thinks in tiny packets
MPEG-TS (the MPEG-2 Transport Stream, ISO/IEC 13818-1) is the oldest container here and still the most widely deployed in broadcast, cable, satellite, and a great deal of legacy HLS. It was designed for a hostile world — a one-way broadcast channel where bits get lost and a receiver might tune in at any moment — so it works completely differently from the file-tree containers above. Instead of one big file with a table of contents at the front, MPEG-TS is a relentless stream of small, fixed 188-byte packets, each stamped with a number that says which stream it belongs to.
That number is the PID (Packet Identifier), and it is the key to how MPEG-TS handles audio. Each elementary stream — one video stream, one audio stream — gets its own PID, and its compressed data is chopped into PES (Packetised Elementary Stream) units that are then sliced across the 188-byte transport packets carrying that PID. A receiver finds the audio it wants by filtering for the right PID and reassembling the PES from the packets. Because the format must let a receiver join mid-stream, the map of "which PID is which" is broadcast over and over in a small table called the PMT (Program Map Table). The PMT lists each stream's PID and a stream_type code that names the codec: 0x0F for AAC (in ADTS framing), 0x11 for AAC in LATM framing, 0x81 for AC-3 (the ATSC convention), and 0x03/0x04 for the older MPEG-1/2 audio layers. Multiple audio languages are simply multiple audio PIDs listed in the PMT, each with a language descriptor.
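The PID filter described above comes down to reading three bytes of each packet. The following is a minimal sketch under simplifying assumptions: the input is already aligned on 188-byte packet boundaries, adaptation fields and PES reassembly are ignored, and the helper names and the demo PID values are made up for illustration. The stream_type table repeats only the codes named in the text.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47

# stream_type codes mentioned in the text (not the full registry).
STREAM_TYPES = {
    0x03: "MPEG-1 audio",
    0x04: "MPEG-2 audio",
    0x0F: "AAC (ADTS framing)",
    0x11: "AAC (LATM framing)",
    0x81: "AC-3 (ATSC convention)",
}

def parse_ts_header(packet):
    """Decode the fixed 4-byte header of one transport packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not an aligned 188-byte transport packet")
    return {
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],  # 13-bit packet identifier
        "payload_unit_start": bool(packet[1] & 0x40),  # a new PES or table starts here
        "continuity_counter": packet[3] & 0x0F,        # 4-bit per-PID counter
    }

def packets_for_pid(stream, wanted_pid):
    """Yield the packets in an aligned byte stream that carry wanted_pid."""
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[offset:offset + TS_PACKET_SIZE]
        if parse_ts_header(packet)["pid"] == wanted_pid:
            yield packet

def make_packet(pid, counter, start=False):
    """Demo scaffolding: a payload-only packet filled with 0xFF bytes."""
    header = bytes([
        SYNC_BYTE,
        (0x40 if start else 0x00) | ((pid >> 8) & 0x1F),
        pid & 0xFF,
        0x10 | (counter & 0x0F),  # 0x10 = payload only, no adaptation field
    ])
    return header + b"\xff" * (TS_PACKET_SIZE - 4)

# Hypothetical mux: video on PID 0x100, audio on PID 0x101.
stream = make_packet(0x100, 0, start=True) + make_packet(0x101, 0, start=True) + make_packet(0x101, 1)
audio_packets = list(packets_for_pid(stream, 0x101))
print(len(audio_packets))                  # 2
print(parse_ts_header(audio_packets[0]))
print(STREAM_TYPES[0x0F])                  # AAC (ADTS framing)
```

A real receiver would first read the PMT to learn which PID carries which stream_type, then apply a filter like this to the audio PID it selected.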
The practical takeaway: MPEG-TS trades efficiency for robustness. Repeating the PMT and stamping every 188-byte packet costs overhead a file container avoids, but it means a receiver can recover from loss and start playback from any point — which is exactly what broadcast and old-style HLS need. The cost is why the industry is migrating HLS to CMAF fMP4 wherever the player base allows.
Figure 3. MPEG-TS carries audio as a PID: the PMT maps each PID to a codec stream_type, and the audio PES is sliced across 188-byte packets so a receiver can join mid-stream.
A worked example: how much does a second audio track actually add?
A common worry is that adding audio tracks bloats the file. Let's size it with real arithmetic. Take a 90-minute feature, 48 kHz audio, and add one extra stereo AAC commentary track at a typical 128 kbps:
bitrate = 128 kbps = 128,000 bits per second
duration = 90 min = 5,400 seconds
size (bits) = 128,000 × 5,400 = 691,200,000 bits
size (bytes) = 691,200,000 ÷ 8 = 86,400,000 bytes
size (MB) = 86,400,000 ÷ 1,000,000 ≈ 86 MB
So a whole extra stereo language costs about 86 MB on a feature film — trivial next to a multi-gigabyte HD video track. A 5.1 AC-3 track at 384 kbps works out to three times that, roughly 259 MB, still small relative to the video. The container overhead to carry these tracks (the extra trak box, or the extra PID in TS) is negligible — kilobytes. This is why multi-language delivery is a track-and-storage decision, not a "can we afford it" decision; the real cost is at CDN scale across many titles, which we size in multi-language audio storage and CDN math.
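The same arithmetic as a small helper, for sizing other bitrates and runtimes. This is a sketch that assumes a constant bitrate and decimal megabytes (1 MB = 1,000,000 bytes), and it ignores container overhead, which the text above puts at kilobytes. The function name is illustrative.

```python
def track_size_mb(bitrate_kbps, duration_minutes):
    """Approximate size of a constant-bitrate audio track in decimal megabytes."""
    bits = bitrate_kbps * 1000 * duration_minutes * 60
    return bits / 8 / 1_000_000

print(track_size_mb(128, 90))  # 86.4  (stereo AAC commentary)
print(track_size_mb(384, 90))  # 259.2 (5.1 AC-3)
```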
A side-by-side comparison
| Criterion | MP4 / fMP4 | Matroska / WebM | MPEG-TS |
|---|---|---|---|
| Standard | ISO/IEC 14496-12 / -14; CMAF 23000-19 | RFC 9559 (Matroska); WebM subset | ISO/IEC 13818-1 |
| Structure | Box tree (moov/mdat) | EBML element tree (TrackEntry) | 188-byte packets with PIDs |
| How audio is named | stsd codec entry (mp4a + esds) | CodecID string (A_OPUS, A_AAC) | stream_type in the PMT |
| Typical audio codec | AAC; Opus in fMP4/WebM contexts | Opus (WebM), AAC/FLAC/AC-3 (MKV) | AAC, AC-3, MP2 |
| Multi-track audio | Yes (multiple trak) | Yes (many TrackEntry) | Yes (multiple audio PIDs) |
| Best at | Universal playback, streaming via CMAF | Flexibility, archival, open web | Broadcast, loss recovery, legacy HLS |
| Start-from-anywhere | Only at segment boundaries (fMP4) | Cluster boundaries | Any point (designed for it) |
A pitfall worth memorising: the silent-track packaging bug
The most common container-level audio failure is not a missing track — it is a present track that plays silence or refuses to decode, and it almost always comes from missing or wrong codec-setup metadata. In MP4 this is a missing or malformed esds (the AAC config bytes); in Matroska it is a missing CodecPrivate; in MPEG-TS it is a wrong stream_type or absent audio descriptor in the PMT. The file looks complete, plays in one tool that guesses well, and fails in a strict player like Safari or a hardware decoder that trusts the metadata literally. The lesson: when audio fails in one player but works in another, suspect the container's codec-setup box before you suspect the codec or the audio data itself. A remux that rebuilds the setup metadata fixes far more "broken audio" reports than a re-encode does.
Where Fora Soft fits in
Container choices surface in nearly every video product we build. In OTT and streaming work, the move to CMAF fMP4 means one set of audio segments serving HLS and DASH instead of two, which simplifies the packaging pipeline and the storage bill. In video conferencing and e-learning recording, the call is recorded into a container — often Matroska or MP4 — where the audio track's setup metadata and language tags have to be correct or downstream transcription and playback break. In surveillance and telemedicine, MPEG-TS and fMP4 both appear depending on whether the path is broadcast-style or web-style. Across all of them, the audio bugs that reach a user — wrong track, missing language, silent playback — are usually container decisions, which is why we treat packaging as a first-class part of the pipeline, not an afterthought.
What to read next
- PCM, WAV, AIFF, FLAC, ALAC: lossless formats explained — the raw and lossless formats that often sit inside these containers as masters.
- Frames, packets, granules: why audio is chunked — what the compressed samples inside mdat and PES actually look like.
- Audio in HLS, DASH, CMAF: how a streaming player picks an audio track — how the player chooses among the audio tracks these containers expose.
Call to action
- Talk to an audio engineer — book a 30-minute scoping call to talk through your audio container plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
- Download the Audio container cheat sheet — One-page reference: the four containers (MP4/fMP4, Matroska/WebM, MPEG-TS), how each one names its audio (stsd+esds, CodecID, stream_type), the MPEG-TS stream_type table, how an MP4 stores audio, and a silent-track packaging checklist.
References
- ISO/IEC 14496-12:2022, Information technology — Coding of audio-visual objects — Part 12: ISO base media file format (ISO/IEC, 2022). The controlling specification for the box structure (ftyp, moov, trak, mdia, minf, smhd, stbl, stsd), fragmented files (moof/mdat), and the audio sample-entry model used throughout this article. https://www.iso.org/standard/83102.html
- ISO/IEC 14496-14:2020, Information technology — Coding of audio-visual objects — Part 14: MP4 file format (ISO/IEC, 2020). The MP4-specific profile of ISOBMFF, including the mp4a audio sample entry and the esds box carrying the MPEG-4 ES_Descriptor / AudioSpecificConfig. https://www.iso.org/standard/79110.html
- ISO/IEC 23000-19:2024, Information technology — Multimedia application format (MPEG-A) — Part 19: Common media application format (CMAF) for segmented media (ISO/IEC, 2024). Defines the single fragmented-MP4 segment shape that lets one set of audio/video segments serve both HLS and DASH. https://www.iso.org/standard/85623.html
- RFC 9559, Matroska Media Container Format Specification (IETF, May 2024). The published specification for Matroska's EBML structure, TrackEntry, and CodecID-based codec mapping. https://datatracker.ietf.org/doc/rfc9559/
- Matroska Codec Mappings (Matroska.org / IETF CELLAR draft). The CodecID string table (A_OPUS, A_VORBIS, A_AAC, A_FLAC, A_AC3) and the note that AAC is stored without ADTS headers. https://www.matroska.org/technical/codec_specs.html
- WebM Container Guidelines (The WebM Project). States that for WebM the audio CodecID SHOULD be A_VORBIS or A_OPUS, constraining WebM audio to royalty-free codecs. https://www.webmproject.org/docs/container/
- ISO/IEC 13818-1:2023, Information technology — Generic coding of moving pictures and associated audio information — Part 1: Systems (ISO/IEC, 2023). The MPEG-2 Transport Stream specification: 188-byte packets, PIDs, PES, the PMT, and stream_type codes for audio. https://www.iso.org/standard/83239.html
- ATSC A/53 Part 3:2013, Service Multiplex and Transport Subsystem Characteristics (ATSC, August 2013). The North American convention that maps AC-3 audio to stream_type 0x81 in the MPEG-TS PMT, distinguished here from the base MPEG-2 registrations. https://www.atsc.org/wp-content/uploads/2015/03/A53-Part-3-2013.pdf
- Fun with Container Formats: Fragmented MP4 & CMAF (Bitmovin engineering blog, accessed 2026-06-05). First-party deployer explanation of fMP4 init segments and moof/mdat media segments used to corroborate the streaming-segment description; the ISO standards above are authoritative where they differ. https://bitmovin.com/fun-with-container-formats-2/


