Why this matters

If you ship a streaming service, an OTT app, a conferencing tool, or any product that plays media, the container is where your audio plumbing lives — and it is the layer most teams never look at until something breaks. The wrong audio track plays, a language is missing, the iOS app picks 5.1 when the TV only wants stereo, or the file plays perfectly in VLC and silently fails in Safari. Almost every one of those bugs is a container decision, not a codec decision. This article gives a product manager or developer the vocabulary to reason about audio tracks, segments, and PIDs well enough to read a manifest, talk to an engineer, and choose the right container before the bug report arrives.

First, the one distinction that explains everything: codec vs container

Almost every container confusion traces back to one mix-up, so we fix it first. The codec is the method that compressed the audio — it decides how the sound becomes a small stream of bytes. Opus, AAC, AC-3, and FLAC are codecs. The container is the file format that wraps those compressed bytes together with the instructions a player needs: which tracks exist, what each one is, how long they run, and how to keep them in sync. MP4, Matroska, and MPEG-TS are containers.

The everyday analogy is a shipping box. The codec is how the contents were packed — vacuum-sealed, shrink-wrapped, loose. The container is the cardboard box with the shipping label on the outside: what's inside, where it goes, how heavy it is. You can put the same vacuum-sealed item in a small box or a big box; you can put many different items in one box. The label, not the packing method, tells the courier how to handle it.

This is why "MP4 audio" is a slightly wrong phrase that you will hear constantly. MP4 does not have a sound of its own. An MP4 file carries an audio track, and that track is almost always AAC, but it could be Opus, AC-3, or others. When someone says "the audio is MP4", they mean "the AAC (or other) audio track is wrapped in an MP4 container". Holding codec and container apart is the single most useful habit in this whole topic.

Two ideas to keep from this section. First, the codec compresses the sound; the container labels and bundles it. Second, the same codec can live in several containers, and the same container can hold several codecs — they vary independently.

[Diagram: the codec, which compresses the audio bytes, shown beside the container, which wraps those bytes together with track metadata and timing, illustrated as a shipping box with a label.]
Figure 1. The codec packs the sound; the container is the labelled box that tells the player what's inside and how to play it.

What a "track" is, and why it is the key idea for audio

Inside any modern container, media is organised into tracks. A track is one continuous stream of one kind of media (one video picture stream, one audio stream, or one subtitle stream) described by its own block of metadata. The metadata on an audio track is small but decisive: the codec it uses, the sample rate (how many times per second the sound was measured, usually 48,000 for audio that accompanies video; see sample rate), the channel layout (mono, stereo, or 5.1 surround), and a language tag.

Because each audio track is self-describing, one file can carry many of them. A single movie file routinely holds an English stereo track, an English 5.1 track, a Spanish dub, a French dub, and a director's commentary — five audio tracks, each tagged with its language and layout, all sharing one video track. The player reads the track list, shows you the menu, and switches between them without touching the picture. That switchable-track model is the same in MP4, Matroska, and MPEG-TS; only the on-disk machinery differs. Everything else in this article is just four answers to the question "how does this container store and label its audio tracks?"
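To make the track model concrete, here is a minimal Python sketch. The AudioTrack class and the pick_track function are hypothetical names invented for illustration, not part of any container library; they only mirror the metadata fields described above, and the selection rule (richest layout the device can handle) is one plausible policy among several.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AudioTrack:
    """Hypothetical, container-agnostic view of one audio track's metadata."""
    codec: str        # e.g. "aac", "ac-3", "opus"
    sample_rate: int  # samples per second, e.g. 48000
    channels: int     # 1 = mono, 2 = stereo, 6 = 5.1 surround
    language: str     # language tag, e.g. "eng", "spa"


def pick_track(tracks: list[AudioTrack], language: str,
               max_channels: int) -> Optional[AudioTrack]:
    """Pick the richest layout in the requested language that the output
    device can handle; return None when nothing matches."""
    candidates = [t for t in tracks
                  if t.language == language and t.channels <= max_channels]
    return max(candidates, key=lambda t: t.channels, default=None)


# A subset of the movie described above: all tracks share one video track.
movie = [
    AudioTrack("aac", 48000, 2, "eng"),
    AudioTrack("ac-3", 48000, 6, "eng"),
    AudioTrack("aac", 48000, 2, "spa"),
    AudioTrack("aac", 48000, 2, "fra"),
]
```

With this list, pick_track(movie, "eng", 2) returns the stereo AAC track for a stereo-only TV, while pick_track(movie, "eng", 6) returns the 5.1 track. The "iOS app picks 5.1 when the TV only wants stereo" bug is this selection going wrong.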

MP4: the universal file, and how it stores audio in boxes

MP4 is the format almost every device on earth can play, and it is built from a simple, repeating unit called a box (the specification calls it a box; older tools call it an "atom" — same thing). A box is a labelled container-within-the-container: it has a size, a four-letter type code, and a payload that is either data or more boxes nested inside it. The whole file is a tree of these boxes. MP4 is one profile of a broader standard, the ISO Base Media File Format (ISOBMFF), defined in ISO/IEC 14496-12; the MP4-specific rules live in ISO/IEC 14496-14.

Three top-level boxes do the heavy lifting. The ftyp box at the very front is the label on the box — it declares which format rules the file follows, so a player knows immediately whether it can read the file. The moov box is the table of contents — it holds all the metadata for every track but none of the actual audio or video. The mdat box is the cargo hold — it holds the raw compressed samples with no labels of their own; the moov is what points into it.

For an audio track, the metadata sits in a precise nest of boxes inside moov. The path runs moov → trak → mdia → minf → stbl → stsd, and the audio-specific marker along the way is the smhd (sound media header) box, which is how a player knows this track is audio rather than video. The destination, the stsd (sample description) box, names the codec and carries its setup data. For an AAC track, the codec entry is mp4a, and tucked inside it is a small box called esds that holds the few bytes of AAC configuration (the AudioSpecificConfig) a decoder needs before it can play a single frame. Miss those bytes and the track is silent — a real and common packaging bug.
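The nesting can be walked in a few lines of Python. This is a minimal sketch that assumes the whole file is already in memory and is well formed; iter_boxes, find, and audio_sample_entries are illustrative names, not a library API. It relies only on the box header layout (a 32-bit big-endian size and a four-character type, where size 1 signals a 64-bit size and size 0 means "to the end") and on the stsd payload starting with four bytes of version and flags followed by a four-byte entry count.

```python
import struct
from typing import Iterator, Optional

Span = tuple[int, int]  # (payload_start, payload_end) offsets into the buffer


def iter_boxes(buf: bytes, start: int, end: int) -> Iterator[tuple[str, int, int]]:
    """Yield (type, payload_start, payload_end) for each box in buf[start:end]."""
    pos = start
    while pos + 8 <= end:
        size, kind = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:                  # 64-bit size follows the type field
            if pos + 16 > end:
                return
            size = struct.unpack_from(">Q", buf, pos + 8)[0]
            header = 16
        elif size == 0:                # box runs to the end of the enclosing span
            size = end - pos
        if size < header or pos + size > end:
            return                     # malformed or truncated: stop quietly
        yield kind.decode("latin-1"), pos + header, pos + size
        pos += size


def find(buf: bytes, kind: str, span: Optional[Span]) -> Optional[Span]:
    """Payload span of the first child box of the given type, or None."""
    if span is None:
        return None
    for k, s, e in iter_boxes(buf, *span):
        if k == kind:
            return s, e
    return None


def audio_sample_entries(buf: bytes) -> list[str]:
    """Four-character sample-entry code (e.g. 'mp4a') of every audio track."""
    codecs = []
    moov = find(buf, "moov", (0, len(buf)))
    if moov is None:
        return codecs
    for kind, s, e in iter_boxes(buf, *moov):
        if kind != "trak":
            continue
        minf = find(buf, "minf", find(buf, "mdia", (s, e)))
        if find(buf, "smhd", minf) is None:
            continue                   # no sound media header: not an audio track
        stsd = find(buf, "stsd", find(buf, "stbl", minf))
        # stsd payload: version/flags (4), entry count (4), then the first
        # entry's own size (4) and type (4).
        if stsd and stsd[1] - stsd[0] >= 16:
            codecs.append(buf[stsd[0] + 12:stsd[0] + 16].decode("latin-1"))
    return codecs
```

For a typical file with one AAC track, the result would be ["mp4a"]. The sketch stops at the sample-entry type; checking that an esds box with a valid AudioSpecificConfig sits inside it is the natural next step and is left out here.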

The order of the two big boxes matters for streaming. In a plain download file the order is often [ftyp][mdat][moov] — the table of contents sits at the end. That is fine for a file you have fully downloaded, but a browser trying to play while downloading can't start until it has the moov, so it would have to fetch the whole file first. The fix is fast-start (also called "web optimisation" or "moov atom relocation"): the moov box is moved to the front, giving [ftyp][moov][mdat], so the player gets the table of contents immediately and can begin playback after the first few seconds arrive. If your progressive-download videos "buffer forever before starting", an un-relocated moov is the first thing to check.
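Checking the box order is simple enough to script. The sketch below is an illustration under stated assumptions rather than a production tool: top_level_boxes and is_fast_start are hypothetical helper names, and the input is assumed to be the file's bytes, or just a prefix of them, starting at offset zero.

```python
import struct


def top_level_boxes(buf: bytes) -> list[str]:
    """List top-level box types in file order. Works on a prefix of the
    file too, because only the 8- or 16-byte headers are read."""
    kinds, pos = [], 0
    while pos + 8 <= len(buf):
        size, kind = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:                      # 64-bit size follows the type field
            if pos + 16 > len(buf):
                break
            size = struct.unpack_from(">Q", buf, pos + 8)[0]
            header = 16
        kinds.append(kind.decode("latin-1"))
        if size == 0 or size < header:     # 0 = runs to end of file; else corrupt
            break
        pos += size
    return kinds


def is_fast_start(buf: bytes) -> bool:
    """True when moov appears before the first mdat."""
    for kind in top_level_boxes(buf):
        if kind == "moov":
            return True
        if kind == "mdat":
            return False
    return False
```

A file laid out as ftyp, mdat, moov makes is_fast_start return False, which is the "buffers forever before starting" layout; after relocation the same check returns True.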

[Diagram: the MP4 box hierarchy for an audio track, showing ftyp, moov containing trak, mdia, minf with smhd, stbl, stsd, mp4a and esds, and the mdat media box, with a note on fast-start moov relocation.]
Figure 2. The MP4 box tree for an audio track: moov is the table of contents, mdat is the cargo hold, and the codec setup lives in stsd → mp4a → esds.

Multiple audio tracks in one MP4

Yes, an MP4 can hold many audio tracks — this is the answer to one of the most-asked questions about the format. Each language or mix is simply its own trak box inside moov, each with its own codec entry, channel layout, and language tag. The classic mistake is not whether MP4 can hold them, but whether the player exposes them: some progressive-download players and some browsers only present the first audio track, so a French viewer never sees the French option even though it is in the file. For reliable multi-track audio in a browser, the answer is usually streaming with a manifest (covered next), not a single progressive MP4.

fMP4 and CMAF: the same boxes, cut into segments for streaming

A plain MP4 is one long file, which is wrong for streaming, where a player wants to fetch a few seconds at a time and switch quality on the fly. Fragmented MP4 (fMP4) solves this by taking the exact same box format and splitting it into pieces. Instead of one giant moov followed by one giant mdat, an fMP4 stream begins with a small initialisation segment (the ftyp plus a moov that describes the tracks but contains no media), followed by a series of media segments, each a moof (movie fragment) box paired with its own small mdat. The moof is a mini table of contents for just that segment's samples; the mdat holds just that segment's audio and video.

This segmentation is what makes adaptive streaming work, and audio rides along inside it. The audio for each few-second segment is one moof+mdat pair, addressable on its own, so a player can request audio segments independently of video segments and switch audio tracks at a segment boundary. CMAF (the Common Media Application Format, ISO/IEC 23000-19) standardises this single fMP4 segment shape so the same audio and video segments can serve both HLS and DASH players. Before CMAF, a service often stored MPEG-TS segments for Apple's HLS and separate MP4 segments for DASH, which meant two copies of everything. CMAF lets one set of fMP4 segments feed both, so the second copy, along with the packaging and storage it required, goes away. How a player actually chooses among the audio segments it's offered is a streaming-layer question we cover in audio in HLS, DASH, CMAF.

Matroska and WebM: the flexible open container

Matroska (the .mkv file) is an open container built to hold an essentially unlimited number of tracks of any type, with rich metadata, chapters, and attachments. Its byte format is EBML (Extensible Binary Meta Language), a binary cousin of XML — a tree of named, length-prefixed elements, much like MP4's boxes but more general. As of 2024 Matroska is specified in an IETF document, RFC 9559, which moved the format from a community spec to a published standard.

Audio in Matroska lives in a TrackEntry element, and the field that names the codec is the CodecID, a text string. The mappings are readable on sight: A_OPUS for Opus, A_VORBIS for Vorbis, A_AAC for AAC, A_FLAC for FLAC, A_AC3 for Dolby Digital. The audio properties — sample rate, channel count, bit depth — sit in an Audio sub-element on the same track. One subtlety worth knowing: AAC stored in Matroska is stripped of its ADTS framing headers and muxed as raw frames, because Matroska supplies its own framing — a detail that bites you when a tool extracts the AAC and gets a stream a naive decoder can't parse without re-adding headers.
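Re-adding ADTS headers to raw AAC frames extracted from Matroska can be sketched as follows. The function names are hypothetical and the defaults (AAC-LC, 48 kHz, stereo) are assumptions for illustration; a real tool would read the object type, sample rate, and channel configuration from the track's setup data instead of hard-coding them. The header layout is the standard 7-byte ADTS header without CRC.

```python
# Sampling-frequency index values used by MPEG-4 audio.
SAMPLE_RATE_INDEX = {96000: 0, 88200: 1, 64000: 2, 48000: 3, 44100: 4,
                     32000: 5, 24000: 6, 22050: 7, 16000: 8, 12000: 9,
                     11025: 10, 8000: 11, 7350: 12}


def adts_header(payload_len: int, sample_rate: int = 48000,
                channel_config: int = 2, object_type: int = 2) -> bytes:
    """Build a 7-byte ADTS header (no CRC) for one raw AAC frame.

    object_type 2 is AAC-LC; channel_config equals the channel count
    for mono through 5.1 (1 to 6).
    """
    if not 1 <= object_type <= 4:
        raise ValueError("ADTS can only signal audio object types 1 to 4")
    if not 0 <= channel_config <= 7:
        raise ValueError("channel_config must fit in 3 bits")
    freq = SAMPLE_RATE_INDEX[sample_rate]
    total = payload_len + 7                # frame length includes the header
    if total >= 1 << 13:
        raise ValueError("frame too large for the 13-bit length field")
    return bytes([
        0xFF,                              # sync word, high 8 bits
        0xF1,                              # sync low 4 bits, MPEG-4, layer 0, no CRC
        ((object_type - 1) << 6) | (freq << 2) | (channel_config >> 2),
        ((channel_config & 3) << 6) | (total >> 11),
        (total >> 3) & 0xFF,
        ((total & 7) << 5) | 0x1F,         # length low bits + buffer fullness (all ones)
        0xFC,                              # buffer fullness low bits, one raw block
    ])


def wrap_adts(frames: list[bytes], **params: int) -> bytes:
    """Prefix each raw AAC frame with its own ADTS header."""
    return b"".join(adts_header(len(f), **params) + f for f in frames)
```

For a 100-byte AAC-LC stereo frame at 48 kHz, adts_header(100) yields the bytes FF F1 4C 80 0D 7F FC.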

WebM is a strict subset of Matroska designed for the open web. It deliberately allows only royalty-free codecs: VP8/VP9/AV1 for video, and for audio the spec says the CodecID SHOULD be A_VORBIS or A_OPUS. In practice WebM audio in 2026 is almost always Opus (see the Opus codec). If you serve <video> in HTML with a WebM file, your audio is Opus inside a Matroska-subset container — three independent facts that people routinely collapse into "it's a WebM".

MPEG-TS: the broadcast container that thinks in tiny packets

MPEG-TS (the MPEG-2 Transport Stream, ISO/IEC 13818-1) is the oldest container here and still the most widely deployed in broadcast, cable, satellite, and a great deal of legacy HLS. It was designed for a hostile world — a one-way broadcast channel where bits get lost and a receiver might tune in at any moment — so it works completely differently from the file-tree containers above. Instead of one big file with a table of contents at the front, MPEG-TS is a relentless stream of small, fixed 188-byte packets, each stamped with a number that says which stream it belongs to.

That number is the PID (Packet Identifier), and it is the key to how MPEG-TS handles audio. Each elementary stream — one video stream, one audio stream — gets its own PID, and its compressed data is chopped into PES (Packetised Elementary Stream) units that are then sliced across the 188-byte transport packets carrying that PID. A receiver finds the audio it wants by filtering for the right PID and reassembling the PES from the packets. Because the format must let a receiver join mid-stream, the map of "which PID is which" is broadcast over and over in a small table called the PMT (Program Map Table). The PMT lists each stream's PID and a stream_type code that names the codec: 0x0F for AAC (in ADTS framing), 0x11 for AAC in LATM framing, 0x81 for AC-3 (the ATSC convention), and 0x03/0x04 for the older MPEG-1/2 audio layers. Multiple audio languages are simply multiple audio PIDs listed in the PMT, each with a language descriptor.
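PID filtering can be sketched in Python as follows. This assumes a packet-aligned buffer with no lost sync, and the helper names are hypothetical; the stream_type table simply restates the codes listed above. Parsing the PMT itself involves section handling and is deliberately left out.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47

# Audio stream_type codes as listed in the text above.
AUDIO_STREAM_TYPES = {
    0x03: "MPEG-1 audio",
    0x04: "MPEG-2 audio",
    0x0F: "AAC (ADTS)",
    0x11: "AAC (LATM)",
    0x81: "AC-3 (ATSC convention)",
}


def packet_pid(packet: bytes) -> int:
    """Extract the 13-bit PID from one 188-byte transport packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a sync-aligned 188-byte TS packet")
    return ((packet[1] & 0x1F) << 8) | packet[2]


def count_pids(stream: bytes) -> dict[int, int]:
    """Count packets per PID in a packet-aligned buffer."""
    counts: dict[int, int] = {}
    for off in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        pid = packet_pid(stream[off:off + TS_PACKET_SIZE])
        counts[pid] = counts.get(pid, 0) + 1
    return counts


def packets_for_pid(stream: bytes, wanted: int) -> list[bytes]:
    """Keep only the packets that carry the wanted PID, in stream order."""
    return [stream[off:off + TS_PACKET_SIZE]
            for off in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE)
            if packet_pid(stream[off:off + TS_PACKET_SIZE]) == wanted]
```

A receiver that wants the Spanish audio would look up its PID in the PMT, call something like packets_for_pid on the incoming stream, and reassemble the PES from the payloads of the packets that remain.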

The practical takeaway: MPEG-TS trades efficiency for robustness. Repeating the PMT and stamping every 188-byte packet costs overhead a file container avoids, but it means a receiver can recover from loss and start playback from any point — which is exactly what broadcast and old-style HLS need. The cost is why the industry is migrating HLS to CMAF fMP4 wherever the player base allows.

[Diagram: an MPEG-TS multiplex showing a continuous row of 188-byte packets colour-coded by PID, a Program Map Table listing audio and video PIDs with their stream_type codes, and PES units being sliced across packets.]
Figure 3. MPEG-TS carries audio as a PID: the PMT maps each PID to a codec stream_type, and the audio PES is sliced across 188-byte packets so a receiver can join mid-stream.

A worked example: how much does a second audio track actually add?

A common worry is that adding audio tracks bloats the file. Let's size it with real arithmetic. Take a 90-minute feature, 48 kHz audio, and add one extra stereo AAC commentary track at a typical 128 kbps:

bitrate            = 128 kbps = 128,000 bits per second
duration           = 90 min   = 5,400 seconds
size (bits)        = 128,000 × 5,400 = 691,200,000 bits
size (bytes)       = 691,200,000 ÷ 8 = 86,400,000 bytes
size (MB)          = 86,400,000 ÷ 1,000,000 ≈ 86 MB

So a whole extra stereo language costs about 86 MB on a feature film — trivial next to a multi-gigabyte HD video track. A 5.1 AC-3 track at 384 kbps works out to three times that, roughly 259 MB, still small relative to the video. The container overhead to carry these tracks (the extra trak box, or the extra PID in TS) is negligible — kilobytes. This is why multi-language delivery is a track-and-storage decision, not a "can we afford it" decision; the real cost is at CDN scale across many titles, which we size in multi-language audio storage and CDN math.
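The same arithmetic generalises to any bitrate and running time. A small sketch, with track_size_mb as a hypothetical helper name, assuming a constant bitrate and decimal megabytes (1 MB = 1,000,000 bytes) as in the calculation above:

```python
def track_size_mb(bitrate_kbps: float, duration_min: float) -> float:
    """Size of a constant-bitrate audio track in decimal megabytes."""
    bits = bitrate_kbps * 1000 * duration_min * 60
    return bits / 8 / 1_000_000


print(track_size_mb(128, 90))  # 86.4  (stereo AAC commentary)
print(track_size_mb(384, 90))  # 259.2 (5.1 AC-3)
```

Container overhead is ignored here, which matches the observation that it amounts to kilobytes.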

A side-by-side comparison

| Criterion | MP4 / fMP4 | Matroska / WebM | MPEG-TS |
| --- | --- | --- | --- |
| Standard | ISO/IEC 14496-12 / -14; CMAF 23000-19 | RFC 9559 (Matroska); WebM subset | ISO/IEC 13818-1 |
| Structure | Box tree (moov/mdat) | EBML element tree (TrackEntry) | 188-byte packets with PIDs |
| How audio is named | stsd codec entry (mp4a + esds) | CodecID string (A_OPUS, A_AAC) | stream_type in the PMT |
| Typical audio codec | AAC; Opus in fMP4/WebM contexts | Opus (WebM), AAC/FLAC/AC-3 (MKV) | AAC, AC-3, MP2 |
| Multi-track audio | Yes (multiple trak) | Yes (many TrackEntry) | Yes (multiple audio PIDs) |
| Best at | Universal playback, streaming via CMAF | Flexibility, archival, open web | Broadcast, loss recovery, legacy HLS |
| Start-from-anywhere | Only at segment boundaries (fMP4) | Cluster boundaries | Any point (designed for it) |

A pitfall worth memorising: the silent-track packaging bug

The most common container-level audio failure is not a missing track — it is a present track that plays silence or refuses to decode, and it almost always comes from missing or wrong codec-setup metadata. In MP4 this is a missing or malformed esds (the AAC config bytes); in Matroska it is a missing CodecPrivate; in MPEG-TS it is a wrong stream_type or absent audio descriptor in the PMT. The file looks complete, plays in one tool that guesses well, and fails in a strict player like Safari or a hardware decoder that trusts the metadata literally. The lesson: when audio fails in one player but works in another, suspect the container's codec-setup box before you suspect the codec or the audio data itself. A remux that rebuilds the setup metadata fixes far more "broken audio" reports than a re-encode does.

Where Fora Soft fits in

Container choices surface in nearly every video product we build. In OTT and streaming work, the move to CMAF fMP4 means one set of audio segments serving HLS and DASH instead of two, which simplifies the packaging pipeline and the storage bill. In video conferencing and e-learning recording, the call is recorded into a container — often Matroska or MP4 — where the audio track's setup metadata and language tags have to be correct or downstream transcription and playback break. In surveillance and telemedicine, MPEG-TS and fMP4 both appear depending on whether the path is broadcast-style or web-style. Across all of them, the audio bugs that reach a user — wrong track, missing language, silent playback — are usually container decisions, which is why we treat packaging as a first-class part of the pipeline, not an afterthought.

References

  1. ISO/IEC 14496-12:2022, Information technology — Coding of audio-visual objects — Part 12: ISO base media file format (ISO/IEC, 2022). The controlling specification for the box structure (ftyp, moov, trak, mdia, minf, smhd, stbl, stsd), fragmented files (moof/mdat), and the audio sample-entry model used throughout this article. https://www.iso.org/standard/83102.html
  2. ISO/IEC 14496-14:2020, Information technology — Coding of audio-visual objects — Part 14: MP4 file format (ISO/IEC, 2020). The MP4-specific profile of ISOBMFF, including the mp4a audio sample entry and the esds box carrying the MPEG-4 ES_Descriptor / AudioSpecificConfig. https://www.iso.org/standard/79110.html
  3. ISO/IEC 23000-19:2024, Information technology — Multimedia application format (MPEG-A) — Part 19: Common media application format (CMAF) for segmented media (ISO/IEC, 2024). Defines the single fragmented-MP4 segment shape that lets one set of audio/video segments serve both HLS and DASH. https://www.iso.org/standard/85623.html
  4. RFC 9559, Matroska Media Container Format Specification (IETF, October 2024). The published specification for Matroska's EBML structure, TrackEntry, and CodecID-based codec mapping. https://datatracker.ietf.org/doc/rfc9559/
  5. Matroska Codec Mappings (Matroska.org / IETF CELLAR draft). The CodecID string table — A_OPUS, A_VORBIS, A_AAC, A_FLAC, A_AC3 — and the note that AAC is stored without ADTS headers. https://www.matroska.org/technical/codec_specs.html
  6. WebM Container Guidelines (The WebM Project). States that for WebM the audio CodecID SHOULD be A_VORBIS or A_OPUS, constraining WebM audio to royalty-free codecs. https://www.webmproject.org/docs/container/
  7. ISO/IEC 13818-1:2023, Information technology — Generic coding of moving pictures and associated audio information — Part 1: Systems (ISO/IEC, 2023). The MPEG-2 Transport Stream specification: 188-byte packets, PIDs, PES, the PMT, and stream_type codes for audio. https://www.iso.org/standard/83239.html
  8. ATSC A/53 Part 3:2013, Service Multiplex and Transport Subsystem Characteristics (ATSC, August 2013). The North American convention that maps AC-3 audio to stream_type 0x81 in the MPEG-TS PMT, distinguished here from the base MPEG-2 registrations. https://www.atsc.org/wp-content/uploads/2015/03/A53-Part-3-2013.pdf
  9. Fun with Container Formats: Fragmented MP4 & CMAF (Bitmovin engineering blog, accessed 2026-06-05). First-party deployer explanation of fMP4 init segments and moof/mdat media segments used to corroborate the streaming-segment description; the ISO standards above are authoritative where they differ. https://bitmovin.com/fun-with-container-formats-2/