Storage and CDN Math for Audio: How Much Does the Multi-Track Tax Cost?

Why this matters

If you run an over-the-top video service, an e-learning platform, a telemedicine product, or any app that ships recorded video to viewers in more than one language, someone on your team has either already asked "what do all these audio tracks cost us?" or is about to. This article is written for the product manager, founder, or operations lead who approves the cloud bill but has never been shown how audio actually accrues cost — and who needs to push back, intelligently, when an engineer proposes adding 5.1 in twelve languages "because we can". By the end you will be able to estimate the storage and delivery cost of any multi-track audio plan on the back of an envelope, you will know the two billing models content delivery networks use and which one your contract is on, and you will know exactly which tracks are safe to cut. Every number below is reproducible from public bitrate figures and 2026 cloud pricing, and the arithmetic is shown out loud so you can change the inputs and redo it yourself.

The one idea: audio is cheap per track, expensive per matrix

Start with the mental model that the rest of the article hangs on. A single audio track is tiny. The reason is simple: video has to describe millions of pixels changing thirty or sixty times a second, while audio only has to describe a pressure wave wiggling a few tens of thousands of times a second across one or two (occasionally eight or sixteen) channels. The result is a size difference of roughly forty to one between a typical 1080p video stream and the stereo audio that rides with it.

So why does anyone worry about audio cost at all? Because real services do not ship one audio track. They ship a matrix. Down one side of the matrix run the languages — English, Spanish, French, German, Japanese, and so on. Across the other side run the formats — plain stereo for phones, 5.1 surround for living-room sound bars, Dolby Atmos for premium home theatre. Multiply the two and you do not have one audio track riding along for free; you have dozens of full-length audio tracks, each one the full runtime of the title, all stored and all delivered.

That is the multi-track tax. It is not that any single track is expensive. It is that the number of tracks multiplies in two dimensions at once, and the cost multiplies with it. Keep this picture in your head: audio is cheap per track and expensive per matrix. Everything below is a consequence of working out how big that matrix gets and what each cell costs.

Figure 1. One video stream at roughly 5 megabits per second against a single stereo audio track at 128 kilobits per second: one audio track is about 2.5 percent of the video. A real title multiplies audio across languages and formats until the matrix, not the single track, drives the cost.

Step one: how big is one audio track?

Before we can cost a matrix, we need the cost of one cell. The size of an audio track depends on just two things: its bitrate (the number of bits the codec spends per second of sound, measured in kilobits per second, kbps) and its duration. Storage size is just bitrate multiplied by time, then converted from bits to bytes.

Take the most common streaming audio track: stereo encoded with the AAC-LC codec — the workhorse Advanced Audio Coding format you will find on nearly every video platform — at 128 kbps. (Remember, a bitrate of 128 kbps means the codec produces 128 thousand bits for every second of audio.) For a 90-minute feature film, the math runs like this:

128,000 bits/second × 5,400 seconds = 691,200,000 bits
691,200,000 bits ÷ 8 = 86,400,000 bytes
86,400,000 bytes ÷ 1,000,000 = 86.4 megabytes (MB)

So one stereo AAC-LC track for a 90-minute film is about 86 megabytes. Now compare that to the video. A typical 1080p video stream runs around 5 megabits per second (Mbps) — note the capital M; that is millions of bits per second, and the styling convention is that audio is measured in kbps and video in Mbps. The same 90-minute film as video is:

5,000,000 bits/second × 5,400 seconds ÷ 8 ÷ 1,000,000 = 3,375 MB ≈ 3.4 gigabytes (GB)

The single stereo audio track is 86 MB against 3,375 MB of video — about 2.5 percent of the video size, or one part in forty. If you only ever ship one language in one format, audio genuinely is a rounding error and you can stop reading. The rest of this article is about what happens when you do not.
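The same arithmetic can be written as a small helper so you can change the inputs. This is a minimal sketch: the function name `track_size_mb` is ours, and it assumes constant bitrate and decimal megabytes (1 MB = 1,000,000 bytes), as the worked figures above do.

```python
def track_size_mb(bitrate_kbps: float, duration_min: float) -> float:
    """Size in decimal megabytes of a constant-bitrate stream."""
    bits = bitrate_kbps * 1_000 * duration_min * 60
    return bits / 8 / 1_000_000


audio_mb = track_size_mb(128, 90)      # stereo AAC-LC at 128 kbps
video_mb = track_size_mb(5_000, 90)    # 1080p rung at 5 Mbps

print(f"audio: {audio_mb:.1f} MB")
print(f"video: {video_mb:.0f} MB")
print(f"audio as share of video: {audio_mb / video_mb:.1%}")
```

Swap in your own bitrate and runtime to get the per-cell size for any track.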

The bitrate table you will actually use

Here are the audio bitrates real services use, with the size each produces for a 90-minute title. These are the per-cell costs you will multiply.

Track type	Codec	Bitrate	90-min size	Notes
Low stereo (mobile)	HE-AAC v1	64 kbps	43 MB	Phone fallback rung
Standard stereo	AAC-LC	128 kbps	86 MB	The default everywhere
High stereo	AAC-LC	256 kbps	173 MB	Music-led content
5.1 surround	E-AC-3 (DD+)	384 kbps	259 MB	Living-room default
Dolby Atmos	E-AC-3 JOC	768 kbps	518 MB	Netflix premium ceiling
Dolby Atmos	AC-4	448 kbps	302 MB	Newer, more efficient

The arithmetic for every row is the same multiplication shown above; only the bitrate changes. Note one fact that surprises people: even Dolby Atmos, the most expensive audio format on the list, is 518 MB against 3,375 MB of a single 1080p video rung — still under one-sixth of one video stream, and your video ladder has several rungs. Audio never beats video on a per-track basis. The tax is entirely in the multiplication.

The Dolby numbers are worth pinning down because they are the ones people guess wrong. Streaming services cap Dolby Digital Plus with Joint Object Coding — the format that carries Atmos inside a Dolby Digital Plus stream, abbreviated DD+ JOC — at 768 kbps. Netflix raised its Atmos bitrate to that ceiling in May 2019, up from 448 kbps, and describes 768 kbps as the point above which the extra quality is imperceptible. The newer AC-4 codec delivers comparable Atmos quality at 448 kbps, which is why services migrating to AC-4 see their immersive-audio cost drop by roughly forty percent per track.
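To check the table, or to rebuild it for a different runtime, the rows can be regenerated from the bitrates alone. A minimal sketch, assuming constant bitrate and decimal megabytes; the dictionary keys are just labels we chose for the rows above.

```python
# Bitrates from the table above, in kbps.
BITRATES_KBPS = {
    "Low stereo (HE-AAC v1)": 64,
    "Standard stereo (AAC-LC)": 128,
    "High stereo (AAC-LC)": 256,
    "5.1 surround (E-AC-3)": 384,
    "Dolby Atmos (E-AC-3 JOC)": 768,
    "Dolby Atmos (AC-4)": 448,
}


def size_mb(bitrate_kbps: int, duration_min: int = 90) -> float:
    """Decimal megabytes for a constant-bitrate track."""
    return bitrate_kbps * 1_000 * duration_min * 60 / 8 / 1_000_000


sizes_mb = {name: round(size_mb(kbps)) for name, kbps in BITRATES_KBPS.items()}
for name, mb in sizes_mb.items():
    print(f"{name:<28} {mb:>4} MB")

# Per-track saving from moving Atmos from DD+ JOC (768 kbps) to AC-4 (448 kbps).
ac4_saving = 1 - BITRATES_KBPS["Dolby Atmos (AC-4)"] / BITRATES_KBPS["Dolby Atmos (E-AC-3 JOC)"]
print(f"AC-4 saving per Atmos track: {ac4_saving:.0%}")
```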

Step two: how big does the matrix get?

Now we multiply. Consider a real example, the kind a global streaming service ships every week: a 90-minute feature, offered in 8 languages, each language available in three formats — stereo 128 kbps, 5.1 at 384 kbps, and Atmos at 768 kbps.

That is 8 × 3 = 24 audio tracks. The storage for the audio matrix is the sum of all cells. Per language you carry one of each format:

Per language: 86 MB (stereo) + 259 MB (5.1) + 518 MB (Atmos) = 863 MB
All 8 languages: 863 MB × 8 = 6,904 MB ≈ 6.9 GB of audio per title

Now put that next to the video. A modern video ladder for a 1080p title carries maybe six bitrate rungs, from a 400 kbps mobile rung up to an 8 Mbps top rung, and the whole ladder for a 90-minute film lands somewhere around 12 to 15 GB. So our audio matrix — 6.9 GB across 24 tracks — is genuinely comparable to the entire video ladder. This is the moment the "audio is free" intuition breaks. At eight languages with three formats each, audio is no longer a rounding error; it is roughly half the size of the video.
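The matrix total is a sum over languages and formats, which is easy to script. The sketch below assumes every language carries every format and that all tracks run the full 90 minutes. It lands on 6,912 MB rather than the 6,904 MB above only because it does not round each cell to whole megabytes first. The 13 GB video-ladder figure is an assumed midpoint of the 12 to 15 GB range.

```python
LANGUAGES = 8
FORMATS_KBPS = {"stereo": 128, "5.1": 384, "atmos": 768}
DURATION_S = 90 * 60
VIDEO_LADDER_GB = 13  # assumption: midpoint of the 12-15 GB range


def size_mb(bitrate_kbps: int) -> float:
    """Decimal megabytes for one full-length track."""
    return bitrate_kbps * 1_000 * DURATION_S / 8 / 1_000_000


track_count = LANGUAGES * len(FORMATS_KBPS)
per_language_mb = sum(size_mb(kbps) for kbps in FORMATS_KBPS.values())
matrix_gb = per_language_mb * LANGUAGES / 1_000

print(f"{track_count} tracks, {per_language_mb:.0f} MB per language")
print(f"audio matrix: {matrix_gb:.2f} GB "
      f"({matrix_gb / VIDEO_LADDER_GB:.0%} of a {VIDEO_LADDER_GB} GB video ladder)")
```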

Push it further and it gets worse. Add a low-bitrate mobile stereo rung per language (an audio ladder, not just one stereo rung), add a descriptive-audio track for accessibility in your top three languages, add a 256 kbps high-stereo rung for music-heavy titles, and the matrix can reach 40-plus tracks and exceed the video ladder outright. A common mistake is to treat each addition as "just one more small track" — true per track, but the matrix is where small numbers become a large one.

Figure 2. The eight-language, three-format matrix for a 90-minute feature, each cell labelled with its size in megabytes: 24 tracks and 6.9 GB in total, comparable to the full video ladder of about 13 GB.

Step three: storage cost — cheap and predictable

Storage is the easy bill. Cloud object storage — the kind of bucket where you keep your media files, such as Amazon S3 — costs roughly $0.023 per gigabyte per month for standard hot storage in 2026. Our 6.9 GB audio matrix therefore costs:

6.9 GB × $0.023/GB/month = $0.16 per title per month

Sixteen cents a month to store every audio track for that film. Across a catalogue of 5,000 titles with the same audio plan:

6.9 GB × 5,000 titles = 34,500 GB ≈ 34.5 terabytes (TB)
34,500 GB × $0.023 = $794 per month for all audio storage

Under $800 a month to store the audio for a five-thousand-title catalogue. Storage is not where the multi-track tax hurts. It is real, it is predictable, and it scales linearly with the matrix — but it is small. The lesson: do not optimise audio to save storage. Optimise it to save delivery, which is the next and much larger bill.
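The storage bill is a single multiplication, sketched here so the rate and catalogue size can be changed. The function name is ours, and the default rate is the $0.023 per GB-month figure quoted above; check your own provider and storage tier before relying on it.

```python
def monthly_storage_cost_usd(gb_per_title: float, titles: int,
                             usd_per_gb_month: float = 0.023) -> float:
    """Monthly object-storage cost for a catalogue with a uniform audio plan."""
    return gb_per_title * titles * usd_per_gb_month


per_title = monthly_storage_cost_usd(6.9, 1)
catalogue = monthly_storage_cost_usd(6.9, 5_000)

print(f"per title: ${per_title:.2f} per month")
print(f"5,000 titles: ${catalogue:,.0f} per month")
```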

Step four: CDN egress — where the tax actually bites

Delivery is the bill that matters, and it works differently from storage. A content delivery network, abbreviated CDN, is the chain of edge servers that holds copies of your files close to viewers so the bytes travel a short distance. (Think of a CDN as a chain of corner stores: the same product, stocked closer to where each customer lives.) You pay the CDN for every byte that leaves its edge toward a viewer — this outbound traffic is called egress. And egress is charged in one of two models, which you must know to reason about your bill.

Model A — per gigabyte

The simplest model: you pay a flat rate for every gigabyte delivered. 2026 public rates span a wide range — Amazon CloudFront in the US is around $0.085 per GB at low volume, falling toward $0.02 per GB at high committed volume; Bunny.net starts near $0.01 per GB; Cloudflare does not charge a per-gigabyte egress fee for standard cached delivery at all. The per-GB model is easy to reason about: total cost is bytes delivered times the rate.

Here is the key insight for audio. Egress is driven by what viewers actually watch, not by what you store. A viewer streams exactly one audio track per play — their chosen language in their device's best supported format. So the audio egress for one play is the size of one track, not the whole matrix:

One play, 90-min film, stereo 128 kbps track = 86 MB delivered
At $0.085/GB: 0.086 GB × $0.085 = $0.0073 per play (less than a cent)
For 1,000,000 plays: 86,000 GB × $0.085 = $7,310 of audio egress

Now the same million plays of the video, at the average rung of roughly 4 Mbps for a 90-minute film (about 2,700 MB delivered per play):

2,700 MB × 1,000,000 plays = 2,700,000 GB
2,700,000 GB × $0.085 = $229,500 of video egress

Audio egress is $7,310 against video's $229,500 — about 3 percent of the delivery bill. This is the number to anchor on: in per-GB billing, audio is a small, steady share of egress as long as each viewer streams one track. The matrix size affects storage but not egress, because nobody downloads the whole matrix. That single fact rules out half the panic about audio cost.
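The per-GB comparison can be reproduced with a few lines. This sketch assumes every play streams the full 90 minutes and that a single flat rate applies to all traffic; real contracts usually have volume tiers, so treat the output as an upper-bound estimate at the quoted low-volume rate.

```python
def egress_cost_usd(mb_per_play: float, plays: int, usd_per_gb: float) -> float:
    """Per-GB egress cost, using decimal units (1 GB = 1,000 MB)."""
    return mb_per_play / 1_000 * plays * usd_per_gb


PLAYS = 1_000_000
RATE = 0.085  # USD per GB, the low-volume figure quoted above

audio_usd = egress_cost_usd(86, PLAYS, RATE)      # one stereo 128 kbps track
video_usd = egress_cost_usd(2_700, PLAYS, RATE)   # average 4 Mbps video rung
audio_share = audio_usd / (audio_usd + video_usd)

print(f"audio: ${audio_usd:,.0f}  video: ${video_usd:,.0f}")
print(f"audio share of total egress: {audio_share:.1%}")
```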

Model B — 95th percentile (sustained megabits per second)

The second model is the one that surprises finance teams. Large streaming contracts — and most peering and transit arrangements — bill not by total gigabytes but by your sustained peak bandwidth, using the 95th-percentile method. Here is how it works, step by step. The provider samples your outbound traffic rate every five minutes for the whole month — that is 8,640 samples in a 30-day month. It sorts those samples from lowest to highest, throws away the top five percent (the 432 highest samples, your brief spikes), and bills you for the highest sample that remains, expressed in megabits per second.

The point of dropping the top five percent is to forgive short bursts — a sudden premiere, a flash crowd — and charge you instead for the bandwidth you sustain. In this model your bill is set by your busy-hour concurrency: how many streams are flowing at your peak moment, multiplied by the bitrate of each stream.
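The sort-and-discard procedure described above can be sketched in a few lines. The function name and the synthetic traffic are hypothetical, and providers differ in details such as rounding and whether inbound and outbound traffic are measured separately, so this illustrates the method rather than any one provider's exact calculation.

```python
def billable_95th_mbps(samples_mbps: list[float]) -> float:
    """Sort the samples, drop the top 5 percent, return the highest remaining."""
    ordered = sorted(samples_mbps)
    discard = len(ordered) * 5 // 100          # 8,640 samples -> 432 discarded
    return ordered[len(ordered) - discard - 1]


SAMPLES_PER_MONTH = 30 * 24 * 12               # one sample every five minutes

# 400 spike samples (about 33 hours) fit inside the forgiven top 5 percent.
short_burst = [100.0] * (SAMPLES_PER_MONTH - 400) + [1_000.0] * 400
# 500 spike samples (about 42 hours) do not, so the spike becomes the bill.
long_burst = [100.0] * (SAMPLES_PER_MONTH - 500) + [1_000.0] * 500

print(billable_95th_mbps(short_burst))
print(billable_95th_mbps(long_burst))
```

The two synthetic months show why sustained peaks matter: the forgiven window is 432 samples, or 36 hours, per 30-day month.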

This changes how audio cost behaves. Add a richer default audio format and you raise the per-stream bitrate at peak, which raises your 95th-percentile number, which raises the bill — even though no single viewer downloaded more files. Worked example: suppose your busy-hour peak is 100,000 concurrent streams. If your default audio is stereo at 128 kbps, audio contributes:

100,000 streams × 128 kbps = 12,800,000 kbps = 12.8 gigabits per second (Gbps) at peak

If you switch the default to 5.1 at 384 kbps, audio's peak contribution triples to 38.4 Gbps; switch the default to Atmos at 768 kbps and it is 76.8 Gbps. Against a video peak of 100,000 × 4 Mbps = 400 Gbps, stereo audio adds about 3 percent to the 95th-percentile number, but an Atmos default would add nearly 20 percent. The pitfall here is making a heavy format the default rather than an opt-in: in 95th-percentile billing, the default format multiplies across your entire concurrent audience at peak, and that is the one audio decision that can move the bill by double digits.
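The peak contribution of each candidate default can be tabulated the same way. A minimal sketch, assuming every concurrent viewer streams the default audio format and the 4 Mbps average video rung used above.

```python
CONCURRENT_STREAMS = 100_000
VIDEO_KBPS = 4_000
DEFAULT_AUDIO_KBPS = {"stereo": 128, "5.1": 384, "atmos": 768}


def peak_gbps(streams: int, kbps: int) -> float:
    """Aggregate bandwidth at peak, in gigabits per second."""
    return streams * kbps / 1_000_000


video_peak = peak_gbps(CONCURRENT_STREAMS, VIDEO_KBPS)
audio_peaks = {fmt: peak_gbps(CONCURRENT_STREAMS, kbps)
               for fmt, kbps in DEFAULT_AUDIO_KBPS.items()}

for fmt, gbps in audio_peaks.items():
    print(f"{fmt:<7} {gbps:5.1f} Gbps  (+{gbps / video_peak:.1%} on top of video)")
```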

Figure 3. The two CDN billing models. Per-gigabyte (left panel) tracks total bytes delivered at a dollar-per-gigabyte rate; 95th-percentile (right panel) sorts a month of five-minute bandwidth samples, discards the top five percent, and bills the highest remaining sample, which makes it sensitive to your default audio format.

Step five: the pruning rules that cut the bill safely

Now the payoff. Here is where to cut, in priority order, and why each cut is safe.

Prune by viewer telemetry, not by guesswork. The single biggest waste is storing and packaging tracks nobody plays. If your analytics show that a title in Finnish gets 40 plays a month and the Finnish 5.1 track has never been selected, that 5.1 track is paying storage and packaging cost for zero egress benefit. Pull per-language, per-format selection counts from your player analytics and cut the long tail. This is per-title pruning: not every title needs the full matrix; a niche documentary does not need Atmos in twelve languages.

Do not build an audio ABR ladder unless you have proven you need one. Adaptive bitrate, abbreviated ABR, means offering several bitrate rungs so the player can switch down on a weak connection. Video needs many rungs because video bitrate is huge and varies enormously. Audio bitrate is small enough that one well-chosen rung (typically 128 kbps stereo, with a single 64 kbps fallback for genuinely constrained mobile networks) covers nearly everyone. A five-rung audio ladder stores five stereo tracks per language where one or two would do, to defend against a problem most viewers never have. (Our companion article on audio adaptive bitrate ladders walks through exactly when the second rung earns its keep.)

Make heavy formats opt-in, never default. As the 95th-percentile math showed, the default format multiplies across your whole concurrent audience. Default to stereo; let the device and the viewer's settings request 5.1 or Atmos only when the playback chain supports it. This costs you nothing in quality — a phone cannot render Atmos anyway — and protects your peak-bandwidth bill.

Migrate immersive audio to AC-4 where your devices support it. AC-4 delivers Atmos quality at 448 kbps against DD+ JOC's 768 kbps — about forty percent less per track. On a large Atmos catalogue that is a real saving on both storage and the 95th-percentile peak, with no perceptible quality loss.

Share audio across video renditions. In HLS, DASH, and CMAF, audio lives in its own rendition group, separate from the video — so one audio track pairs with every video rung, not one per rung. If your packager is duplicating audio per video rendition, you are paying for the matrix several times over for no reason. (See our companion piece on how audio rides inside HLS, DASH, and CMAF for the packaging detail.)
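The first rule above, pruning by telemetry, can be sketched as a filter over per-track selection counts. Everything here is hypothetical: the `TrackStats` record, the sample numbers, and the policy of always keeping stereo are assumptions for illustration, and your analytics export will have its own shape.

```python
from typing import NamedTuple


class TrackStats(NamedTuple):
    language: str
    fmt: str
    bitrate_kbps: int
    selections: int  # times viewers picked this track in the reporting window


def size_mb(bitrate_kbps: int, duration_min: int = 90) -> float:
    return bitrate_kbps * 1_000 * duration_min * 60 / 8 / 1_000_000


def prune_candidates(stats, min_selections=1, always_keep=("stereo",)):
    """Tracks below the selection threshold, never touching the kept formats."""
    return [t for t in stats
            if t.fmt not in always_keep and t.selections < min_selections]


# Hypothetical telemetry for one title.
stats = [
    TrackStats("en", "stereo", 128, 91_000),
    TrackStats("en", "5.1", 384, 22_000),
    TrackStats("fi", "stereo", 128, 40),
    TrackStats("fi", "5.1", 384, 0),
    TrackStats("fi", "atmos", 768, 0),
]

cuts = prune_candidates(stats)
reclaimed_mb = sum(size_mb(t.bitrate_kbps) for t in cuts)
for t in cuts:
    print(f"cut {t.language} {t.fmt}")
print(f"storage reclaimed for this title: {reclaimed_mb:.0f} MB")
```

Keeping stereo unconditionally means every language still has a playable track after pruning; raise `min_selections` to cut deeper into the long tail.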

How the standards keep audio separate from video

The reason audio egress stays small — one track per play, not the whole matrix — is baked into the streaming standards, and it is worth understanding so you can confirm your packager does it right.

In HTTP Live Streaming, defined in IETF RFC 8216, alternate audio tracks are declared with EXT-X-MEDIA tags that group renditions by GROUP-ID. The video variants reference an audio group rather than embedding audio, so the player downloads one video rendition and one audio rendition and plays them in sync. In MPEG-DASH, standardised as ISO/IEC 23009-1, audio and video each get their own AdaptationSet inside the manifest, and the player picks one Representation from each. The Common Media Application Format, ISO/IEC 23000-19, lets a single set of audio segments serve both your HLS and your DASH packaging, which halves audio storage if you previously packaged audio twice. The net effect of all three: audio is addressed and delivered independently of video, so a viewer's egress is one audio track, never the matrix. If your storage and egress numbers suggest otherwise, the bug is in your packaging configuration, not in the standard.
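One way to confirm the HLS side is to inspect the master playlist your packager emits. The sketch below is a deliberately small checker, not a full RFC 8216 parser: it only reads `EXT-X-MEDIA` and `EXT-X-STREAM-INF` attribute lists and reports video variants that do not point at a declared audio group. The sample playlist, paths, and function names are hypothetical. A variant without an `AUDIO` attribute is legal HLS, so a warning here means "look closer", typically because audio is muxed into the video segments.

```python
import re

# KEY=value pairs; quoted values may contain commas (e.g. CODECS).
ATTR = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attrs(line: str) -> dict:
    """Parse the attribute list that follows the first colon of an HLS tag."""
    _, _, attr_text = line.partition(":")
    return {key: value.strip('"') for key, value in ATTR.findall(attr_text)}


def check_master_playlist(text: str) -> list[str]:
    """Return a warning per video variant that lacks a declared audio group."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    audio_groups = {
        parse_attrs(ln).get("GROUP-ID")
        for ln in lines
        if ln.startswith("#EXT-X-MEDIA:") and parse_attrs(ln).get("TYPE") == "AUDIO"
    }
    warnings = []
    for ln in lines:
        if not ln.startswith("#EXT-X-STREAM-INF:"):
            continue
        group = parse_attrs(ln).get("AUDIO")
        if group is None:
            warnings.append(f"no AUDIO group (audio may be muxed): {ln}")
        elif group not in audio_groups:
            warnings.append(f"AUDIO group '{group}' is not declared: {ln}")
    return warnings


# Hypothetical packager output: two languages shared by both video rungs.
PLAYLIST = """#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",NAME="English",LANGUAGE="en",DEFAULT=YES,URI="audio/en.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud",NAME="Spanish",LANGUAGE="es",URI="audio/es.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=5128000,CODECS="avc1.640028,mp4a.40.2",AUDIO="aud"
video/1080p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=928000,CODECS="avc1.4d401f,mp4a.40.2",AUDIO="aud"
video/480p.m3u8
"""

print(check_master_playlist(PLAYLIST) or "all variants reference a shared audio group")
```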

Where Fora Soft fits in

We build video products where audio cost is a real line item: over-the-top and Internet-TV platforms with multi-language catalogues, e-learning systems that ship lectures in several languages, and telemedicine and conferencing apps where recorded sessions accumulate audio tracks over time. The recurring pattern we see is over-provisioning — full surround ladders in languages with negligible viewership, heavy formats set as the default, and audio duplicated per video rendition by a misconfigured packager. The fix is almost never a worse experience for viewers; it is pruning by telemetry and correcting the packaging so audio is shared, not multiplied. If you are sizing the audio bill for a new service or auditing an existing one, this is the kind of analysis we do early in a project.

Call to action

Talk to an audio engineer: book a 30-minute scoping call to talk through your audio CDN cost plan.
See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the Audio Storage & CDN Cost Cheat Sheet: a one-page reference with per-codec track size for a 90-minute film (HE-AAC 64 kbps to Atmos 768 kbps), the language × format matrix formula and totals, both CDN billing models (per-GB and 95th-percentile), and the five pruning rules.

References

IETF RFC 8216, HTTP Live Streaming (R. Pantos, Ed., August 2017) — EXT-X-MEDIA rendition groups (GROUP-ID) that separate alternate audio from video variants. https://www.rfc-editor.org/rfc/rfc8216 — Tier 1 (official spec).
ISO/IEC 23009-1:2022, Dynamic adaptive streaming over HTTP (DASH) — Part 1: Media presentation description and segment formats — audio and video as separate AdaptationSets. Catalogue: https://www.iso.org/standard/83314.html — Tier 1 (normative text paywalled; cited from catalogue + DASH-IF guidelines).
ISO/IEC 23000-19:2024, Common Media Application Format (CMAF) for segmented media — single audio segment set shared across HLS and DASH packaging. Catalogue: https://www.iso.org/standard/85623.html — Tier 1 (paywalled; cited from catalogue).
Amazon Web Services, Amazon S3 pricing (accessed 2026-06-06) — standard storage ≈ $0.023/GB-month. https://aws.amazon.com/s3/pricing/ — Tier 4 (vendor pricing).
Amazon Web Services, Amazon CloudFront pricing (accessed 2026-06-06) — US egress ≈ $0.085/GB at low volume, lower with commit. https://aws.amazon.com/cloudfront/pricing/ — Tier 4 (vendor pricing).
Bunny.net, Network 95th percentile academy article (accessed 2026-06-06) — definition of 95th-percentile billing and per-GB rates from $0.01/GB. https://bunny.net/academy/cdn/what-is-network-95th-percentile/ — Tier 6 (educational).
FlatpanelsHD, Netflix debuts 'high-quality audio'; higher bitrate for Dolby Atmos & 5.1 (May 2019) — Netflix DD+ JOC Atmos raised to 768 kbps from 448 kbps. https://www.flatpanelshd.com/news.php?subaction=showfull&id=1556717977 — Tier 4 (industry press, corroborated).
TechRadar, Dolby Atmos on streaming... blind test results (2024) — AC-4 L4 at 448 kbps vs DD+ JOC at 768 kbps listening comparison. https://www.techradar.com/televisions/blu-ray/dolby-atmos-on-streaming-will-finally-sound-as-good-as-4k-blu-ray-based-on-these-blind-test-results — Tier 5 (test reporting).
Stackscale, What's the 95th percentile billing method? (accessed 2026-06-06) — 5-minute sampling, drop top 5%, bill the next peak. https://www.stackscale.com/blog/95-percentile-metering-billing-bandwidth/ — Tier 6 (educational, corroborates Bunny.net).

Note on source tiers: this article's protocol claims (audio/video separation in HLS, DASH, CMAF) rest on Tier-1 standards (RFC 8216, ISO/IEC 23009-1, ISO/IEC 23000-19). Where standards bodies disagree with vendor practice we follow the standard; here there is no conflict — the cost behaviour is a direct consequence of the standards' rendition-group design. Pricing and bitrate figures are vendor/industry sources, dated, and reproducible from the shown arithmetic.

Related glossary terms

Bitrate

Stereo

Dolby Atmos

AC-4

HLS

AAC

CMAF

AAC-LC