Why this matters

If you build VR or AR experiences, 360° video, immersive conferencing, or any product where sound is supposed to come from somewhere rather than just from your headphones, three technologies make up the entire toolkit: ambisonics, the HRTF, and binaural rendering. The choices you make with them set your bandwidth, your latency, and your perceived quality. This article is written for a product manager, founder, or operations lead who needs to scope a spatial-audio feature, brief an engineering team, or judge a vendor's "immersive audio" claim, and who has never heard of a spherical harmonic. It is also meant to hold up for a senior audio engineer: the localization tolerances, channel counts, and frequency boundaries here trace to standards bodies and primary research, not to marketing pages. By the end you will understand how a sound field is captured, how your ears decode direction, and how the two meet in the renderer that drives every pair of spatial-audio earbuds shipping today.


The problem: headphones have only two channels, the world has infinite directions

Start with the puzzle every spatial-audio system has to solve. In a real room, sound reaches you from every direction at once — a voice in front, a door closing behind, footsteps to the left. Your two ears, and the brain behind them, sort all of that into a 3D map without any conscious effort. But a pair of headphones delivers exactly two streams of sound, one per ear. The entire field of spatial audio is the science of feeding those two channels signals so cleverly shaped that your brain rebuilds the full 3D map as if you were in the original room.

Two channels sounds like too few, and for a long time it was. The breakthrough is that your brain does not actually need the original sound field — it only needs the two ear signals that would have arrived if the field were real. If a system can compute those two signals, the illusion is complete. That computation is what this article is about, and it splits cleanly into three jobs: describe the field (ambisonics), measure how a head turns a direction into an ear signal (the HRTF), and run the field through the measurement to get the two channels (binaural rendering).

A note on scope. Spatial audio for fixed speaker layouts — 5.1, 7.1, 7.1.4 — is a different problem, covered in channels and channel layouts and the immersive-format deep dives on Dolby Atmos and MPEG-H 3D Audio. This article is about the headphone case, because headphones are where VR, AR, mobile, and most conferencing actually live.

How your brain locates a sound: three cues

Before any technology, the biology. Your brain uses three physical cues to place a sound in space, and every spatial-audio system on earth is just a machine for reproducing those three cues on headphones. Understanding them is the key that makes the rest of the article click.

The first cue is timing, called the interaural time difference, or ITD — the gap between when a sound reaches one ear and when it reaches the other. A sound on your right hits your right ear a fraction of a millisecond before your left, because the left ear is farther away. That gap is tiny — at most about 0.6 to 0.7 milliseconds for a sound directly to the side — but the brain reads it precisely. ITD is the dominant cue for low-frequency sounds; the duplex theory of localization, the century-old framework still used today, puts the crossover at roughly 1,500 Hz: below it, your brain leans on timing (Wikipedia, Sound localization; multiple acoustics references).
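The size of that timing gap can be estimated with the classic Woodworth spherical-head approximation, ITD ≈ (r / c) · (θ + sin θ). The sketch below is illustrative only: the head radius of 8.75 cm and the speed of sound of 343 m/s are assumed typical values, not figures from this article's sources, and the model is only meant for azimuths between 0° and 90°.

```python
import math

HEAD_RADIUS_M = 0.0875    # assumed average adult head radius
SPEED_OF_SOUND = 343.0    # m/s in air at roughly room temperature

def itd_seconds(azimuth_deg):
    """Woodworth spherical-head estimate of the ITD for a distant source.

    azimuth_deg: 0 = straight ahead, 90 = directly to one side.
    """
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (theta + math.sin(theta))

for az in (0, 30, 60, 90):
    print(f"{az:>2} deg -> {itd_seconds(az) * 1000:.2f} ms")
```

With these assumed values, a source directly to the side lands inside the 0.6 to 0.7 millisecond range quoted above.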

The second cue is loudness, called the interaural level difference, or ILD: the difference in volume between the two ears. Your head is a physical obstacle, and at high frequencies it casts an acoustic "shadow," so the far ear hears a quieter version. Above about 1,500 Hz, the ILD takes over as the primary cue, because the wavelengths at those frequencies are short enough to be blocked by the head rather than bending around it (acoustics references; duplex theory).

ITD and ILD together tell your brain left versus right beautifully, but they leave a famous ambiguity. A sound directly in front and a sound directly behind produce nearly identical timing and level differences, because each position is the same distance from both ears. The same holds for a sound above and a sound at the mirrored angle below. More generally, every point on a cone centered on the axis through the two ears yields roughly the same ITD and ILD. This is the "cone of confusion," and the third cue resolves it.

The third cue is the pinna, the visible outer ear. Its folds and ridges filter incoming sound differently depending on the angle it arrives from, carving small dips — spectral notches — into the frequency content. The center frequency of the most prominent notch shifts predictably as a sound moves up or down, which is how your brain reads elevation. These cues live in the high frequencies: the most pronounced pinna filtering sits in the 4,000 to 9,000 Hz band, and reliable elevation judgment generally requires sound energy above roughly 7,000 Hz (research summarized in multiple localization reviews). The pinna is also intensely personal — no two people's ears are shaped the same — which is exactly why personalized spatial audio exists, as we will see.

[Figure 1: diagram of the three binaural localization cues. A head viewed from above with a sound source to the right, labeled with interaural time difference, interaural level difference, and the 1,500 Hz crossover, plus a side view of a pinna with spectral notch frequencies in the 4 to 9 kHz band for elevation.]

Figure 1. The three cues your brain uses to place a sound. Timing (ITD) and level (ILD) handle left-right, split at about 1,500 Hz; the pinna's spectral notches handle front-back and up-down.

The HRTF: your head turned into math

Now bundle all three cues into one object. The head-related transfer function, or HRTF, is a mathematical description of how a sound from one specific direction is reshaped by your head, torso, and ears before it reaches each eardrum. It captures the ITD, the ILD, and the pinna notches in a single measurement — for one direction. Measure enough directions across the full sphere around a listener and you have that listener's complete HRTF set: a lookup table that answers the question "if a sound comes from there, what exactly arrives at each ear?"

The HRTF is the single most important object in headphone spatial audio, so it is worth being precise about what it is. For each direction — say, 30° to the right and 15° up — the HRTF stores a pair of short filters, one per ear. Play any sound through the correct filter pair and it acquires all the directional cues of that direction. The brain hears the result over plain headphones and concludes, correctly, that the sound is at 30° right and 15° up.
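In the time domain, "playing a sound through the filter pair" is two convolutions, one per ear. The following is a minimal sketch of that step. The two filters are made-up placeholders that only mimic a delay and a level drop at the far ear; they are not measured HRTF data, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution; adequate for short filters."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def spatialize(mono, hrir_left, hrir_right):
    """Filter one mono signal through a left/right filter pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Placeholder pair for a source on the listener's right: the right ear hears
# the sound first and at full level, the left ear 3 samples later and quieter.
# Measured filters are far longer and also carry the pinna notches.
hrir_right = [1.0, 0.0, 0.0, 0.0]
hrir_left = [0.0, 0.0, 0.0, 0.6]

left, right = spatialize([1.0, 0.5, 0.25], hrir_left, hrir_right)
print(left)
print(right)
```

A production renderer would typically use FFT-based convolution and interpolate between measured directions; the direct loop here is only for readability.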

Where do HRTFs come from? They are measured. A subject sits in an anechoic chamber with tiny microphones in their ear canals, and a speaker plays test signals from hundreds of positions around them; the recorded ear signals, compared to the source, are the HRTF. Because doing this per person is expensive, the industry relies on shared research databases — CIPIC, SADIE, LISTEN, FABIAN, HUTUBS, ARI and dozens more — many of which are published in a common file format we will meet in a moment (SOFA Conventions database listings, 2024–2025). A product that ships "generic" spatial audio is almost always using one well-regarded database HRTF, often measured on a dummy head, for everyone.

Generic versus personalized HRTF

A generic HRTF is a compromise: it is built from someone else's ears, so its pinna notches sit at frequencies that may not match yours. For most people a good generic HRTF still produces convincing left-right placement and a usable sense of space, because ITD and ILD depend mostly on head size and ear spacing, which vary less than pinna shape. The weakness shows up in elevation and front-back judgments — the cues that depend on the pinna — where a mismatched HRTF causes errors: sounds meant to be in front collapse to inside the head, or elevation flattens out.

This is why personalized HRTF has become a 2026 product battleground. Apple's Personalized Spatial Audio asks you to scan your head and ears with the iPhone's front camera — turning your head left until a chime, then right — to build a profile tuned to your anatomy (Apple Support, Control Spatial Audio and head tracking, 2024–2025). Dolby's Atmos Personalized Rendering takes photographs of your head and ears and computes a personalized HRTF, abbreviated PHRTF, in the cloud (audioXpress, 2024). Research increasingly uses machine learning to predict an HRTF from a photo or a few anthropometric measurements rather than a full anechoic measurement, and a December 2024 review surveys the physical-modeling, anthropometric, and ML approaches now competing (MDPI Applied Sciences, A Review on HRTF Generation for Spatial Audio, 2024).

Pitfall — "spatial audio sounds like it's inside my head." This is the classic symptom of an HRTF mismatch (or no HRTF at all — just panned stereo). If externalization is poor, do not reach for more reverb; verify that a real HRTF is being applied and that head-tracking is active. The fix is almost always the rendering chain, not the content.

SOFA and AES69: the file format that made HRTFs portable

For years, every HRTF database stored its measurements in a different bespoke format, and using a new database meant writing a new parser. That ended with SOFA, the Spatially Oriented Format for Acoustics: a standardized file format for HRTFs and the related room impulse responses used in spatial audio. SOFA is not a vendor product; it is an open standard, ratified by the Audio Engineering Society as AES69, first published as AES69-2015 and revised as AES69-2020 and AES69-2022 (SOFA Conventions; AES69). A SOFA file (.sofa) holds the measured filters plus the exact geometry (source positions, listener orientation), so any compliant tool can load any database. If you are evaluating a spatial-audio SDK, "reads SOFA / AES69 HRTFs" is a concrete capability to ask for; it means you can swap in better or personalized HRTFs later instead of being locked to the vendor's built-in set.

Ambisonics: storing a whole sphere of sound

The HRTF answers "what arrives at the ears from one direction." But a real scene has sound coming from all directions at once, and head-tracking means the listener can rotate the whole scene at will. You need a way to store the entire sound field — a full sphere — in a form that is easy to rotate and easy to feed through HRTFs. That representation is ambisonics.

Here is the core idea in plain terms. Instead of storing "a sound at this speaker and a sound at that speaker," ambisonics stores the sound field as a set of overlapping directional patterns that, added together, reconstruct the pressure at the center of the sphere from every angle. Think of it like describing a landscape not as a list of objects but as a smooth mathematical surface — a few broad shapes that capture the big features, then finer shapes that add detail. Those shapes are called spherical harmonics, and you do not need the math; you need the consequence: more shapes means more spatial detail.

Order and channel count: the one formula that matters

The number of shapes is set by the ambisonic order. First-order ambisonics uses the four broadest patterns; higher orders add finer ones. The channel count follows one clean formula — for order N, you need (N + 1)² channels (RingBuffer, Understanding Ambisonics; standard ambisonics references). Plug in the numbers and the trade-off is immediate:

order 1 (FOA):  (1 + 1)² = 4 channels
order 2:        (2 + 1)² = 9 channels
order 3 (TOA):  (3 + 1)² = 16 channels
order 7:        (7 + 1)² = 64 channels

First-order ambisonics, abbreviated FOA, is four channels — cheap, universally supported, and the format YouTube accepts for 360° video. Third-order ambisonics, TOA, is sixteen channels — four times the data, noticeably sharper localization. The reason higher order helps is direct: more spherical-harmonic patterns let the field be reconstructed with finer angular resolution, so a sound's direction is rendered more precisely. Listening research confirms it — localization accuracy improves progressively across orders 1, 3, and 5 (peer-reviewed localization studies, e.g. AES/academic work on first vs higher-order reproduction). The cost is equally direct: every extra order is more channels to encode, store, and transmit.

Order | Common name | Channels | Spatial detail           | Typical use
1     | FOA         | 4        | Coarse, broad directions | YouTube 360°, low-bandwidth VR
2     | -           | 9        | Better                   | Facebook 360 (historical), mid VR
3     | TOA         | 16       | Sharp localization       | High-end VR, immersive production
7     | -           | 64       | Reference-grade          | Research, studio monitoring

Table 1. Ambisonic order versus channel count, following (N+1)². The jump from coarse to sharp localization is the jump from 4 to 16 channels. Sources: RingBuffer; standard ambisonics references; SSA Plugins on higher-order ambisonics.
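The channel counts in Table 1 can be reproduced in a couple of lines. This is a small sketch of the (N + 1)² rule; the function name is chosen here for illustration.

```python
def ambisonic_channels(order):
    """Number of channels for a full-sphere ambisonic signal of a given order."""
    if order < 0:
        raise ValueError("ambisonic order must be non-negative")
    return (order + 1) ** 2

for order in (1, 2, 3, 7):
    print(f"order {order}: {ambisonic_channels(order)} channels")
```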

AmbiX, ACN, and SN3D: why your file actually plays

Two files can both be "first-order ambisonics" and still be incompatible, because the four channels can be ordered and scaled in different conventions. The convention that won, and the one you should standardize on, is AmbiX. AmbiX pins down two choices. First, channel order follows ACN (Ambisonic Channel Numbering), which numbers the spherical harmonics by a fixed formula so the four FOA channels are W, Y, Z, X in that exact order. Second, the channels are scaled with SN3D normalization (Schmidt semi-normalization). Google adopted ACN ordering with SN3D normalization as the basis for the YouTube 360 format, which cemented AmbiX as the de-facto exchange standard (Wikipedia, Ambisonic data exchange formats; RingBuffer).
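The "fixed formula" behind ACN is the one listed in the RingBuffer reference: ACN = ℓ(ℓ + 1) + m, where ℓ is the spherical-harmonic degree and m runs from -ℓ to ℓ. The sketch below shows how that formula produces the W, Y, Z, X order; the function names and the lookup table of traditional letter names are illustrative.

```python
import math

def acn_index(degree, index):
    """ACN channel number for degree l and index m, with -l <= m <= l."""
    if degree < 0 or abs(index) > degree:
        raise ValueError("need degree >= 0 and -degree <= index <= degree")
    return degree * (degree + 1) + index

def acn_to_degree_index(acn):
    """Inverse mapping: recover (l, m) from an ACN channel number."""
    degree = math.isqrt(acn)
    return degree, acn - degree * (degree + 1)

# Traditional letter names of the four first-order components.
FOA_NAMES = {(0, 0): "W", (1, -1): "Y", (1, 0): "Z", (1, 1): "X"}

foa_order = [FOA_NAMES[acn_to_degree_index(n)] for n in range(4)]
print(foa_order)
```

The same mapping extends to any order, which is why ACN needs no per-order lookup table.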

The older FuMa (Furse-Malham) convention orders channels W, X, Y, Z and scales them differently; you will still meet it in legacy material and some plugins. The practical rule: produce and store AmbiX (ACN/SN3D), and convert FuMa at the boundary. YouTube's own spatial-audio specification is explicit — it requires AmbiX with ACN ordering and SN3D normalization, FOA delivered as a 4-channel W,Y,Z,X track at 48 kHz, or 6 channels (W,Y,Z,X,L,R) when you add head-locked stereo for narration or music that should not rotate with the head (YouTube spatial audio specification / Google spatial-media, 2024–2025).

Pitfall — channel-order mismatch. Feeding a FuMa file to an AmbiX decoder (or vice versa) does not error out — it plays, but the sound field is scrambled: left becomes up, front becomes a smear. If a spatial mix sounds "wrong but not broken," suspect an ACN/FuMa or SN3D/maxN normalization mismatch before anything else.
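For first-order material, converting FuMa at the boundary comes down to a reorder plus one gain change. The sketch below assumes the usual first-order relationship between the two conventions: FuMa carries W attenuated by 1/√2 (about 3 dB), while X, Y, and Z have the same gain as in SN3D. It does not cover higher orders, where additional per-channel factors apply, and the function name is hypothetical.

```python
import math

def fuma_to_ambix_foa(frame):
    """Convert one first-order frame from FuMa (W, X, Y, Z) to AmbiX (W, Y, Z, X)."""
    w, x, y, z = frame
    # Undo FuMa's 1/sqrt(2) attenuation on W, then reorder to ACN.
    return (w * math.sqrt(2.0), y, z, x)

fuma_frame = (1.0 / math.sqrt(2.0), 0.3, 0.2, 0.1)   # W, X, Y, Z
print(fuma_to_ambix_foa(fuma_frame))                 # W, Y, Z, X
```

Running a known test signal, such as a source panned hard left, through the converter and checking where it lands after decoding is a quick way to catch the mismatch described in the pitfall.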

Binaural rendering: where ambisonics meets the HRTF

Now the two halves connect. You have a sound field stored in ambisonics and a listener whose ears are described by an HRTF. Binaural rendering is the step that combines them into the two-channel signal the headphones play. Conceptually it works in two moves.

The classic, easy-to-picture method is the virtual loudspeaker approach. Imagine surrounding the listener with a ring or sphere of imaginary speakers. First, decode the ambisonic field to those virtual speaker positions — standard ambisonic decoding, the same math used for real speaker arrays. Then, for each virtual speaker, look up the listener's HRTF for that speaker's direction, filter the speaker's signal through it, and sum all the results into a left and a right channel. The output is two channels that carry the directional cues of the entire field. It is intuitive and correct, but it is also a lot of HRTF convolutions — one pair per virtual speaker, every audio frame.
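The two steps of the virtual loudspeaker approach can be sketched end to end for a horizontal first-order field. Everything specific here is an assumption made for illustration: the four-speaker square ring, the decoder (a virtual cardioid pointed at each speaker rather than a tuned decoder), and the `toy_hrir` filters, which are invented delay-and-gain pairs and not measured HRTFs.

```python
import math

SPEAKER_AZIMUTHS = (45, 135, -135, -45)   # assumed square ring; positive = left
MAX_DELAY = 30                            # toy far-ear delay, in samples

def encode_foa(sample, az_deg):
    """Encode a horizontal source into one AmbiX FOA frame (W, Y, Z, X)."""
    az = math.radians(az_deg)
    return (sample, sample * math.sin(az), 0.0, sample * math.cos(az))

def toy_hrir(az_deg):
    """Made-up (left, right) filter pair: the far ear is later and quieter."""
    side = math.sin(math.radians(az_deg))   # +1 fully left, -1 fully right
    delay = round(abs(side) * MAX_DELAY)
    near = [1.0] + [0.0] * MAX_DELAY
    far = [0.0] * delay + [1.0 - 0.5 * abs(side)] + [0.0] * (MAX_DELAY - delay)
    return (near, far) if side >= 0 else (far, near)

def convolve(signal, impulse_response):
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def render_binaural(foa_frames):
    """Decode FOA to virtual speakers, filter each feed, and sum per ear."""
    length = len(foa_frames) + MAX_DELAY
    left, right = [0.0] * length, [0.0] * length
    for az_deg in SPEAKER_AZIMUTHS:
        az = math.radians(az_deg)
        # Step 1: decode by pointing a virtual cardioid at this speaker.
        feed = [0.5 * (w + y * math.sin(az) + x * math.cos(az))
                for (w, y, z, x) in foa_frames]
        # Step 2: filter the feed through this direction's pair and accumulate.
        hrir_left, hrir_right = toy_hrir(az_deg)
        for i, v in enumerate(convolve(feed, hrir_left)):
            left[i] += v
        for i, v in enumerate(convolve(feed, hrir_right)):
            right[i] += v
    return left, right

def energy(signal):
    return sum(v * v for v in signal)

# A single click placed 90 degrees to the listener's left.
left, right = render_binaural([encode_foa(1.0, 90)])
print(energy(left), energy(right))
```

The cost structure is visible in the inner loop: each virtual speaker adds two more convolutions, which is the overhead the paragraph above describes.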

The method that has become the modern default skips the virtual speakers and works directly in the spherical-harmonic domain: pre-combine the HRTFs into the ambisonic representation once, so rendering the whole field costs only one filter per ambisonic channel per ear, however many directions the HRTF set covers. The dominant technique here is Magnitude Least Squares, abbreviated MagLS. Low-order ambisonics cannot perfectly reproduce the fine high-frequency detail of an HRTF, and trying to match both the level and the fine timing (phase) of the HRTF at high frequencies forces ugly compromises. MagLS exploits a fact from the biology section: above about 1,500 Hz your brain reads level (ILD), not fine phase. So MagLS matches the HRTF's magnitude accurately and discards the high-frequency phase it cannot reproduce, spending the limited low-order budget where the ear actually notices. MagLS is now the de-facto standard for low-order binaural rendering, and active research, including masked MagLS and end-to-end MagLS, keeps refining it (arXiv 2501.18224, Ambisonics Binaural Rendering via Masked Magnitude Least Squares, 2025; DAGA 2023; eMagLS, 2024).

[Figure 2: signal-flow diagram of binaural rendering. An ambisonic sound field on the left flows two ways: the virtual-loudspeaker path decodes to a ring of speakers and then HRTF-filters each one, and the direct spherical-harmonic MagLS path combines the HRTFs once. Both converge on a two-channel headphone output, with a head-rotation arrow feeding a rotation step before rendering.]

Figure 2. Two routes from an ambisonic field to headphones. The virtual-loudspeaker path is intuitive; the direct MagLS path is what ships. Head rotation is applied to the field before rendering, which is why it feels instant.

Head-tracking: the cue that makes it real

The single feature that separates convincing spatial audio from a gimmick is head-tracking — rotating the sound field to compensate when the listener turns their head, so a sound source stays fixed in the room rather than glued to the head. This is where ambisonics earns its keep. Because the field is stored as spherical harmonics, rotating the entire scene is a single matrix operation applied to the channels — fast, exact, and done before the HRTF step. Turn your head 30° to the right and the system rotates the field 30° to the left in the same frame, so the voice that was in front of you stays in front of you.
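For a first-order field and a pure left-right head turn (yaw), that matrix operation only mixes the X and Y channels; W and Z are untouched. The sketch below assumes AmbiX channel order (W, Y, Z, X) and a sign convention in which positive angles point to the listener's left. Both function names are hypothetical, and pitch and roll are left out.

```python
import math

def rotate_foa_yaw(frame, angle_deg):
    """Rotate an AmbiX FOA frame (W, Y, Z, X) about the vertical axis.

    A positive angle moves every source counterclockwise, toward the left.
    """
    w, y, z, x = frame
    a = math.radians(angle_deg)
    return (w,
            y * math.cos(a) + x * math.sin(a),
            z,
            x * math.cos(a) - y * math.sin(a))

def compensate_head_yaw(frame, head_yaw_deg):
    """Counter-rotate the field so sources stay fixed in the room.

    head_yaw_deg: positive when the listener turns left, negative for right.
    """
    return rotate_foa_yaw(frame, -head_yaw_deg)

front_source = (1.0, 0.0, 0.0, 1.0)                  # a source straight ahead
w, y, z, x = compensate_head_yaw(front_source, -30)  # head turns 30 deg right
print(math.degrees(math.atan2(y, x)))                # direction relative to the nose
```

After the listener turns 30° to the right, the source that was straight ahead is rendered about 30° to the left of the nose, which keeps it fixed in the room. Higher orders use larger rotation matrices, but the principle of rotating the channels before the HRTF step is the same.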

Head-tracking matters for two reasons. It dramatically improves externalization — the sense that sound is outside your head — because the brain treats a source that stays put when you move as real. And it resolves the front-back confusion left over from the ITD/ILD cues: a tiny head movement changes the cues differently for a front source than a back source, and the brain uses that to disambiguate. This is exactly what AirPods and similar earbuds do — their motion sensors track head position and the audio is rotated to stay anchored to the iPhone or Mac (Apple Support, 2024–2025). The catch is latency: the rotation must follow the head within a few tens of milliseconds, or the lag itself becomes a cue that the world is fake. Spatial-audio head-tracking targets motion-to-sound latency low enough to stay imperceptible, the same discipline that governs the broader real-time chain in the WebRTC audio pipeline end-to-end.

Putting it together: three real-world stacks

The three building blocks combine differently depending on the product. Three concrete stacks cover most of what ships.

YouTube 360° / VR video. The creator authors a first-order ambisonic mix (AmbiX, ACN/SN3D), optionally with a head-locked stereo pair for narration. YouTube stores the four-channel field and, on playback, rotates it to the viewer's head orientation and binaurally renders it with Google's Resonance Audio HRTFs — the same HRTFs Google open-sourced for VR (YouTube spatial audio spec; SSA Plugins on Resonance Audio HRTFs, 2018). Four channels keeps bandwidth modest, which is why FOA is the web-scale choice even though higher orders localize better.

A VR headset game or experience. Here the engine usually renders objects directly to binaural per source, or uses third-order ambisonics for the ambient bed plus object-based rendering for key sounds — sixteen channels of TOA buy the sharper localization that a head-mounted display's visuals demand. The HRTF is generic by default, with a personalization option growing more common.

Immersive conferencing. Spatial conferencing places each remote participant at a distinct virtual position so your brain can separate overlapping talkers — the "cocktail party" benefit. This typically renders each talker's mono stream through an HRTF for their assigned seat, summed to binaural, rather than carrying a full ambisonic field; it is lighter weight and integrates with the real-time pipeline and mixing covered in audio in SFU vs MCU vs P2P and group calls at scale. Spatializing voices measurably improves intelligibility when several people talk at once, which is the practical payoff for a conferencing product.
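One simple way to give each participant a distinct virtual position is to spread them evenly across an arc in front of the listener. The sketch below is a hypothetical seating policy, not a description of any particular product: the 120° arc and the function name are assumptions made here. Each returned azimuth would then select the HRTF filter pair for that talker's mono stream.

```python
def assign_seats(participant_ids, arc_deg=120.0):
    """Map each participant to an azimuth in degrees (positive = left).

    Participants are spread evenly from +arc/2 to -arc/2 in front of the listener.
    """
    count = len(participant_ids)
    if count == 0:
        return {}
    if count == 1:
        return {participant_ids[0]: 0.0}
    step = arc_deg / (count - 1)
    return {pid: arc_deg / 2 - i * step for i, pid in enumerate(participant_ids)}

print(assign_seats(["ana", "ben", "chi"]))
```

Keeping everyone in the frontal arc avoids relying on the front-back cues that a generic HRTF handles least reliably.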

A worked example: the bandwidth cost of going from FOA to TOA

Suppose you stream a 360° concert and you are deciding between first-order and third-order ambisonics, both as Opus at a typical 64 kbps per channel. The channel counts come straight from (N+1)²:

FOA:  4 channels × 64 kbps =  256 kbps
TOA: 16 channels × 64 kbps = 1,024 kbps

That is a 4× bandwidth increase — 256 kbps versus just over 1 Mbps — to move from coarse to sharp localization. For a web-delivered experience to ordinary earbuds, FOA's coarser field is usually the right call: the visual is the star, the audio supports it, and 256 kbps streams to anyone. For a tethered VR headset where the user's head moves constantly and the audio carries the immersion, TOA's extra megabit is well spent. The decision is not "which is better" — TOA always localizes better — but "is the localization gain worth 4× the bits for this product and this network," the same engineering trade-off explored for the multi-track case in audio storage and CDN cost math.
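The same arithmetic is easy to rerun for other orders or bitrates. This sketch keeps the worked example's simplification of a flat per-channel bitrate (64 kbps by default); a real encoder may allocate bits differently across channels, and the function name is chosen here for illustration.

```python
def stream_kbps(order, kbps_per_channel=64):
    """Total bitrate for an ambisonic stream at a flat per-channel rate."""
    channels = (order + 1) ** 2
    return channels * kbps_per_channel

foa = stream_kbps(1)
toa = stream_kbps(3)
print(f"FOA: {foa} kbps, TOA: {toa} kbps, ratio: {toa / foa:.0f}x")
```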

Where Fora Soft fits in

Spatial audio shows up across the products we build. In AR/VR work, ambisonic beds with head-tracked binaural rendering are what make a scene feel inhabited rather than narrated. In video conferencing and e-learning, placing each speaker at a distinct virtual position helps participants follow overlapping conversation, the same intelligibility win that matters in telemedicine consultations with several people in the room. In OTT and Internet-TV products, binaural rendering is how an immersive mix authored for speakers reaches the large share of viewers who watch on headphones. Across all of these, the recurring engineering questions are the ones this article frames: which ambisonic order, generic or personalized HRTF, and how to keep head-tracking latency low enough to preserve the illusion.

References

  1. SOFA Conventions / Audio Engineering Society — SOFA (Spatially Oriented Format for Acoustics), standardized as AES69-2015, revised as AES69-2020 and AES69-2022. https://www.sofaconventions.org/ (accessed 2026-06-07). Primary standard for HRTF/BRIR file exchange.
  2. YouTube / Google — Use spatial audio in 360-degree and VR videos and the Spatial Audio RFC (google/spatial-media). Requires AmbiX, ACN ordering, SN3D normalization; FOA W,Y,Z,X at 48 kHz; 6-channel head-locked variant. https://support.google.com/youtube/answer/6395969 (accessed 2026-06-07). Primary platform specification.
  3. Wikipedia — Ambisonic data exchange formats (ACN, SN3D, N3D, FuMa; AmbiX convention). https://en.wikipedia.org/wiki/Ambisonic_data_exchange_formats (accessed 2026-06-07). Orientation reference; format conventions cross-checked against YouTube spec and RingBuffer.
  4. RingBuffer — Understanding Ambisonics and Understanding Ambisonics Signals (spherical harmonics, order, (N+1)² channel count, ACN n = ℓ(ℓ+1)+m). https://ringbuffer.org/spatial_audio/ambisonics/ (accessed 2026-06-07).
  5. Wikipedia — Sound localization (duplex theory; ITD vs ILD crossover ~1,500 Hz; pinna spectral cues). https://en.wikipedia.org/wiki/Sound_localization (accessed 2026-06-07). Cross-checked against acoustics references; duplex theory is the standard framework.
  6. Localization-cue research — pinna spectral notches dominate ~4,000–9,000 Hz; reliable elevation localization needs energy above ~7,000 Hz (summarized across peer-reviewed reviews of monaural spectral cues). Accessed 2026-06-07.
  7. A Review on Head-Related Transfer Function Generation for Spatial Audio, MDPI Applied Sciences 14(23):11242 (December 2024). Physical-modeling, anthropometric, and ML approaches to HRTF generation. https://www.mdpi.com/2076-3417/14/23/11242 (accessed 2026-06-07).
  8. Ambisonics Binaural Rendering via Masked Magnitude Least Squares, arXiv:2501.18224 (2025); with Magnitude-Least-Squares Binaural Ambisonic Rendering with Phase Continuation (DAGA 2023) and eMagLS (2024). MagLS as the de-facto low-order binaural renderer; discards high-frequency phase. https://arxiv.org/abs/2501.18224 (accessed 2026-06-07).
  9. Apple — Control Spatial Audio and head tracking (AirPods); Personalized Spatial Audio via iPhone TrueDepth head/ear scan; dynamic head-tracking. https://support.apple.com/guide/airpods/ (accessed 2026-06-07). Production deployment reference.
  10. audioXpress — Dolby Announces Dolby Atmos Personalized Rendering (PHRTF from head/ear photos via cloud). https://audioxpress.com/ (accessed 2026-06-07). Production deployment reference.
  11. SSA Plugins — What Is… Higher Order Ambisonics? and Google Resonance Audio HRTFs (order vs channel count; Resonance Audio HRTFs used by YouTube 360). https://www.ssa-plugins.com/ (accessed 2026-06-07).
  12. Peer-reviewed localization studies on first vs higher-order ambisonic reproduction — localization accuracy improves progressively across orders 1, 3, 5 (AES/academic work). Accessed 2026-06-07.

Note on sources: where popular write-ups described the ITD/ILD crossover loosely, this article follows the duplex-theory standard value (~1,500 Hz) and the published pinna-notch band (4–9 kHz); exact numbers vary by individual. Where YouTube help pages and third-party blogs disagreed on accepted formats, the YouTube spatial-audio specification (Ref. 2) governs.