Spatial Audio in VR and AR Streaming

Why this matters

If you are scoping a VR training app, a 360° live stream, an AR navigation feature, or a "walk around the virtual showroom" experience, the audio decision you make in week one sets your bandwidth bill, your latency budget, and whether users believe the world is real. This article is for a product manager, founder, or operations lead who has to brief an engineering team or judge a vendor's "immersive audio" pitch, and who has never heard of a degree of freedom. It is also written to hold up under a senior audio engineer's review: the channel counts, bitrate ranges, and standard revisions here trace to ISO, 3GPP, W3C, and vendor engineering docs, not marketing pages. By the end you will know when 6DoF is worth the spend, which SDK fits your platform, and the three mistakes that make spatial audio collapse into a flat blob inside the listener's skull.

The one distinction that drives every decision: 3DoF versus 6DoF

Everything in VR and AR audio starts with counting how a head can move. A "degree of freedom" is one independent way the head can change position or orientation. There are exactly six.

The first three are rotations — the ways you can move your head without leaving your chair. Turning left and right is yaw. Nodding up and down is pitch. Tilting ear-toward-shoulder is roll. A system that tracks only these three is called 3DoF, for three degrees of freedom. It knows which way you are facing, but not where you are standing.

The other three are translations — moving your whole body through space. Stepping left or right is one axis, forward or back is the second, crouching or standing is the third. A system that tracks all six — the three rotations plus the three translations — is called 6DoF. It knows where you are and which way you are looking.

Here is why this single distinction governs the whole project. With 3DoF, a sound source stays at the same distance from you no matter what, because you cannot walk toward it. The system only has to rotate the sound field around your head when you turn. That is a cheap operation. With 6DoF, walking toward a sound must make it louder and its direction must shift, walking past it must send it behind you, and a wall between you and it must muffle it. That is a physics simulation, and it is expensive.
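The difference between the two cases can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not any engine's actual code: it works in a flat 2D plane, treats head rotation as yaw only, and uses a plain inverse-distance gain. The function names and the 1 m reference distance are hypothetical choices made for the example.

```python
import math

def relative_azimuth_deg(listener_xy, listener_yaw_deg, source_xy):
    """Angle of the source relative to where the listener is facing.

    Convention assumed for this sketch: x points forward, y points left,
    yaw is counterclockwise, and 0 degrees means straight ahead.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    world_angle = math.degrees(math.atan2(dy, dx))
    # Wrap into [-180, 180) so a source behind the listener reads as -180.
    return (world_angle - listener_yaw_deg + 180.0) % 360.0 - 180.0

def distance_gain(listener_xy, source_xy, ref_distance=1.0):
    """Inverse-distance gain, clamped so it never exceeds 1.0."""
    d = math.dist(listener_xy, source_xy)
    return ref_distance / max(d, ref_distance)

source = (4.0, 0.0)

# 3DoF: the listener stays at the origin and only turns the head.
# Direction changes, distance (and so gain) does not.
print(relative_azimuth_deg((0.0, 0.0), 90.0, source), distance_gain((0.0, 0.0), source))

# 6DoF: the listener walks toward and then past the source.
# Both gain and direction must be recomputed on every move.
print(relative_azimuth_deg((2.0, 0.0), 0.0, source), distance_gain((2.0, 0.0), source))
print(relative_azimuth_deg((6.0, 0.0), 0.0, source), distance_gain((6.0, 0.0), source))
```

In the 3DoF case only the angle changes. In the 6DoF case the gain rises as the listener approaches, and the source ends up behind the listener once they walk past it. Occlusion and reflections, which this sketch leaves out, are where the real cost of 6DoF comes from.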

A concrete rule, drawn directly from the standards work: where the video gives the user 6DoF, the audio must give 6DoF too, or the experience breaks. The MPEG-I requirements state it plainly — if the picture lets you walk around but the sound does not follow, you hear a talker from a different direction than where you see them, which is more distracting than no spatial audio at all. Match the audio's degrees of freedom to the video's.

Figure 1. The six degrees of freedom. 3DoF tracks the three rotations and rotates the sound field; 6DoF adds the three translations and must recompute distance, direction, and occlusion as the listener moves.

What "spatial audio" actually has to compute

Strip away the marketing and every VR/AR audio engine does the same three jobs in sequence. Understanding them is what lets you read a vendor's feature list and know what is missing.

The first job is placement — deciding where each sound sits relative to the listener, and adjusting it as the listener turns or moves. For headphones, this means turning a direction into the two ear signals that direction would produce, using a measurement of how your head and ears reshape sound called the head-related transfer function, or HRTF. We covered the HRTF, ambisonics, and binaural rendering in depth in ambisonics, HRTF, and binaural rendering; here we only need the result: every spatial-audio engine applies an HRTF to make a flat pair of earbuds sound like a 3D room.

The second job is room acoustics — the reflections and reverberation that tell your brain how big the space is and how far away a sound is. A voice with no reflections sounds dry and "in your head." Add early reflections (the first bounces off nearby walls) and late reverberation (the diffuse tail), and the same voice snaps into a room at a believable distance. Meta's own audio documentation makes the point directly: HRTFs give strong directional cues, but without room effects sounds "often sound dry and lifeless," and reflections are a primary distance cue.
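One way to see why reflections carry distance information is the ratio of direct sound to reverberant sound. The Python sketch below is an idealized model, not a measurement or any SDK's algorithm: it assumes the direct sound falls off as 1/distance while the diffuse reverb level is the same everywhere in the room. The function name and the default `reverb_gain` of 0.1 are hypothetical values chosen for illustration.

```python
import math

def direct_to_reverb_db(distance_m, reverb_gain=0.1, ref_distance_m=1.0):
    """Direct-to-reverberant ratio in dB for an idealized room.

    Assumptions: direct sound falls as 1/distance beyond ref_distance_m,
    and the diffuse reverb level is constant throughout the room.
    """
    direct_gain = ref_distance_m / max(distance_m, ref_distance_m)
    return 20.0 * math.log10(direct_gain / reverb_gain)

for d in (1.0, 2.0, 5.0, 10.0):
    print(f"{d:>4.1f} m: {direct_to_reverb_db(d):5.1f} dB")
```

Under these assumptions the ratio drops steadily as the source moves away, which gives the listener a cue that changes with distance. With no reverb at all there is no such ratio to hear, which is consistent with dry sources sounding stuck at the head.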

The third job is rendering — collapsing the whole scene down to the two channels your headphones actually play (or, more rarely, to a speaker layout). This is where the ambisonic sound field and the HRTF meet. A clever optimization, used by Resonance Audio, is to mix every sound source into one shared higher-order ambisonic field first, then apply the HRTF once to the whole field instead of once per source. That is why Resonance can spatialize hundreds of sources on a phone: the expensive HRTF step runs a fixed number of times regardless of how many sounds are playing.
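The shared-field idea can be sketched in Python at first order. This is a simplified illustration, not Resonance Audio's implementation, which uses a higher-order field and its own decoder. The sketch assumes AmbiX-style channel order (W, Y, Z, X) with SN3D weights and azimuth measured counterclockwise from straight ahead; all function names are hypothetical.

```python
import math

def encode_foa(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one mono sample into first-order ambisonics (W, Y, Z, X)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        sample,                                # W: omnidirectional
        sample * math.sin(az) * math.cos(el),  # Y: left-right
        sample * math.sin(el),                 # Z: up-down
        sample * math.cos(az) * math.cos(el),  # X: front-back
    )

def mix_into_field(sources):
    """Sum any number of (sample, azimuth, elevation) sources into one field."""
    field = [0.0, 0.0, 0.0, 0.0]
    for sample, az, el in sources:
        for i, value in enumerate(encode_foa(sample, az, el)):
            field[i] += value
    return field

def rotate_yaw(field, head_yaw_deg):
    """Counter-rotate the whole field when the head turns left by head_yaw_deg."""
    w, y, z, x = field
    c = math.cos(math.radians(head_yaw_deg))
    s = math.sin(math.radians(head_yaw_deg))
    return [w, y * c - x * s, z, x * c + y * s]

# 300 sources still collapse into 4 channels, so the binaural (HRTF) stage
# that follows processes 4 channels no matter how many sources are playing.
many = [(0.01, float(i), 0.0) for i in range(300)]
print(len(mix_into_field(many)))

# A source directly to the left (90 degrees) moves to the front after the
# listener turns the head 90 degrees to the left.
left = mix_into_field([(1.0, 90.0, 0.0)])
print([round(v, 6) for v in rotate_yaw(left, 90.0)])
```

The per-source work is a handful of multiplications. The expensive filtering happens once on the mixed field, and a head turn only needs the small rotation in `rotate_yaw`, not a re-encode of every source.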

Pitfall — shipping placement without room acoustics. The single most common immersive-audio mistake is spatializing sources with an HRTF, hearing correct left-right-up-down, and declaring victory. Without early reflections and reverb, distance is unreadable and everything feels glued to the listener's head. Budget for room acoustics in week one, not as a "phase two" polish item.

The standards that just changed the landscape

For most of the last decade, 6DoF audio had no single standard — each platform did its own thing. That changed in late 2025.

MPEG-I Immersive Audio (ISO/IEC 23090-4:2025)

The big one is ISO/IEC 23090-4:2025, the standard MPEG calls MPEG-I Immersive Audio, published as a full International Standard on 3 November 2025 (a 625-page document; ISO/IEC 23090-4:2025). Its abstract is exact about scope: it "specifies technology that supports the real-time interactive rendering of an immersive virtual or augmented reality audio presentation while permitting the user to have 6DoF movement in the audio scene," and it defines both the metadata for that rendering and a bitstream syntax for efficient storage and streaming.

What it adds over the previous generation matters for anyone choosing a format. The earlier MPEG-H 3D Audio standard supported only 3DoF — head rotation at a fixed "sweet spot." MPEG-I builds on MPEG-H and extends it to full 6DoF: the listener can move through the scene (x, y, z), and the renderer simulates real acoustic physics — reverberation, early reflections, occlusion (a wall muffling a sound), diffraction (sound bending around an edge), and even the Doppler shift of a moving source (MPEG-I Immersive Audio overview, MPEG WG 6). It reached Final Draft International Standard at the 149th MPEG meeting in Geneva in January 2025 and was published that November; a verification-test summary followed at MPEG 153 in January 2026.

IVAS: spatial audio over a phone call (3GPP Release 18)

The second new standard targets real-time communication rather than streamed content. IVAS (Immersive Voice and Audio Services) is a 3GPP codec available from Release 18, built on and backward-compatible with EVS (Enhanced Voice Services), the speech codec 3GPP had already standardized for mobile voice calls. IVAS carries spatial audio over a mobile network: mono, stereo, multichannel, ambisonics, and object-based audio, plus a parametric format called MASA (3GPP IVAS; Fraunhofer IIS IVAS pages).

The number to remember is its bitrate range: 13.2 to 512 kbit/s for immersive modes. A parametric mode can transmit three or four freely-placed sound objects at just 24.4 or 32 kbit/s and reconstruct their spatial positions at the receiver. IVAS binauralizes on the receiving device using head-related impulse responses, supports head-tracking and listener-orientation, and adds room acoustics through binaural room impulse responses or synthesized reflections — which makes it the natural fit for spatial audio in the WebRTC and real-time pipeline once carriers and devices ship it.

The toolkits you will actually use

You will rarely implement HRTF math yourself. You will pick an SDK. Here are the five that matter in 2026, and what each is for.

Google Resonance Audio is the cross-platform, open-source workhorse. It runs on Android, iOS, Unity, Unreal, FMOD, Wwise, and — uniquely useful for streaming — the web via the Web Audio API. It projects all sources into a shared higher-order ambisonic field for cheap per-source cost, supports adjustable ambisonic order, occlusion, near-field effects, and geometry-based reverb (Resonance Audio developer docs, Google). It is the default choice when you need one audio approach across native apps and the browser.

Valve's Steam Audio is the choice when physical accuracy is the point — architectural walkthroughs, high-end VR games. It physically models sound reflections off scene geometry and can render reflections in third-order ambisonics, with optional GPU acceleration (Steam Audio; AMD TrueAudio Next). More simulation, more CPU, sharper realism.

The Meta XR Audio SDK is the native path for Meta Quest. It provides HRTF-based object and ambisonic spatialization plus room-acoustics simulation, with integrations for Unity, Unreal, FMOD, and Wwise (Meta Horizon OS developer docs). Note for planning: as of 2025 the SDK is on feature freeze at version 85.0 — stable and shipping, but not actively gaining features.

Apple's PHASE — the Physical Audio Spatialization Engine — is the native path on Apple platforms including Vision Pro. PHASE lets you describe sound sources as geometric shapes and supports direct-path transmission, early reflections, and late reverb; on visionOS a ReverbComponent with presets blends the room's real acoustics with a chosen reverb based on immersion level (Apple Developer, PHASE/WWDC and visionOS docs). Use it when you are all-in on Apple's spatial-computing stack.

The Web Audio API is the browser-native baseline, and the one streaming teams reach for first. Its PannerNode positions a source in 3D and, set to panningModel: "HRTF", applies HRTF filtering relative to an AudioListener you move and rotate as the head moves (W3C Web Audio API; MDN). It is built into every modern browser, needs no install, and pairs with WebXR for in-browser VR. Independent research found PannerNode-with-HRTF placed targets more accurately than some library alternatives in a browser test — a reminder that the built-in path is genuinely usable, not a fallback.

SDK / API	Platforms	Best for	Room acoustics	Notes
Resonance Audio	Android, iOS, Unity, Unreal, Web	One approach everywhere	Geometry-based reverb	Shared-ambisonic field → cheap per source
Steam Audio	Windows, Unity, Unreal	Physical accuracy, VR games	Physically-modeled reflections	TOA reflections; optional GPU
Meta XR Audio SDK	Meta Quest (Unity/Unreal/FMOD/Wwise)	Native Quest apps	HRTF + room simulation	Feature-frozen at v85.0 (2025)
Apple PHASE	iOS, macOS, visionOS	Apple spatial computing	Direct + reflections + reverb	`ReverbComponent` presets in visionOS
Web Audio API	Every modern browser	Streaming, WebXR, no install	Manual (convolver/reverb)	`PannerNode` `panningModel:"HRTF"`

Table 1. The five spatial-audio toolkits for VR/AR in 2026. Sources: Resonance Audio docs (Google); Steam Audio (Valve/AMD); Meta Horizon OS docs; Apple Developer (PHASE, visionOS); W3C Web Audio API + MDN.

The bandwidth question: what spatial audio costs to stream

Spatial audio is not free on the wire, and the cost depends entirely on the format you transmit.

For streamed 360° video the common path is first-order ambisonics, abbreviated FOA, which uses four channels (W, Y, Z, X). YouTube's spatial-audio specification, for example, accepts FOA as a 4-channel track at 48 kHz, optionally plus a 2-channel head-locked stereo track for narration or music that should not rotate with the head (covered in detail in ambisonics, HRTF, and binaural rendering). Four channels is the floor for "sound from any direction." An ambisonic field of order N needs (N + 1)² channels, so third-order ambisonics, abbreviated TOA, needs 16: sharper localization at four times the channel count.

Let us put a number on it. Take FOA encoded with the Opus codec at a modest per-channel rate:

FOA channels      = 4
per-channel rate  = 64 kbps
total             = 4 × 64 = 256 kbps

Now the third-order case for the same per-channel rate:

TOA channels      = 16
per-channel rate  = 64 kbps
total             = 16 × 64 = 1,024 kbps ≈ 1 Mbps

So moving from coarse to sharp spatial resolution is a four-fold jump in audio bandwidth, from roughly a quarter-megabit to a full megabit per second. For a live 360° stream served to thousands of viewers, that audio tax is real money — the same multi-track storage and delivery math we work through in storage and CDN math for audio. For real-time communication, IVAS's parametric object mode is the counterpoint: three or four placed objects at 24.4–32 kbit/s, because it transmits parameters rather than full channels.
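The arithmetic above generalizes to any ambisonic order. The Python sketch below reproduces it and adds a per-viewer data volume. It assumes the same flat 64 kbps per channel used in the worked figures, which is a planning simplification: real encoders do not necessarily split bits evenly across channels. The function names are hypothetical.

```python
def ambisonic_channels(order):
    """Channel count for a full ambisonic field of the given order."""
    return (order + 1) ** 2

def stream_kbps(order, per_channel_kbps=64):
    """Total audio bitrate, assuming a flat per-channel rate."""
    return ambisonic_channels(order) * per_channel_kbps

def megabytes_per_hour(kbps):
    """Data volume one viewer pulls in an hour (decimal megabytes)."""
    # kilobits/s -> bits/hour -> bytes -> megabytes
    return kbps * 1000 * 3600 / 8 / 1_000_000

for name, order in (("FOA", 1), ("TOA", 3)):
    kbps = stream_kbps(order)
    print(f"{name}: {ambisonic_channels(order)} ch, {kbps} kbps, "
          f"{megabytes_per_hour(kbps):.1f} MB per viewer-hour")

# IVAS parametric object mode, using the upper figure quoted in this article.
print(f"FOA vs IVAS object mode at 32 kbit/s: {stream_kbps(1) / 32:.0f}x")
```

Multiply the per-viewer-hour figure by expected audience and session length to get a first estimate of the audio share of delivery volume.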

The planning rule: stream FOA (≈256 kbps) unless your content genuinely needs higher-order precision, and reserve 6DoF object-based or MPEG-I delivery for experiences where the user can move.

A decision you can make in five minutes

Put the pieces together and the choice is usually quick. Ask three questions in order.

First: can the user physically move through the scene, or only look around? If they can only look — seated VR, a 360° video, a fixed AR overlay — you need 3DoF, and FOA over a standard codec is enough. If they can walk — room-scale VR, a Quest game, a "tour the building" AR app — you need 6DoF, and you are in object-based or MPEG-I Immersive Audio territory.

Second: what platform are you on? Browser and cross-platform → Web Audio API or Resonance Audio. Meta Quest → Meta XR Audio SDK. Apple Vision Pro → PHASE. Physically-accurate VR game → Steam Audio. Phone-call-grade real-time spatial voice → IVAS, once devices support it.

Third: do you need distance to be believable? If yes — and for any walk-around experience the answer is yes — you must enable room acoustics (early reflections plus reverb), not just HRTF placement. This is the step teams skip and then wonder why their audio feels flat.

Figure 2. A five-minute decision tree. Movement decides 3DoF versus 6DoF; platform decides the SDK; distance realism decides whether room acoustics are mandatory.
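The three questions can be written down as a small lookup, which is handy for a scoping checklist or an intake form. The Python sketch below encodes only the rules stated in this section. The function name, the platform keys, and the shape of the returned dictionary are hypothetical.

```python
SDK_BY_PLATFORM = {
    "browser": "Web Audio API or Resonance Audio",
    "cross-platform": "Web Audio API or Resonance Audio",
    "meta-quest": "Meta XR Audio SDK",
    "apple-vision-pro": "Apple PHASE",
    "physically-accurate-vr": "Steam Audio",
    "realtime-voice": "IVAS (once devices support it)",
}

def recommend(can_move, platform, needs_believable_distance):
    """Apply the three questions in order and return a recommendation."""
    if platform not in SDK_BY_PLATFORM:
        raise ValueError(f"no recommendation for platform: {platform!r}")
    return {
        # Question 1: movement decides 3DoF versus 6DoF and the format.
        "dof": "6DoF" if can_move else "3DoF",
        "format": (
            "object-based or MPEG-I Immersive Audio"
            if can_move
            else "FOA over a standard codec"
        ),
        # Question 2: platform decides the SDK.
        "sdk": SDK_BY_PLATFORM[platform],
        # Question 3: any walk-around experience needs room acoustics.
        "room_acoustics": can_move or needs_believable_distance,
    }

print(recommend(can_move=False, platform="browser", needs_believable_distance=False))
print(recommend(can_move=True, platform="meta-quest", needs_believable_distance=False))
```

Note that the second call still returns `room_acoustics: True` even though the caller answered no to the distance question, because a walk-around experience always needs believable distance.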

Where Fora Soft fits in

Fora Soft has built video streaming, WebRTC conferencing, and AR/VR software since 2005, and audio is where immersive projects most often go wrong before launch. In VR training and 360° streaming work, the recurring task is matching the audio's degrees of freedom to the video's — 3DoF audio for a 360° lecture, full positional audio for a walk-around simulation — and wiring head-tracking through cleanly so a turned head actually moves the sound. In browser-based and conferencing contexts we lean on the Web Audio API and Resonance Audio because they ship everywhere without a plugin; in native Quest and Vision Pro work we use the platform SDKs. The single most valuable thing we do early is verify the end-to-end head-tracking and room-acoustics chain on real hardware, because that is the failure mode users notice in the first ten seconds.

Common mistakes that break immersive audio

Three recurring failures account for most "the spatial audio feels broken" reports.

The first is mismatched degrees of freedom — 6DoF video with 3DoF audio. The user walks toward an object, the picture grows, the sound does not move, and the brain rejects the whole scene. Match audio DoF to video DoF.

The second is no room acoustics — HRTF placement with no reflections or reverb. Direction is right but distance is unreadable and everything sounds "in your head." Add early reflections and late reverb.

The third is broken head-tracking — the renderer is correct but the head pose never reaches it, or reaches it with too much latency, so turning your head does not move the world. Always test the full chain on the target device, not just the renderer in isolation. (For the related lip-sync timing problem in WebRTC, see lip-sync in WebRTC.)

Call to action

Talk to an audio engineer: book a 30-minute scoping call to talk through your VR and AR spatial audio plan.
See our case studies: 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
Download the VR/AR Spatial Audio Decision Sheet: a one-page sheet covering 3DoF vs 6DoF, the SDK by platform, the FOA/TOA bitrate math, the MPEG-I and IVAS standards, and a pre-ship checklist.

References

ISO/IEC 23090-4:2025, Information technology — Coded representation of immersive media — Part 4: MPEG-I immersive audio. Published 2025-11-03, Edition 1, 625 pp. Abstract: real-time interactive 6DoF rendering of immersive VR/AR audio, with metadata and bitstream syntax for streaming. https://www.iso.org/standard/84711.html — tier-1 primary standard. The full normative text is paywalled (CHF 227); scope and lifecycle read directly from the ISO catalogue page.
MPEG WG 6 — MPEG-I Immersive Audio overview, mpeg.org. Confirms MPEG-I extends MPEG-H 3D Audio from 3DoF to 6DoF; lists modeled acoustics: reverberation, early reflections, occlusion, diffraction, Doppler. FDIS at MPEG 149 (Geneva, Jan 2025); verification-test summary at MPEG 153 (Jan 2026). https://www.mpeg.org/standards/MPEG-I/4/ — tier-1 standards-body source.
3GPP — IVAS (Immersive Voice and Audio Services), technologies pages. Release 18 codec, built on EVS; formats mono/stereo/multichannel/ambisonics/object/MASA. https://www.3gpp.org/technologies/ivas-2023 — tier-1 standards-body source.
Fraunhofer IIS — Immersive Voice and Audio Services (IVAS). Bitrate range 13.2–512 kbit/s; parametric object coding of 3–4 objects at 24.4/32 kbit/s; receiver-side binauralization with head-tracking and room acoustics. https://www.iis.fraunhofer.de/en/ff/amm/communication/ivas.html — tier-3 first-party (codec maker).
W3C — Web Audio API (Recommendation). PannerNode, panningModel: "HRTF", AudioListener position/orientation. https://www.w3.org/TR/webaudio/ — tier-1 W3C Recommendation.
MDN Web Docs — Web audio spatialization basics. Practical PannerNode/AudioListener spatialization, HRTF panning model. https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API/Web_audio_spatialization_basics — tier-6 educational, used for orientation only.
Google — Resonance Audio developer documentation. Cross-platform (incl. Web); shared higher-order ambisonic field applies HRTF once to the field; adjustable order; occlusion, near-field, geometry-based reverb. https://resonance-audio.github.io/resonance-audio/develop/overview.html — tier-4 first-party deployer.
Meta — Horizon OS / Meta XR Audio SDK documentation. HRTF object + ambisonic spatialization, room acoustics; reflections/reverb as distance cues; feature freeze at v85.0 (2025). https://developers.meta.com/horizon/documentation/unity/meta-xr-audio-sdk-features/ — tier-4 first-party deployer.
Apple Developer — PHASE (Physical Audio Spatialization Engine), WWDC21; visionOS spatial audio / RealityKit reverb. Direct path, early reflections, late reverb; ReverbComponent presets blend real and preset acoustics by immersion level. https://developer.apple.com/videos/play/wwdc2021/10079/ — tier-4 first-party deployer.
Valve / AMD GPUOpen — Steam Audio with TrueAudio Next. Physically-modeled reflections; third-order ambisonics reflections with optional GPU acceleration. https://gpuopen.com/news/beyond-spatial-audio-trueaudio-next-acceleration-steam-audio-sound-reflections-third-order-ambisonics-demo-video/ — tier-4 first-party deployer.
AES69 / SOFA Conventions — Spatially Oriented Format for Acoustics; HRTF interchange (AES69-2015, reaffirmed 2020/2022). Context for HRTF portability across these SDKs. https://www.sofaconventions.org/ — tier-1 AES standard (context).
"How to Spatial Audio with the WebXR API," IEEE conference paper, 2023 — comparative browser test in which PannerNode-with-HRTF outperformed some library alternatives on target-identification accuracy. https://ieeexplore.ieee.org/document/10289525/ — tier-5 peer-reviewed.

Related glossary terms

HRTF

Ambisonics

Channel

Binaural rendering

Bitrate

MPEG-H 3D Audio

Stereo

Lip-sync