Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

About

Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos (IVV), a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground-background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1-5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.

Zhengxian Yang, Shengqi Wang, Shi Pan, Hongshuai Li, Haoxiang Wang, Lin Li, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu • 2026

Related benchmarks

Task	Dataset	Result
Dynamic Light Field Reconstruction	ImViD 300 frames per scene (test)	PSNR 33.51	25
Dynamic Light Field Reconstruction	Google Immersive (test)	PSNR 32.48	20
Dynamic Light Field Reconstruction	MPEG-GSC views (test)	PSNR 36.16	15
Dynamic Light Field Reconstruction	MeetRoom (Discussion)	PSNR 35.01	5
Dynamic Light Field Reconstruction	MeetRoom Trimming	PSNR 33.33	5
Dynamic Light Field Reconstruction	MeetRoom VRheadset	PSNR 32.08	5
Dynamic Light Field Reconstruction	MeetRoom Average	PSNR 33.47	5
Dynamic Light Field Reconstruction	ImViD Scene 1 Opera (test)	PSNR 33.51	5
Dynamic Light Field Reconstruction	ImViD Scene 2 Laboratory (test)	PSNR 31.1	5
Dynamic Light Field Reconstruction	ImViD Scene 5 Rendition (test)	PSNR 27.84	5

Showing 10 of 13 rows
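
Every result in the table is a PSNR (peak signal-to-noise ratio, in dB). As a reading aid, here is a minimal sketch of how PSNR is conventionally computed from a reference image and a rendered image. It is not the authors' evaluation code. The function name `psnr`, the `peak` parameter, and the flat-list pixel layout are assumptions made for illustration. The page does not state the evaluation protocol (peak value, color space, per-frame versus per-video averaging).

```python
import math
from statistics import mean
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float], peak: float = 1.0) -> float:
    """Return PSNR in dB between two equally sized flat pixel sequences.

    `peak` is the maximum possible pixel value (assumed 1.0 for
    normalized images; 255.0 would be used for 8-bit images).
    """
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    # A uniform error of 0.1 on normalized pixels gives about 20 dB.
    print(psnr([0.0, 0.0], [0.1, 0.1]))
    # Arithmetic mean of the three per-scene MeetRoom values from the table.
    print(round(mean([35.01, 33.33, 32.08]), 2))
```

The MeetRoom Average row (33.47) is consistent with the arithmetic mean of the three per-scene MeetRoom values listed above it, which is what the last line of the sketch computes.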
