SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis

About

Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks, including object editing, novel view synthesis, and camera pose estimation, within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
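
For readers unfamiliar with rectified flow, the sampling idea mentioned in the abstract can be illustrated with a minimal, generic sketch. It is not SplatFlow's multi-view model: the function names (`interpolate`, `toy_velocity`, `euler_sample`) are hypothetical, and the velocity field is a closed-form toy that stands in for a learned network, assuming the data distribution is a single point. It only shows the general technique: points move along straight paths x_t = (1 - t) * x_0 + t * x_1, and samples are drawn by integrating dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data) with Euler steps.

```python
import random


def interpolate(x0, x1, t):
    """Straight-line path used by rectified flow: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def toy_velocity(x, t, target):
    """Toy stand-in for a learned velocity network (an assumption for this sketch).

    When the data distribution is the single point `target`, the ideal
    rectified-flow velocity at (x, t) is (target - x) / (1 - t) for t < 1.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy velocity is well defined
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]  # x_0: Gaussian noise
target = [0.5, -1.0, 2.0, 0.0]                   # x_1: toy "data" point

sample_8 = euler_sample(noise, target, num_steps=8)
sample_1 = euler_sample(noise, target, num_steps=1)
```

Because the paths in this toy setting are exactly straight, both the 8-step and the 1-step integration end at `target` up to floating-point error. A learned velocity field is only approximately straight, so a real sampler would typically use more than one step.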

Hyojun Go, Byeongjun Park, Jiho Jang, Jin-Young Kim, Soonwoo Kwon, Changick Kim • 2024

Related benchmarks

Task	Dataset	Result
Text-to-3D Generation	T³Bench Single Object with Surroundings	BRISQUE 16.8	14
Text-to-3D Generation	DPG-Bench 1.0 (test)	Global Score 69.7	7
Text-to-3D Generation	SceneBench Scene-level	Imaging Score 48.85	7
Text-to-3D Generation	T3Bench Object-centric	Imaging Score 46.09	7
Camera pose estimation	MVImgNet (val)	Rotation Acc @5 deg 28.8	5
Text-to-3D Generation	User Study	Text Alignment Rank 3.38	5
Text-to-3DGS Generation	MVImgNet	FID-10K 34.85	4
Text-to-3DGS Generation	DL3DV	FID (2.4K) 79.91	4
3D object replacement	MVImgNet (val)	CLIPScore 31.3	3
