SceneScape: Text-Driven Consistent Scene Generation

About

We present a method for text-driven perpetual view generation: synthesizing long videos of diverse scenes given only an input text prompt describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of achieving 3D consistency, i.e., synthesizing videos that depict geometrically plausible scenes, we apply online test-time training that encourages the predicted depth map of the current frame to be geometrically consistent with the scene synthesized so far. The depth maps are used to build a unified mesh representation of the scene, which is extended progressively as the video is generated. In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles.
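As a rough illustration of how a predicted depth map can be turned into mesh geometry, the sketch below lifts each pixel to a 3D point and connects neighboring pixels with triangles. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`; the function names are hypothetical, and this is not the authors' implementation, which may handle depth discontinuities and mesh merging differently.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) array of camera-space points.

    Assumes a pinhole camera model (an assumption of this sketch).
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def grid_faces(h, w):
    """Return two triangles per pixel quad as indices into the flattened (H*W) vertices."""
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    upper = np.stack([tl, bl, tr], axis=1)
    lower = np.stack([tr, bl, br], axis=1)
    return np.concatenate([upper, lower], axis=0)

# Toy example: a flat 2x3 depth map at distance 2.
depth = np.full((2, 3), 2.0)
points = unproject_depth(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
vertices = points.reshape(-1, 3)
faces = grid_faces(*depth.shape)
```

Each new frame would contribute vertices and faces of this kind; transforming them by the frame's camera pose before appending them to the existing mesh is what makes the representation unified across frames.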

Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel • 2023

Related benchmarks

Task	Dataset	Result
3D Scene Generation	WorldScore	Camera Control: 84.99	33
Panorama Generation	Matterport3D (test)	FID: 42.13	15
3D World Generation	3D World Generation	--	7
3D World Generation	Generated 3D Worlds Rotation & Translation Trajectory	BRISQUE: 55.91	5
3D World Generation	Generated 3D Worlds Rotation Trajectory	BRISQUE: 52.5	5
3D World Generation	Generated 3D Worlds Translation Trajectory	BRISQUE: 44.32	5
Text-driven perpetual scene generation	RealEstate10K indoor videos vs VideoFusion comparison scale	Reprojection Error (px): 0.29	2
Text-driven perpetual scene generation	RealEstate10K indoor videos vs GEN-1 comparison scale filtered subset (110 videos)	Rotation Error (deg): 1.71	2

