MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion

About

We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion generation that uses 2D diffusion models trained on motions obtained from in-the-wild videos. As such, MAS opens opportunities in exciting and diverse fields of motion that were previously under-explored because 3D data is scarce and hard to collect. MAS works by simultaneously denoising multiple 2D motion sequences representing different views of the same 3D motion. It ensures consistency across all views at each diffusion step by combining the individual generations into a unified 3D sequence and projecting it back to the original views. We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers, rhythmic gymnastic performances featuring a ball apparatus, and horse races. In each of these domains, 3D motion capture is arduous, yet MAS generates diverse and realistic 3D sequences. Unlike the Score Distillation approach, which optimizes each sample by repeatedly applying small fixes, our method uses a sampling process that was constructed for the diffusion framework. As we demonstrate, MAS avoids common issues such as out-of-domain sampling and mode collapse. Project page: https://guytevet.github.io/mas-page/
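The consistency step described above (combine the per-view 2D generations into one 3D sequence, then project it back to each view) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes orthographic cameras rotated about the vertical axis and a closed-form least-squares triangulation, and the function names (`projection_matrices`, `project`, `triangulate`, `consistency_step`) are hypothetical. The 2D diffusion denoiser itself is left out; the random views below merely stand in for its per-view outputs.

```python
import numpy as np

def projection_matrices(azimuths):
    """Orthographic 2x3 projections for cameras rotated about the vertical (y) axis.

    Assumption for illustration only; the camera model is not specified in the text above.
    """
    c, s = np.cos(azimuths), np.sin(azimuths)
    zeros, ones = np.zeros_like(c), np.ones_like(c)
    rows_u = np.stack([c, zeros, s], axis=-1)         # horizontal image axis
    rows_v = np.stack([zeros, ones, zeros], axis=-1)  # vertical image axis
    return np.stack([rows_u, rows_v], axis=1)         # shape (V, 2, 3)

def project(motion3d, P):
    """Project a (frames, joints, 3) motion to every view: (V, frames, joints, 2)."""
    return np.einsum("vij,fkj->vfki", P, motion3d)

def triangulate(views2d, P):
    """Least-squares 3D motion that best explains all 2D views at once."""
    V, F, J, _ = views2d.shape
    A = P.reshape(V * 2, 3)                                  # stacked camera rows
    b = views2d.transpose(0, 3, 1, 2).reshape(V * 2, F * J)  # matching 2D coordinates
    X, *_ = np.linalg.lstsq(A, b, rcond=None)                # shape (3, F*J)
    return X.T.reshape(F, J, 3)

def consistency_step(views2d, P):
    """Fuse per-view 2D sequences into one 3D motion and re-project it to all views."""
    motion3d = triangulate(views2d, P)
    return motion3d, project(motion3d, P)

# Toy usage: 5 views, 4 frames, 3 joints of independently generated 2D "predictions".
rng = np.random.default_rng(0)
P = projection_matrices(np.linspace(0.0, 2.0 * np.pi, 5, endpoint=False))
inconsistent_views = rng.normal(size=(5, 4, 3, 2))
motion3d, consistent_views = consistency_step(inconsistent_views, P)
```

In a full sampler, a step like this would run once per diffusion iteration, so every view continues from 2D motion that corresponds to a single shared 3D sequence.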

Roy Kapon, Guy Tevet, Daniel Cohen-Or, Amit H. Bermano • 2023

Related benchmarks

Task	Dataset	Result
Text-to-motion generation	HumanML3D (test)	FID 22.056	576
3D Human Motion Recovery	AIST++	PA-MPJPE 155.6	9
Human Pose Lifting	Steezy	J2D Accuracy 106.4	6
Human Pose Lifting	AIST++	MPJPE 191.1	6
Human Pose Lifting	NicoleMove	J2D Error 100.1	6
Unconditional 3D human motion generation	NBA	FID 5.38	5
3D Motion Generation	Human3.6M (All)	FID 15.15	4
3D Motion Generation	NBA All view angles	FID 5.38	3
3D Motion Generation	Human3.6M (Side)	FID 11.94	3
Animal Pose Lifting	CatPlay	J2D 180.6	3

Showing 10 of 12 rows
