FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
About
We introduce FaceCam, a system that generates video under customizable camera trajectories from a monocular human portrait video input. Recent camera-control approaches built on large video-generation models have shown promising progress, but they often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored, scale-aware representation of camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies, synthetic camera motion and multi-shot stitching, that exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on the Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior camera controllability, visual quality, and identity and motion preservation.
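To make the scale-aware conditioning concrete, the following is a minimal sketch, not the paper's implementation: it assumes the representation amounts to per-frame camera poses expressed relative to the first frame, with translations normalized by an estimated face scale (e.g., inter-ocular distance) so the trajectory is free of metric scale ambiguity. The function name, pose convention, and face-scale source are illustrative assumptions.

```python
import numpy as np

def relative_scale_normalized_poses(cam_to_world, face_scale):
    """Express per-frame camera poses relative to the first frame and
    normalize translations by an estimated face scale, producing a
    scale-free camera trajectory signal for conditioning.

    cam_to_world: (T, 4, 4) per-frame camera-to-world extrinsics.
    face_scale:   scalar face size estimate in the same world units
                  (assumed here; e.g., inter-ocular distance).
    Returns:      (T, 4, 4) relative, scale-normalized transforms.
    """
    ref_inv = np.linalg.inv(cam_to_world[0])                 # world -> reference camera
    rel = np.einsum('ij,tjk->tik', ref_inv, cam_to_world)    # pose of frame t in the reference frame
    rel = rel.copy()
    rel[:, :3, 3] /= face_scale                              # translations in "face units"
    return rel

# Usage example: 16 frames with a slow dolly-in along the camera z-axis.
poses = np.tile(np.eye(4), (16, 1, 1))
poses[:, 2, 3] = np.linspace(0.0, 0.2, 16)                   # 20 cm dolly over the clip
conditioning = relative_scale_normalized_poses(poses, face_scale=0.12)
```

Because the translations are divided by the face scale, the same conditioning sequence describes the same apparent camera motion regardless of how large the subject is in world units, which is one way to obtain a deterministic, scale-unambiguous signal without 3D reconstruction.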
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Camera-controlled Video Generation | Ava-256 (static camera setting) | PSNR 15.85 | 4 |
| Controllable Portrait Video Generation | In-the-wild videos | Camera Correctness 100 | 4 |