From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction
About
In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and paired video supervision for multiple human-centric dense tasks is rarely available. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior synthetic pipelines limited to static data, ours provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module applied after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, lets the model first acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show state-of-the-art performance on THuman2.1 and Hi4D and effective generalization to in-the-wild videos.
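The channel reweighting step after feature fusion can be pictured as a squeeze-excite-style gate: pool each channel to a scalar, map it through a small gating function, and rescale the channel by the result. The sketch below is a minimal pure-Python illustration under that assumption; `channel_reweight`, its one-parameter-per-channel gate, and the nested-list tensor layout are all hypothetical simplifications, not the paper's actual module.

```python
import math

def channel_reweight(features, gate_weights, gate_bias):
    """Hedged sketch of SE-style channel reweighting.

    features:     C x H x W nested lists of floats (fused feature map)
    gate_weights: per-channel gate weight (hypothetical 1-param gate)
    gate_bias:    per-channel gate bias
    """
    # Squeeze: global average pool each channel to one scalar
    means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
             for ch in features]
    # Excite: sigmoid gate per channel (the real module is likely a small MLP)
    gates = [1.0 / (1.0 + math.exp(-(w * m + b)))
             for m, w, b in zip(means, gate_weights, gate_bias)]
    # Scale: reweight every value in a channel by its gate
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(features, gates)]
```

With a zero-initialized gate, every channel is scaled by sigmoid(0) = 0.5; training then learns to up- or down-weight channels whose geometry features are more or less reliable.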
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Matting | P3M-500-NP | SAD (Trimap) | 11.88 | 27 |
| Surface Normal Estimation | Hi4D | MAE | 15 | 18 |
| Image Matting | P3M-500-P | SAD | 11.63 | 16 |
| Depth Estimation | Hi4D (test) | RMSE | 0.07 | 15 |
| Depth Estimation | THuman Face 2.1 (test) | RMSE | 0.0147 | 15 |
| Depth Estimation | THuman UpperBody 2.1 (test) | RMSE | 0.0174 | 15 |
| Depth Estimation | THuman FullBody 2.1 (test) | RMSE | 0.0218 | 15 |
| Video Depth Estimation | Hi4D | OPW | 0.007 | 13 |
| Surface Normal Estimation | THuman 2.1 | Mean Angular Error | 16 | 10 |
| Human Matting | PPM-100 | SAD | 70.71 | 6 |
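The Video Depth Estimation entry reports OPW, a temporal-stability score: roughly, how much the prediction changes between frames once the previous frame is warped forward by optical flow. The sketch below is a minimal illustration of that flow-warped consistency idea, assuming integer flow and no occlusion masking; the actual OPW metric uses sub-pixel warping and validity masks, and `temporal_consistency_loss` is a hypothetical name.

```python
def temporal_consistency_loss(depth_prev, depth_curr, flow):
    """Hedged sketch: mean absolute change between the current depth map and
    the previous depth map warped forward by (integer) optical flow.

    depth_prev, depth_curr: H x W nested lists of floats
    flow: H x W list of (dx, dy) integer offsets from frame t-1 to frame t
    """
    h, w = len(depth_curr), len(depth_curr[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x - dx, y - dy  # source pixel in the previous frame
            if 0 <= sx < w and 0 <= sy < h:  # skip pixels warped from outside
                total += abs(depth_curr[y][x] - depth_prev[sy][sx])
                count += 1
    return total / max(count, 1)
```

A perfectly stable video predictor drives this toward 0; per-frame models that flicker under motion score higher, which is the failure mode the sequence-level supervision targets.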