
Make Your Training Flexible: Towards Deployment-Efficient Video Models

About

Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets in downstream tasks, which hinders the deployment of the most competitive models in real-world scenarios. We therefore propose a new test setting, Token Optimization, which maximizes input information across budgets by optimizing the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we introduce a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, Flux is easily adopted in most popular video training frameworks, boosting model robustness at nearly no additional cost. We integrate Flux into large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard cost. Notably, using only 1/4 of the tokens, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90% savings. All models and data are available at https://github.com/OpenGVLab/FluxViT.
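The core idea of Token Optimization, selecting a budget-limited subset of tokens from a more densely sampled video, can be sketched as follows. This is an illustrative sketch only, not FluxViT's actual selection criterion; the per-token saliency score (L2 norm here) and the function name are assumptions for demonstration.

```python
import numpy as np

def select_tokens(video_tokens, budget, scores=None):
    """Keep a size-limited set of tokens from a densely sampled video.

    Illustrative sketch: FluxViT's real selector is defined in the paper
    and repo; here a hypothetical per-token L2-norm score stands in as
    the saliency measure.
    """
    if scores is None:
        scores = np.linalg.norm(video_tokens, axis=1)  # assumed saliency proxy
    keep = np.argsort(scores)[::-1][:budget]           # top-`budget` tokens
    return video_tokens[np.sort(keep)]                 # restore original order

# Densely sample 16 frames x 196 patches, then keep a 1/4 token budget.
tokens = np.random.randn(16 * 196, 768)
selected = select_tokens(tokens, budget=(16 * 196) // 4)
print(selected.shape)  # (784, 768)
```

The point of the setting is that, under a fixed token budget, selecting from a denser sampling grid can retain more information than uniformly sampling fewer frames.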

Chenting Wang, Kunchang Li, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | Something-Something v2 (val) | Top-1 Accuracy | 75.6 | 535 |
| Action Recognition | Kinetics-400 | Top-1 Accuracy | 90 | 413 |
| Text-to-Video Retrieval | DiDeMo | R@1 | 0.535 | 360 |
| Text-to-Video Retrieval | MSVD | R@1 | 54.2 | 218 |
| Text-to-Video Retrieval | ActivityNet | R@1 | 0.567 | 197 |
| Text-to-Video Retrieval | LSMDC | R@1 | 25.4 | 154 |
| Text-to-Video Retrieval | MSRVTT | R@1 | 49.9 | 48 |
| Video Classification | COIN (test) | Top-1 Accuracy | 94.1 | 20 |
| Fine-grained Captioning | Dream1k | F1 Score | 29.5 | 11 |
| General Spatiotemporal Perception | MVBench | Score | 49 | 11 |
