Unhackable Temporal Rewarding for Scalable Video MLLMs

About

In the pursuit of superior video-processing MLLMs, we have encountered a perplexing paradox: the "anti-scaling law", where more data and larger models lead to worse performance. This study unmasks the culprit: "temporal hacking", a phenomenon where models take shortcuts by fixating on select frames, missing the full video narrative. In this work, we systematically establish a comprehensive theory of temporal hacking, defining it from a reinforcement learning perspective, introducing the Temporal Perplexity (TPL) score to assess this misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework to mitigate temporal hacking. Both theoretically and empirically, TPL proves to be a reliable indicator of temporal modeling quality, correlating strongly with frame activation patterns. Extensive experiments reveal that UTR not only counters temporal hacking but also significantly elevates video comprehension capabilities. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development.

En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao • 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 58.8	635
Video Understanding	VideoMME	--	222
Video Understanding	MVBench (test)	Accuracy 58.8	201
Video Understanding	Video-MME without subtitles	Overall Score 52.6	145
Video Reasoning	Video-MME	Accuracy 52.6	73
Video Reasoning	MVBench	MVBench Score 58.8	56
Video Reasoning	TempCompass	Score 59.7	46
Video Understanding	VideoMME (test)	Overall Score 52.6	45
General Video Understanding	TempCompass	Accuracy 59.7	43
General Video Understanding	MVBench Overall	Accuracy 58.8	39

Showing 10 of 14 rows
