
Watch Before You Answer: Learning from Visually Grounded Post-Training

About

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
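The curation idea above, keeping only questions that cannot be answered from text alone, can be sketched roughly as follows. The abstract does not say how VidGround decides whether a question is visually grounded, so this sketch assumes a simple check: a question is dropped if a "blind" answerer, which sees the question and options but never the video, gets it right. The names `VideoQA`, `keep_visually_grounded`, and `toy_blind_answerer` are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass(frozen=True)
class VideoQA:
    """One multiple-choice video question (hypothetical schema)."""
    video_id: str
    question: str
    options: Sequence[str]
    answer: str  # the correct option, verbatim


# Assumed interface: a text-only answerer that never sees the video.
BlindAnswerer = Callable[[str, Sequence[str]], str]


def keep_visually_grounded(
    examples: Iterable[VideoQA],
    blind_answerer: BlindAnswerer,
    n_trials: int = 1,
) -> List[VideoQA]:
    """Keep examples the blind answerer fails on in every trial.

    Assumption: being answerable without the video is a proxy for
    "not visually grounded". The paper's actual criterion may differ.
    """
    kept = []
    for ex in examples:
        solved_blind = any(
            blind_answerer(ex.question, ex.options) == ex.answer
            for _ in range(n_trials)
        )
        if not solved_blind:
            kept.append(ex)
    return kept


def toy_blind_answerer(question: str, options: Sequence[str]) -> str:
    """Stand-in for a text-only model: pick the option that shares the
    most words with the question (ties go to the first option)."""
    q_words = set(question.lower().split())
    return max(options, key=lambda o: len(q_words & set(o.lower().split())))


if __name__ == "__main__":
    data = [
        # The question text leaks the answer, so no video is needed.
        VideoQA("v1", "What color is the red car in the video?", ["red", "blue"], "red"),
        # Nothing in the text hints at the answer.
        VideoQA("v2", "What does the person pick up first?", ["a cup", "a book"], "a book"),
    ]
    for ex in keep_visually_grounded(data, toy_blind_answerer):
        print(ex.video_id, ex.question)
```

In practice the blind answerer would be a language model queried without video input, and `n_trials` could be raised to reduce the chance of keeping a question that the model only misses by sampling noise. The filtered subset would then be used as the post-training set in place of the full dataset.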

Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shenhui Zhang, Ping Nie, Peter West, Kelsey R. Allen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Understanding | MMVU | Accuracy: 65.8 | 76 |
| Video Understanding | VideoMME | -- | 60 |
| Video Understanding | VideoMMMU | Accuracy: 49.4 | 59 |
| Video Understanding | Aggregate Video Benchmarks Suite | Overall Average Score: 59.5 | 28 |
| Video Understanding | Visually Grounded (VG) question | Accuracy: 47.9 | 21 |
