APPO: Attention-guided Perception Policy Optimization for Video Reasoning

About

Complex video reasoning relies far more on fine-grained perception than on expert-level (e.g., Ph.D.-level science) reasoning. Extensive empirical observation shows how critical perception is. In particular, when perception ability is held almost fixed, upgrading the reasoning model from Qwen3-8B to OpenAI-o3 yields only a 0.7% performance improvement. Conversely, even a small change in perception model scale (from 7B to 32B) boosts performance by 1.4%, indicating that enhancing perception, rather than reasoning, matters more for improving performance. It is therefore worthwhile to explore how to enhance perception ability through reasoning without expensive fine-grained annotations. To this end, we propose APPO, the Attention-guided Perception Policy Optimization algorithm, which leverages token-level dense rewards to improve a model's fine-grained perception. The core idea behind APPO is to optimize the tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens). Experimental results on diverse video benchmarks and on models of different scales (3B and 7B) demonstrate that APPO consistently outperforms GRPO and DAPO, by 0.5% to 4%. We hope our work provides a promising, low-cost approach to enhancing a model's perception abilities through reasoning, serving diverse scenarios and demands.

Henghui Du, Chang Zhou, Xi Chen, Di Hu • 2026
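
The abstract's core idea, picking out tokens from different responses that attend to the same crucial frame and giving them a token-level signal, can be illustrated with a rough sketch. This is not the authors' implementation: the function names, the rule for choosing the crucial frame (largest total attention share across the group), the `min_share` threshold, and the `boost` weighting on top of a GRPO-style group-normalized advantage are all assumptions made for illustration.

```python
import numpy as np


def perception_token_masks(attn_per_response, min_share=0.5):
    """Mark tokens in each response that focus on the group's crucial frame.

    attn_per_response: list of arrays, one per sampled response, each of
    shape (num_tokens, num_frames) with non-negative attention weights.
    Hypothetical rule: the crucial frame is the one with the largest total
    attention share across all responses in the group.
    """
    shares = []
    for attn in attn_per_response:
        attn = np.asarray(attn, dtype=float)
        row_sum = np.maximum(attn.sum(axis=1, keepdims=True), 1e-12)
        shares.append(attn / row_sum)  # each token's attention sums to 1

    frame_mass = sum(s.sum(axis=0) for s in shares)
    crucial_frame = int(np.argmax(frame_mass))

    # A token counts as a perception token if the crucial frame is its
    # top frame and takes at least `min_share` of its attention (assumed).
    masks = [
        (s.argmax(axis=1) == crucial_frame) & (s[:, crucial_frame] >= min_share)
        for s in shares
    ]
    return crucial_frame, masks


def token_level_advantages(rewards, masks, boost=1.0):
    """Broadcast group-normalized advantages to tokens, upweighting
    perception tokens by a factor of (1 + boost). Assumed weighting."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return [a * (1.0 + boost * m.astype(float)) for a, m in zip(adv, masks)]


if __name__ == "__main__":
    # Two responses over three frames; rows are tokens, columns are frames.
    group_attn = [
        np.array([[0.1, 0.8, 0.1], [0.6, 0.2, 0.2]]),
        np.array([[0.2, 0.7, 0.1], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]]),
    ]
    frame, masks = perception_token_masks(group_attn)
    advantages = token_level_advantages([1.0, 0.0], masks)
    print(frame, [m.tolist() for m in masks])
    print([a.round(3).tolist() for a in advantages])
```

In this sketch, a token whose attention is spread thinly (below `min_share`) is left at the plain sequence-level advantage even if the crucial frame is its top frame, so only clearly focused tokens receive the denser signal.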

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Understanding | MVBench | Accuracy 64.6 | 425 |
| Multi-modal Video Understanding | MVBench | -- | 70 |
| Video Perception | Perception (test) | Accuracy 66.9 | 57 |
| Grounded Video Question Answering | NExT-GQA | mIoU 32.9 | 44 |
| Video Reasoning | Seed-Bench R1 | Average Answer Score 50.5 | 26 |
| Video Reasoning | SEED-Bench-R1 L1 In-Dist. | Accuracy 50.5 | 16 |
| Video Reasoning | SEED-Bench L2 OOD R1 | Accuracy 51.6 | 16 |
| Video Reasoning | SEED-Bench L3 OOD R1 | Accuracy 49.3 | 16 |
| Video Scene Identification | VSI-Bench | Accuracy 38.2 | 10 |
| Video Scene Interaction | VSI-Bench | Accuracy 32.7 | 6 |
