APPO: Attention-guided Perception Policy Optimization for Video Reasoning

About

Complex video reasoning depends more heavily on fine-grained perception than on expert-level (e.g., Ph.D.-level science) reasoning. Through extensive empirical observation, we find that perception has a critical impact on performance. In particular, when perception ability is held almost fixed, upgrading the reasoning model from Qwen3-8B to OpenAI-o3 yields only a 0.7% performance improvement. Conversely, even a modest change in perception model scale (from 7B to 32B) boosts performance by 1.4%, indicating that enhancing perception, rather than reasoning, matters more for improving performance. It is therefore worthwhile to explore how to enhance perception ability through reasoning without expensive fine-grained annotations. To this end, we propose APPO, the Attention-guided Perception Policy Optimization algorithm, which leverages token-level dense rewards to improve a model's fine-grained perception. The core idea behind APPO is to optimize those tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens). Experimental results on diverse video benchmarks and on models of different scales (3B and 7B) demonstrate that APPO consistently outperforms GRPO and DAPO, by 0.5% to 4%. We hope our work provides a promising, low-cost approach to enhancing a model's perception abilities through reasoning, serving diverse scenarios and demands.
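The abstract does not specify how intra-group perception tokens are selected or how their rewards are computed, so the following is only an illustrative sketch of the stated idea, not the authors' implementation. It assumes that each response token comes with an attention distribution over video frames, that the "crucial" frame is the one most often top-attended across the group of sampled responses, and that perception tokens receive a scaled copy of the response-level advantage. The function names and the `bonus` parameter are hypothetical.

```python
from collections import Counter

def top_frame(attn_row):
    """Index of the frame this token attends to most."""
    return max(range(len(attn_row)), key=lambda i: attn_row[i])

def intra_group_perception_mask(group_attn):
    """group_attn[r][t] is the attention of token t in response r over frames.

    Returns (crucial_frame, masks), where masks[r][t] is True when token t of
    response r attends mostly to the crucial frame.
    """
    tops = [[top_frame(row) for row in resp] for resp in group_attn]
    # Assumption: the crucial frame is the one most often top-attended
    # across all responses in the group.
    counts = Counter(f for resp in tops for f in resp)
    crucial = counts.most_common(1)[0][0]
    masks = [[f == crucial for f in resp] for resp in tops]
    return crucial, masks

def token_level_advantages(seq_advantages, masks, bonus=0.5):
    """Turn one advantage per response into one advantage per token.

    Assumption: perception tokens get the response-level advantage scaled by
    (1 + bonus); all other tokens keep it unchanged.
    """
    return [
        [adv * (1.0 + bonus) if is_perc else adv for is_perc in mask]
        for adv, mask in zip(seq_advantages, masks)
    ]

# Toy group: two responses attending over three frames.
group_attn = [
    [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]],
    [[0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]],
]
crucial, masks = intra_group_perception_mask(group_attn)
dense = token_level_advantages([1.0, -1.0], masks)
```

In this toy group, frame 1 is top-attended by three of the five tokens, so those three tokens are treated as intra-group perception tokens and receive the scaled advantage, while the remaining tokens keep the plain response-level value.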

Henghui Du, Chang Zhou, Xi Chen, Di Hu • 2026

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 64.6	563
Multi-modal Video Understanding	MVBench	Accuracy 64.6	83
Video Perception	Perception (test)	Accuracy 66.9	57
Grounded Video Question Answering	NExT-GQA	mIoU 32.9	54
Video Reasoning	Seed-Bench R1	Average Answer Score 50.5	26
Video Reasoning	SEED-Bench-R1 L1 In-Dist.	Accuracy 50.5	16
Video Reasoning	SEED-Bench L2 OOD R1	Accuracy 51.6	16
Video Reasoning	SEED-Bench L3 OOD R1	Accuracy 49.3	16
Video Scene Identification	VSI-Bench	Accuracy 38.2	10
Video Scene Interaction	VSI-Bench	Accuracy 32.7	6
