AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

About

Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a credit assignment strategy that emphasizes early (planning) and late (synthesis) reasoning phases. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes. Furthermore, it surpasses standard GRPO by +3.7 on OmniBench and +1.9 on Video-Holmes, while demonstrating 5× sample efficiency, requiring 80% fewer generated completions to reach target performance.
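The vanishing advantage problem described above can be illustrated with a minimal sketch. It assumes the commonly used group-relative normalization, in which each reward is centered on the group mean and scaled by the group standard deviation. The function name `group_relative_advantages`, the `eps` stabilizer, and the toy reward values are illustrative assumptions and are not taken from the AVATAR implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Center rewards on the group mean and scale by the group std.

    Hypothetical helper: a generic GRPO-style normalization, not AVATAR's code.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A group with diverse rewards yields a usable learning signal.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every completion earns the same reward yields all-zero
# advantages, so the policy gradient contribution from this group vanishes.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

Under this assumption, any group whose completions all receive the same reward contributes nothing to the update, which is the motivation the abstract gives for reusing past experiences with greater reward diversity.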

Yogesh Kulkarni, Pooyan Fazli • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | MVBench | -- | 425 |
| Mathematical Multimodal Reasoning | MathVerse | Accuracy 49.6 | 221 |
| Mathematical Multimodal Reasoning | MathVista | Accuracy 70.4 | 218 |
| Video Understanding | Video-MME | Overall Score 62.8 | 92 |
| Audio-visual understanding | DailyOmni | Average Score 55.7 | 69 |
| Video Understanding | LVBench | Average Score 38.4 | 67 |
| Audio-visual understanding | WorldSense | Accuracy 46 | 42 |
| Video Reasoning | Video-Holmes | Score 45.1 | 34 |
| Multimodal Math Reasoning | MMK12 | Accuracy 57.8 | 24 |
| Audio-visual understanding | IntentBench | Accuracy 63.9 | 20 |
Showing 10 of 15 rows
