DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

About

Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement learning algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) remains underexplored. In this paper, we explore GRPO and identify two issues that hinder effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function as a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as clipping and min operations. This directly aligns the model with the advantages, providing guidance to prefer better outputs. The difficulty-aware data augmentation strategy augments input prompts/videos to target solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
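The "vanishing advantage" issue named in the abstract can be illustrated with a minimal sketch of group-normalized rewards. This is not the authors' implementation: the function name `group_normalized_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Hypothetical helper: each advantage is (reward - group mean) divided by
    (group std + eps). The population std and eps value are assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages of both signs.
mixed = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response earns the same reward (the sample is too easy
# or too hard) yields all-zero advantages, so it carries no learning signal.
uniform = group_normalized_advantages([1.0, 1.0, 1.0, 1.0])
```

Under these assumptions, a group with identical rewards contributes nothing to the update, which is the situation the difficulty-aware augmentation is described as avoiding by steering samples toward solvable difficulty levels with more diverse rewards.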

Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim • 2025

Related benchmarks

Task	Dataset	Result
Multi-modal Video Understanding	MVBench	Accuracy 49.6	83
Multi-modal Video Understanding	VideoMME	Accuracy 51.1	64
Grounded Video Question Answering	NExT-GQA	mIoU 36.8	54

