
Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

About

Text-based role-playing models can imitate character styles, yet they often fail to reflect a scene's atmosphere and evolving tension, both essential for immersive applications such as Virtual Reality (VR) games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a decoupled GRPO-based framework that explicitly separates observation ([perception]), reasoning ([think]), and utterance ([answer]). This structure promotes human-like sensory grounding by compelling the model to first attend to visual cues, then form internal interpretations, and finally generate context-appropriate dialogue. EBM-RL integrates four complementary rewards: (i) CLIP-based scene-text alignment to improve ambiance and emotion; (ii) a Perceptual-Cognitive reward that encourages [perception] and [think] processes that increase the likelihood of the reference response; (iii) answer accuracy to ensure faithfulness; and (iv) a dense format reward to enforce the desired structured output. Extensive experiments demonstrate that EBM-RL substantially outperforms text-only role-playing baselines and larger-scale vision-language models on our immersive role-playing benchmark, delivering simultaneous gains in visual-atmosphere consistency and character authenticity. Beyond the role-playing domain, EBM-RL also exhibits strong zero-shot generalization: without any additional fine-tuning, it consistently improves performance on out-of-domain VideoQA benchmarks. We additionally release an open-source dataset for video-grounded role-playing dialogue.
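The abstract does not spell out how the dense format reward is computed or how the four rewards are combined. The sketch below is one possible reading, not the authors' implementation. It assumes each section is introduced by a literal marker such as `[think]` and runs until the next marker, awards partial credit per structural check, and combines the four reward terms with a weighted average. The marker syntax, the checks, the function names, and the equal default weights are all illustrative assumptions.

```python
import re

# Section names taken from the abstract; the literal "[name]" marker syntax is an assumption.
TAGS = ("perception", "think", "answer")
MARKER = re.compile(r"\[(perception|think|answer)\]")


def split_sections(output: str) -> dict:
    """Map each section name to its text (first occurrence only).

    Assumes a section starts at its marker and ends at the next marker
    or at the end of the response.
    """
    matches = list(MARKER.finditer(output))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(output)
        sections.setdefault(match.group(1), output[match.end():end].strip())
    return sections


def format_reward(output: str) -> float:
    """Dense format reward in [0, 1]: one point per satisfied check.

    Checks (hypothetical): each of the three sections is present and
    non-empty, and the markers appear exactly once each in the order
    perception, think, answer.
    """
    sections = split_sections(output)
    order = [m.group(1) for m in MARKER.finditer(output)]
    checks = [bool(sections.get(tag)) for tag in TAGS]
    checks.append(order == list(TAGS))
    return sum(checks) / len(checks)


def total_reward(clip_alignment: float,
                 perceptual_cognitive: float,
                 answer_accuracy: float,
                 fmt: float,
                 weights=(1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted average of the four reward terms (weights are assumed, not from the paper)."""
    terms = (clip_alignment, perceptual_cognitive, answer_accuracy, fmt)
    return sum(w * r for w, r in zip(weights, terms)) / sum(weights)


if __name__ == "__main__":
    response = (
        "[perception] Dim candlelight, rain against the window. "
        "[think] The mood is tense; she expects bad news. "
        "[answer] Stay close to me tonight."
    )
    print(format_reward(response))
    print(format_reward("[answer] Hello."))
```

A dense reward of this kind gives a partially well-formed response a non-zero signal, which may help early in training compared with an all-or-nothing format check. The CLIP alignment, Perceptual-Cognitive, and accuracy terms are passed in as plain numbers here because their computation depends on models and data not described on this page.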

Miao Wang, Yuling Shi, Yijiang Li, Yeheng Chen, Xiaodong Gu, Bin Li, Bo Gao, Yaduan Ruan • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video-grounded Role-playing | Video-grounded Role-playing Dataset Movie Scripts | VEG: 74.25 | 11 |
| Role-Playing Evaluation (Conversational-Naturalness) | CN | Win Rate: 65 | 9 |
| Role-Playing Evaluation (Social-Personality-Consistency) | SPC | Win Rate (SPC): 66 | 9 |
| Role-Playing Evaluation (Visual-Element-Groundedness) | VEG | Win Rate: 65 | 9 |
| Video Question Answering | NExT-QA (OOD) | CH (Accuracy): 72.63 | 2 |
| Video Question Answering | PororoQA 2k samples | Accuracy: 51.45 | 2 |
| Video Question Answering | ActivityNet-QA Y/N | Accuracy: 79.9 | 2 |