
RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

About

Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured, information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both the visual understanding and the reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse-reward problem while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across six benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
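To make the reward design concrete, here is a minimal sketch of a difficulty-aware reward that combines a sparse exact-match reward with dense per-step detail rewards. The `RouteSample` schema, the field names, and the weighting/scaling choices are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class RouteSample:
    """A transit-map question with per-step ground truth (hypothetical schema)."""
    gold_steps: list[str]   # e.g. station/line names along the correct route
    difficulty: float       # in [0, 1]; harder questions get a larger scale

def difficulty_aware_reward(pred_steps: list[str], sample: RouteSample,
                            w_detail: float = 0.5) -> float:
    """Blend a sparse final-answer reward with dense detail rewards.

    The exact-match reward fires only when the whole route is correct
    (sparse); the detail reward grants partial credit for each correctly
    predicted step, so partially right answers still receive a gradient
    signal. The total is scaled up on harder questions. Weights are
    illustrative, not taken from the paper.
    """
    exact = 1.0 if pred_steps == sample.gold_steps else 0.0
    matched = sum(p == g for p, g in zip(pred_steps, sample.gold_steps))
    detail = matched / max(len(sample.gold_steps), 1)
    base = (1.0 - w_detail) * exact + w_detail * detail
    return base * (1.0 + sample.difficulty)  # difficulty-aware scaling
```

Under this sketch, a fully correct route on a difficulty-0.5 question scores 1.5, while a route with two of three steps correct still earns partial credit instead of the zero a pure exact-match reward would give.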

Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, Huan Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|------|---------|--------|--------|------|
| Visual Reasoning | REASONMAP-PLUS | Weighted Accuracy | 74.25 | 16 |
| High-Resolution Visual Reasoning | HRBench | Accuracy | 0.7125 | 16 |
| Visual Reasoning | REASONMAP Short questions | Weighted Accuracy | 0.3151 | 16 |
| Visual Reasoning | REASONMAP Long questions | Weighted Accuracy | 31.77 | 16 |
| General Task | ChartQA | Accuracy | 0.8724 | 8 |
| Fine-grained Visual Reasoning | V* | Accuracy | 80.1 | 8 |
| General Task | MMStar | Accuracy | 62.27 | 8 |
| Spatial Reasoning | SEED-Bench-2-Plus | Accuracy | 61.96 | 7 |
| Spatial Reasoning | SpatialEval | -- | -- | 6 |
