From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
About
Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) depends heavily on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and suppress the reward ranking signal that Group-Relative Policy Optimization (GRPO) relies on. To address these challenges and improve noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR that dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse, stochastic output generation, acting as a strong regularizer that prevents premature convergence to noisy labels and preserves sufficient intra-group variation, which enables more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages confident, deterministic outputs, consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones (Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B) spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior external, internal, and entropy-based approaches, delivering robust and superior performance across the board.
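The two-stage schedule described above can be sketched as a sign flip on a token-level entropy regularizer: a positive coefficient early in training (entropy maximization, exploration), switching to a negative one later (entropy minimization, exploitation). The function names, the switch fraction, and the coefficient magnitude below are illustrative assumptions, not the paper's exact schedule.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_coeff(step, total_steps, switch_frac=0.5, coeff=0.01):
    """Two-stage coefficient (hypothetical values): positive before the
    switch point (maximize entropy, exploration), negative after it
    (minimize entropy, exploitation)."""
    return coeff if step < switch_frac * total_steps else -coeff

def entropy_regularizer(per_token_probs, step, total_steps):
    """Entropy term added to the policy objective: mean token-level
    entropy scaled by the stage-dependent coefficient."""
    entropies = [token_entropy(p) for p in per_token_probs]
    mean_entropy = sum(entropies) / len(entropies)
    return entropy_coeff(step, total_steps) * mean_entropy
```

In practice this term would be added to the GRPO objective per rollout; maximizing the regularized objective early keeps group completions diverse so that verifiable rewards still produce a usable ranking, while the later negative coefficient sharpens the policy toward its most confident answers.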
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy | 80.6 | 900 |
| GUI Grounding | ScreenSpot Pro | Accuracy | 21.3 | 163 |
| GUI Grounding | ScreenSpot | Avg Acc | 83.6 | 133 |
| GUI Grounding | OSWorld-G | Average Score | 42.4 | 107 |
| GUI Grounding | ScreenSpot (test) | Element Accuracy | 83.6 | 42 |
| Fine-grained classification | Pets (test) | Accuracy | 70 | 29 |
| GUI Grounding | MMBench-GUI-L2 | Accuracy | 60.6 | 22 |
| Open-vocabulary object detection | COCO (subset) | mAP@0.5 | 19.47 | 13 |
| GUI Grounding | MMBench-GUI L2 (in-domain) | Accuracy | 55 | 13 |