Reinforcement Learning with Robust Rubric Rewards
About
While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are only partially verifiable and demand multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on how accurately they are executed during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), which extends RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through one of two execution paths: an LLM-as-an-extractor paired with a deterministic verifier for verifiable criteria, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm that our deterministic verification and minimal exposure significantly reduce exploitable false positives.
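The criterion-level routing and hierarchical aggregation described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `Criterion` fields, the `extract` and `judge` callables (stand-ins for LLM calls), the exact-match verifier, and the `bonus_weight` gating rule are all assumptions chosen for illustration, since the abstract does not specify them. Mitigation of score saturation within rollout groups is not covered here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Criterion:
    description: str
    essential: bool                 # essential vs. additional criterion
    expected: Optional[str] = None  # ground truth; set only for verifiable criteria


def score_criterion(
    criterion: Criterion,
    response: str,
    extract: Callable[[str, str], str],  # hypothetical LLM extractor
    judge: Callable[[str, str], bool],   # hypothetical LLM judge
) -> bool:
    """Route one criterion through the verifiable or the judged path."""
    if criterion.expected is not None:
        # Minimal exposure: the extractor receives only the criterion text
        # and the response, never the ground truth.
        extracted = extract(criterion.description, response)
        # Deterministic verifier (assumed here: normalized exact match).
        return extracted.strip().lower() == criterion.expected.strip().lower()
    # Minimal exposure: the judge receives text only, never the image.
    return judge(criterion.description, response)


def aggregate(results: list[tuple[Criterion, bool]], bonus_weight: float = 0.2) -> float:
    """Assumed hierarchical rule: additional criteria count only once
    every essential criterion has passed."""
    essential = [ok for c, ok in results if c.essential]
    additional = [ok for c, ok in results if not c.essential]
    ess = sum(essential) / len(essential) if essential else 1.0
    add = sum(additional) / len(additional) if additional else 0.0
    if ess < 1.0:
        return (1.0 - bonus_weight) * ess
    return (1.0 - bonus_weight) + bonus_weight * add
```

With stub callables in place of the LLM calls, a response that satisfies the essential criterion but not the additional one scores below a response that satisfies both, and a response that fails the essential criterion gains nothing from additional criteria.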
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMStar | Accuracy | 76.9 | 407 |
| Chart Question Answering | ChartQA | Accuracy | 91.1 | 371 |
| Visual Question Answering | RealworldQA | Accuracy | 77.6 | 259 |
| Mathematical Reasoning | WeMath | Accuracy | 74.3 | 225 |
| Visual Question Answering | SimpleVQA | Accuracy | 0.546 | 164 |
| Mathematical Reasoning | MathVista mini | Accuracy | 83.9 | 135 |
| Mathematical Reasoning | DynaMath | Accuracy | 78.4 | 127 |
| Document Visual Question Answering | InfoVQA | Accuracy | 0.902 | 85 |
| Mathematical Reasoning | MathVerse mini | Accuracy | 79.6 | 83 |
| Visual Question Answering | countbenchqa | Accuracy | 93.2 | 37 |