See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
About
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
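The two constraints can be illustrated with a minimal Python sketch. The abstract does not state the exact objective, so everything below is an assumption for illustration: the function name `bips_style_loss`, the KL directions, the weights `w_cons` and `w_sep`, and the cap `sep_cap` that keeps the separation term bounded are all hypothetical. The plain lists of logits stand in for a VLM's answer distribution under each view.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bips_style_loss(logits_orig, logits_preserved, logits_ablated,
                    w_cons=1.0, w_sep=1.0, sep_cap=5.0):
    """Hypothetical combination of a consistency and a separation term.

    logits_orig:      predictions given the original image
    logits_preserved: predictions given the evidence-preserving view
    logits_ablated:   predictions given the evidence-ablated view
    """
    p_orig = softmax(logits_orig)
    p_keep = softmax(logits_preserved)
    p_abl = softmax(logits_ablated)

    # Consistency: pull the original-image prediction toward the
    # evidence-preserving view (minimised). Direction is an assumption.
    consistency = kl_divergence(p_keep, p_orig)

    # Separation: push the original-image prediction away from the
    # evidence-ablated view (maximised, hence the minus sign below).
    # The cap is an assumed safeguard against an unbounded objective.
    separation = kl_divergence(p_orig, p_abl)

    loss = w_cons * consistency - w_sep * min(separation, sep_cap)
    return loss, consistency, separation


if __name__ == "__main__":
    # Toy logits over three candidate answers.
    orig = [2.0, 0.5, -1.0]
    preserved = [2.0, 0.5, -1.0]   # evidence kept: same prediction
    ablated = [0.0, 0.0, 0.0]      # evidence removed: uninformative
    print(bips_style_loss(orig, preserved, ablated))
```

In this toy setting the consistency term vanishes when the evidence-preserving view yields the same prediction as the original image, while the separation term grows as the ablated view's prediction drifts away from it. A model that answers from text alone would produce nearly identical distributions for the original and ablated views, so its separation term would stay near zero.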
Related benchmarks
| Task | Dataset | Result (Score) | Rank |
|---|---|---|---|
| Mathematical Reasoning | MathVista | 75 | 322 |
| Chart Understanding and Reasoning | CharXiv | 50.6 | 15 |
| Chart Understanding and Reasoning | Evochart | 68.7 | 14 |
| Chart Understanding and Reasoning | ChartMuseum | 34 | 13 |
| Chart Understanding and Reasoning | ChartQAPro | 51.9 | 12 |
| General Perception and Reasoning | MathVerse VO | 45.3 | 11 |
| General Perception and Reasoning | MMStar | 65.7 | 11 |
| General Perception and Reasoning | MathVision | 28.6 | 10 |