See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

About

Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
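
The two constraints described above can be illustrated with a toy sketch. The abstract does not give the exact loss, so everything here is an assumption made for illustration: the function names, the KL direction, the hinge margin used for the separation term, and the loss weights are hypothetical and are not taken from the paper. The sketch treats the model output as logits over a small answer set for three views of the same image (original, evidence-preserving, evidence-ablated).

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bips_style_loss(logits_orig, logits_preserved, logits_ablated,
                    margin=1.0, w_cons=1.0, w_sep=1.0):
    """Toy combination of a consistency term and a separation term.

    Assumptions (not from the paper): KL is taken from the original view
    to each masked view, and separation is a hinge that is zero once the
    divergence exceeds `margin`.
    """
    p_orig = softmax(logits_orig)
    p_keep = softmax(logits_preserved)
    p_abl = softmax(logits_ablated)

    # Consistency: predictions on the evidence-preserving view should
    # match predictions on the original image (small KL).
    consistency = kl_divergence(p_orig, p_keep)

    # Separation: predictions on the evidence-ablated view should differ
    # from predictions on the original image (large KL, up to the margin).
    separation = max(0.0, margin - kl_divergence(p_orig, p_abl))

    total = w_cons * consistency + w_sep * separation
    return total, consistency, separation


if __name__ == "__main__":
    total, cons, sep = bips_style_loss(
        logits_orig=[4.0, 0.0, 0.0],
        logits_preserved=[3.5, 0.2, 0.1],
        logits_ablated=[0.5, 0.4, 0.3],
    )
    print(f"total={total:.4f} consistency={cons:.4f} separation={sep:.4f}")
```

In this sketch, a model that answers identically with and without the critical pixels pays the full margin penalty, which is one simple way to discourage text-only shortcuts. A model whose answer distribution shifts strongly once the evidence is masked pays nothing on the separation term.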

Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang • 2025

Related benchmarks

Task                                 Dataset        Metric   Result   Rank
Mathematical Reasoning               MathVista      Score    75       322
Chart Understanding and Reasoning    CharXiv        Score    50.6     15
Chart Understanding and Reasoning    Evochart       Score    68.7     14
Chart Understanding and Reasoning    ChartMuseum    Score    34       13
Chart Understanding and Reasoning    ChartQAPro     Score    51.9     12
General Perception and Reasoning     MathVerse VO   Score    45.3     11
General Perception and Reasoning     MMStar         Score    65.7     11
General Perception and Reasoning     MathVision     Score    28.6     10

