TikArt: Stabilizing Aperture-Guided Fine-Grained Visual Reasoning with Reinforcement Learning
About
Fine-grained visual reasoning in multimodal large language models (MLLMs) is bottlenecked by single-pass global image encoding: key evidence often lies in tiny objects, cluttered regions, subtle markings, or dense charts. We present TikArt (Thinking Aperture), an aperture-guided agent that formulates multimodal reasoning as sequential evidence acquisition over regions of interest. TikArt follows a Think-Aperture-Observe (TAO) loop that interleaves language reasoning with two aperture actions: Zoom, which extracts rectangular crops, and Segment, which invokes an off-the-shelf segmenter to produce object-centric mask-based views for irregular targets. A mandatory Observation step after every aperture action writes local evidence back into text, yielding interpretable aperture trajectories and persistent linguistic memory. Built on Qwen3-VL-8B, TikArt is trained with GRPO-style reinforcement learning under a two-stage curriculum. To stabilize long-horizon tool-integrated learning, we introduce Relative Uncertainty Reduction (RUR), a dense reward computed by a frozen evaluator that favors evidence-building trajectories and mitigates degenerate tool use. Experiments on high-resolution reasoning, general multimodal understanding, and both referring and reasoning-oriented segmentation show consistent gains over the backbone, demonstrating that aperture-guided observation improves fine-grained visual reasoning and transfers naturally to pixel-level grounding.
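The Think-Aperture-Observe loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the names `zoom`, `run_tao`, `toy_think`, and `toy_observe` are hypothetical, the image is a nested list rather than a real tensor, the policy is a hand-written stub standing in for the MLLM, and only the Zoom action is modeled (Segment is omitted). It shows just the control flow the abstract states: reason in text, take an aperture action, then always write an observation back into the text memory.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]            # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]    # (left, top, right, bottom), right/bottom exclusive
Step = Tuple[str, str]             # (kind, text) with kind in {"think", "aperture", "observe"}
# A policy returns (thought, box to zoom into or None, final answer or None).
Policy = Callable[[Image, List[Step]], Tuple[str, Optional[Box], Optional[str]]]


def zoom(image: Image, box: Box) -> Image:
    """Zoom action: rectangular crop of `image` inside `box`, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError(f"empty crop for box {box}")
    return [row[left:right] for row in image[top:bottom]]


def run_tao(
    image: Image,
    think: Policy,
    observe: Callable[[Image], str],
    max_steps: int = 4,
) -> Tuple[Optional[str], List[Step]]:
    """Run a toy TAO loop; returns (answer or None, trajectory of text steps)."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        thought, box, answer = think(image, trajectory)
        trajectory.append(("think", thought))
        if answer is not None or box is None:
            return answer, trajectory
        crop = zoom(image, box)
        trajectory.append(("aperture", f"zoom{box}"))
        # Mandatory observation: local evidence is written back as text.
        trajectory.append(("observe", observe(crop)))
    return None, trajectory


def toy_think(image: Image, trajectory: List[Step]):
    """Stub policy (assumption): zoom once into the top-right, then answer."""
    observations = [text for kind, text in trajectory if kind == "observe"]
    if not observations:
        return "The target may be in the top-right region; zoom in.", (2, 0, 4, 2), None
    return "The observation is enough to answer.", None, observations[-1]


def toy_observe(crop: Image) -> str:
    """Stub observer (assumption): describe the crop by its brightest pixel."""
    return f"brightest pixel value in crop: {max(max(row) for row in crop)}"


if __name__ == "__main__":
    image = [[0] * 4 for _ in range(4)]
    image[1][2] = 9
    answer, trajectory = run_tao(image, toy_think, toy_observe)
    for kind, text in trajectory:
        print(f"{kind:>8}: {text}")
    print("answer:", answer)
```

Because every aperture action is followed by an `observe` entry, the trajectory doubles as the persistent text memory: later `think` calls can read earlier observations without revisiting the crop.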
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Understanding | MMStar | -- | 407 |
| Reasoning Segmentation | ReasonSeg (test) | gIoU 73.8 | 236 |
| High-resolution perception | HR-Bench-4K | Overall Score 82.25 | 103 |
| Referring Segmentation | RefCOCO (val) | cIoU 77.1 | 84 |
| High-resolution Visual Understanding | HR-Bench-8K | FSP 89.25 | 83 |
| High-resolution perception | V* | Overall Score 89.53 | 55 |
| Document Visual Question Answering | DocVQA v1.0 (test) | -- | 49 |
| Multimodal Understanding | MME-RealWorld-Lite | Overall Score 56.97 | 34 |
| Tool Use | VerlTool OOD Tools | Attribute 292 | 11 |
| Tool Use | VerlTool IID Tools | Att. 291 | 11 |