Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models

About

Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries, eliminating the reliance on modular pipelines (detectors, segmenters, captioners, etc.). A two-stage supervised fine-tuning regimen with search-interleaved supervision teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization, yielding a 19.7% relative gain in accuracy on CRAG-MM compared to whole image retrieval, while retaining competitive reasoning performance on various VQA and text-only QA tasks.
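The retrieval policy described above (emit a `<search>` token, pick a query modality, retrieve, then continue generating) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `generate`, `Query`, `build_query`, `retrieve`, and `step`, the search budget, and the list-of-strings context are all hypothetical stand-ins chosen for illustration. The model is replaced by a scripted token source so the loop runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

SEARCH_TOKEN = "<search>"  # token named in the abstract
EOS_TOKEN = "<eos>"        # assumed end-of-sequence marker (hypothetical)


@dataclass
class Query:
    """Hypothetical retrieval query; modality is 'text', 'image', or 'region'."""
    modality: str
    text: Optional[str] = None
    mask: Optional[List[List[int]]] = None  # binary pixel mask for region queries


def generate(
    step: Callable[[List[str]], str],
    build_query: Callable[[List[str]], Query],
    retrieve: Callable[[Query], List[str]],
    prompt: str,
    max_steps: int = 32,
    max_searches: int = 2,
) -> Tuple[List[str], List[str]]:
    """Search-interleaved generation loop (illustrative sketch only)."""
    context = [prompt]
    output: List[str] = []
    searches = 0
    for _ in range(max_steps):
        token = step(context)
        if token == EOS_TOKEN:
            break
        if token == SEARCH_TOKEN:
            # The model asked for retrieval; honor it only within the budget.
            if searches < max_searches:
                query = build_query(context)
                context.extend(retrieve(query))  # retrieved evidence joins the context
                searches += 1
            continue  # the search token itself is never part of the answer
        context.append(token)
        output.append(token)
    return output, context


def make_scripted_step(tokens: List[str]) -> Callable[[List[str]], str]:
    """Stand-in for a model: replays a fixed token script, then emits EOS."""
    it = iter(tokens)
    return lambda context: next(it, EOS_TOKEN)


if __name__ == "__main__":
    step = make_scripted_step(["The", SEARCH_TOKEN, "answer", EOS_TOKEN])
    answer, ctx = generate(
        step,
        build_query=lambda c: Query("region", mask=[[0, 1], [1, 1]]),
        retrieve=lambda q: [f"[retrieved via {q.modality}] fact"],
        prompt="What is this object?",
    )
    print(answer)
    print(ctx)
```

In this sketch the retrieved passages are appended to the context before generation resumes, so later tokens can condition on them. How PixSearch actually fuses retrieved evidence and encodes mask queries is not specified in the abstract.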

Jeonghwan Kim, Renjie Tao, Sanat Sharma, Jiaqi Wang, Kai Sun, Zhaojiang Lin, Seungwhan Moon, Lambert Mathias, Anuj Kumar, Heng Ji, Xin Luna Dong • 2026

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Referring Expression Segmentation | RefCOCO (testA) | -- | 217
Referring Expression Segmentation | RefCOCO+ (val) | -- | 201
Referring Expression Segmentation | RefCOCO (testB) | -- | 191
Referring Expression Segmentation | RefCOCO (val) | -- | 190
Referring Expression Segmentation | RefCOCO+ (testA) | -- | 190
Referring Expression Segmentation | RefCOCOg (val (U)) | -- | 89
Question Answering | PopQA | EM 31.6 | 80
Question Answering | HotpotQA | EM 26.2 | 79
Multimodal Retrieval-Augmented Generation | CRAG-MM (Overall) | Truthfulness 20.5 | 18
Multimodal Retrieval-Augmented Generation | CRAG-MM Egocentric | Truthfulness -11.7 | 9
Showing 10 of 16 rows
