Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models

About

Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries, eliminating the reliance on modular pipelines (detectors, segmenters, captioners, etc.). A two-stage supervised fine-tuning regimen with search-interleaved supervision teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization, yielding a 19.7% relative gain in accuracy on CRAG-MM compared to whole-image retrieval, while retaining competitive reasoning performance on various VQA and text-only QA tasks.
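
The retrieval behaviour described in the abstract (the model decides when to search and which query modality to use) can be illustrated with a toy control loop. This is a minimal sketch, not the authors' implementation: the tag syntax `<search modality="...">`, the `<evidence>` wrapper, and the names `extract_search_calls`, `answer_with_retrieval`, `generate`, and `retrievers` are all assumptions made for illustration, and a region query is represented by a plain string placeholder rather than a real pixel mask.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical tag format; the abstract only says that <search> tokens are emitted.
SEARCH_TAG = re.compile(
    r'<search modality="(text|image|region)">(.*?)</search>', re.S
)


@dataclass
class SearchCall:
    modality: str  # "text", "image", or "region"
    query: str     # query text, or a placeholder standing in for an image/mask


def extract_search_calls(generated: str) -> List[SearchCall]:
    """Find every search request in a chunk of generated text."""
    return [
        SearchCall(m.group(1), m.group(2).strip())
        for m in SEARCH_TAG.finditer(generated)
    ]


def answer_with_retrieval(
    generate: Callable[[str], str],
    retrievers: Dict[str, Callable[[str], str]],
    prompt: str,
    max_rounds: int = 3,
) -> str:
    """Alternate generation and retrieval until no search is requested."""
    context = prompt
    output = ""
    for _ in range(max_rounds):
        output = generate(context)
        calls = extract_search_calls(output)
        if not calls:
            # The model answered without asking for more evidence.
            return output
        # Keep the model's own text, then append one evidence block per call.
        context += "\n" + output
        for call in calls:
            evidence = retrievers[call.modality](call.query)
            context += f"\n<evidence>{evidence}</evidence>"
    # Round budget exhausted: return the last generation as-is.
    return output
```

In this sketch `generate` stands in for the model and each entry of `retrievers` for a modality-specific search backend; the loop only shows how model-issued search requests could be interleaved with generation.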

Jeonghwan Kim, Renjie Tao, Sanat Sharma, Jiaqi Wang, Kai Sun, Zhaojiang Lin, Seungwhan Moon, Lambert Mathias, Anuj Kumar, Heng Ji, Xin Luna Dong • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Expression Segmentation | RefCOCO (testA) | -- | 315 |
| Referring Expression Segmentation | RefCOCO+ (testA) | -- | 288 |
| Referring Expression Segmentation | RefCOCO+ (val) | -- | 272 |
| Referring Expression Segmentation | RefCOCO (val) | -- | 261 |
| Referring Expression Segmentation | RefCOCO (testB) | -- | 259 |
| Question Answering | HotpotQA | EM 26.2 | 173 |
| Question Answering | PopQA | EM 31.6 | 98 |
| Referring Expression Segmentation | RefCOCOg (val (U)) | -- | 95 |
| Visual Question Answering | InfoSeek | -- | 22 |
| Multimodal Retrieval-Augmented Generation | CRAG-MM (Overall) | Truthfulness 20.5 | 18 |
Showing 10 of 16 rows
