Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models

About

Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries, eliminating the reliance on modular pipelines (detectors, segmenters, captioners, etc.). A two-stage supervised fine-tuning regimen with search-interleaved supervision teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization, yielding a 19.7% relative gain in accuracy on CRAG-MM compared to whole image retrieval, while retaining competitive reasoning performance on various VQA and text-only QA tasks.
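
The search-interleaved generation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Query` type, `generate_with_search`, `toy_step`, `toy_retrieve`, and the `<eos>` end marker are hypothetical stand-ins, and the "model" is a scripted toy. The sketch only shows the control flow in which a `<search>` token pauses generation, a query with a chosen modality (text, image, or region) is sent to a retriever, and the retrieved evidence is appended to the context before generation resumes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

SEARCH_TOKEN = "<search>"  # token name taken from the abstract
EOS_TOKEN = "<eos>"        # hypothetical end-of-sequence marker


@dataclass
class Query:
    """Hypothetical retrieval query emitted alongside a <search> token."""
    modality: str  # "text", "image", or "region"
    payload: str   # e.g. query text or an identifier for a predicted mask


# A step function maps the current context to (next token, optional query).
StepFn = Callable[[List[str]], Tuple[str, Optional[Query]]]


def generate_with_search(step: StepFn,
                         retrieve: Callable[[Query], str],
                         max_steps: int = 8) -> List[str]:
    """Run a generation loop that interleaves retrieval with token emission."""
    context: List[str] = []
    for _ in range(max_steps):
        token, query = step(context)
        if token == SEARCH_TOKEN and query is not None:
            # Retrieval is triggered by the model itself, not by a fixed pipeline.
            evidence = retrieve(query)
            context.append(f"[retrieved:{query.modality}] {evidence}")
            continue
        context.append(token)
        if token == EOS_TOKEN:
            break
    return context


def toy_step(context: List[str]) -> Tuple[str, Optional[Query]]:
    """Scripted stand-in for a model: search once with a region query, then answer."""
    if not any(item.startswith("[retrieved") for item in context):
        return SEARCH_TOKEN, Query(modality="region", payload="mask_0")
    if context[-1].startswith("[retrieved"):
        return "answer", None
    return EOS_TOKEN, None


def toy_retrieve(query: Query) -> str:
    """Stand-in retriever that echoes the query payload."""
    return f"facts for {query.payload}"


if __name__ == "__main__":
    print(generate_with_search(toy_step, toy_retrieve))
```

In this toy run the scripted model searches once using a region query, receives the evidence, and then emits its answer. A real system would replace `toy_step` with model decoding and `toy_retrieve` with an actual retrieval backend.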

Jeonghwan Kim, Renjie Tao, Sanat Sharma, Jiaqi Wang, Kai Sun, Zhaojiang Lin, Seungwhan Moon, Lambert Mathias, Anuj Kumar, Heng Ji, Xin Luna Dong • 2026

Related benchmarks

Task                                       Dataset             Result             Rank
Referring Expression Segmentation          RefCOCO (testA)     --                 257
Referring Expression Segmentation          RefCOCO+ (testA)    --                 230
Referring Expression Segmentation          RefCOCO+ (val)      --                 223
Referring Expression Segmentation          RefCOCO (testB)     --                 213
Referring Expression Segmentation          RefCOCO (val)       --                 212
Question Answering                         HotpotQA            EM 26.2            109
Referring Expression Segmentation          RefCOCOg (val (U))  --                 89
Question Answering                         PopQA               EM 31.6            88
Visual Question Answering                  InfoSeek            --                 22
Multimodal Retrieval-Augmented Generation  CRAG-MM (Overall)   Truthfulness 20.5  18
Showing 10 of 16 rows
