VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

About

Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two settings: open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors: sparse priors for instance-level spatial anchoring and dense priors for class-aware local context. These priors are integrated with the original detection prompts via Memory-Guided Prompt Refinement, giving a shared retrieval-and-refinement mechanism that supports both open-vocabulary and open-ended inference. Extensive zero-shot experiments on LVIS show that VL-SAM-v3 consistently improves detection performance under both open-vocabulary and open-ended inference, with particularly strong gains on rare categories. Moreover, experiments with a stronger open-vocabulary detector (SAM3) validate the generality of the proposed retrieval-and-refinement mechanism.
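The abstract does not say how prototypes are retrieved from the memory bank, so the following is only a minimal sketch of one plausible reading: a nearest-neighbor lookup by cosine similarity over stored embeddings. The function names, the `(label, embedding)` layout of `memory_bank`, and the toy vectors are all hypothetical and are not taken from VL-SAM-v3. The conversion of retrieved prototypes into sparse and dense priors is not covered.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_prototypes(query, memory_bank, k=2):
    """Return the k memory entries most similar to the query embedding.

    memory_bank is a list of (label, embedding) pairs. This layout is an
    assumption made for illustration, not the paper's data structure.
    """
    scored = [(cosine(query, emb), label) for label, emb in memory_bank]
    scored.sort(key=lambda t: t[0], reverse=True)  # highest similarity first
    return [(label, score) for score, label in scored[:k]]


# Toy memory bank: three stored prototype embeddings (made-up values).
memory_bank = [
    ("mug", [1.0, 0.0, 0.0]),
    ("mug", [0.9, 0.1, 0.0]),
    ("kettle", [0.0, 1.0, 0.0]),
]

# Query embedding for a candidate category (also made up).
top = retrieve_prototypes([1.0, 0.0, 0.0], memory_bank, k=2)
```

Because the bank is non-parametric, new prototypes can be added by appending entries to the list without retraining anything, which is the property such a lookup is meant to illustrate.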

Chih-Chung Liu, Zhiwei Lin, Yongtao Wang • 2026

Related benchmarks

Task	Dataset	Result
Object Detection	LVIS (minival)	AP 51.7	188
Object Detection	LVIS (val)	mAP 54.1	174
Object Detection	LVIS mini (val)	mAP 60.2	120
Object Detection	COCO	AP 56.8	21
Open-ended instance segmentation	LVIS mini (val)	AP (Mask) 39.9	3

