Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

About

We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation and generation of diverse, region-specific semantic outputs, including categories, label definitions, functional explanations, and detailed captions. A key component, the Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of 1.5M image and 0.6M video region-semantic annotations, including novel region-level streaming video caption data. PAM is designed to be lightweight and efficient while also demonstrating strong performance across a diverse range of region understanding tasks. It runs 1.2-2.4x faster and consumes less GPU memory than prior approaches, offering a practical solution for real-world applications. We believe that our effective approach will serve as a strong baseline for future research in region-level visual understanding.

Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, Hongsheng Li • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Reasoning	MMStar	--	143
Real-world Multimodal Reasoning	RealworldQA	Accuracy 1.7	62
Multimodal Reasoning	MMVP	--	26
Video Referring	VideoRefer-Bench-D	SC 3.92	23
Region-level captioning	RefCOCOg (test)	CIDEr 143.1	18
Category-level image recognition	LVIS	Similarity Score 88.6	18
Visual Question Answering	GAR-Bench-VQA	Overall VQA Score 2.4	17
Region Captioning	VideoRefer-D (test)	Average Score 3.14	16
Localized relational captioning	GAR-Bench Cap	Overall Score 21.1	15
Open world region level image recognition	PACO	Semantic Similarity 87.4	9

Showing 10 of 21 rows
