TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

About

Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.
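The idea of combining segmentation masks with image-text features can be illustrated with a small sketch. It is not the authors' implementation: it assumes that region tokens are formed by simply averaging the patch features under each mask, whereas the actual pooling used by TextRegion may differ. All names (`pool_region_tokens`, `classify_regions`) and the toy data are hypothetical. In practice the patch features and text embeddings would come from a frozen image-text model and the masks from SAM2.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool_region_tokens(patch_features, masks):
    """Average the patch features covered by each mask into one region token.

    patch_features: list of N feature vectors, one per image patch.
    masks: list of R binary lists of length N (1 = patch belongs to the region).
    Assumption: plain mean pooling, used here only for illustration.
    """
    tokens = []
    for mask in masks:
        selected = [f for f, m in zip(patch_features, mask) if m]
        if not selected:
            raise ValueError("mask covers no patches")
        dim = len(selected[0])
        mean = [sum(f[d] for f in selected) / len(selected) for d in range(dim)]
        tokens.append(l2_normalize(mean))
    return tokens


def classify_regions(region_tokens, text_embeddings, labels):
    """Assign each region the label whose text embedding has the highest cosine similarity."""
    text_unit = [l2_normalize(t) for t in text_embeddings]
    results = []
    for token in region_tokens:
        scores = [sum(a * b for a, b in zip(token, t)) for t in text_unit]
        best = max(range(len(labels)), key=scores.__getitem__)
        results.append(labels[best])
    return results


# Toy example: 4 patches with 2-D features, 2 region masks, 2 class prompts.
patch_features = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
labels = ["cat", "grass"]

tokens = pool_region_tokens(patch_features, masks)
predicted = classify_regions(tokens, text_embeddings, labels)
print(predicted)  # ['cat', 'grass']
```

Because the label set is just a list of text embeddings, swapping in different class names requires no retraining, which is the sense in which such region tokens preserve open-vocabulary capability.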

Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Semantic segmentation	ADE20K	mIoU	27.3	559
Semantic segmentation	Cityscapes	mIoU	47.35	494
Semantic segmentation	COCO Stuff	mIoU	31.22	399
Referring Expression Comprehension	RefCOCO+ (val)	Accuracy	53.6	354
Referring Expression Comprehension	RefCOCO (val)	Accuracy	48.7	348
Referring Expression Comprehension	RefCOCO (testA)	Accuracy	0.564	346
Referring Expression Comprehension	RefCOCOg (test)	Accuracy	54.6	300
Referring Expression Comprehension	RefCOCOg (val)	Accuracy	55.8	300
Referring Expression Comprehension	RefCOCO+ (testB)	Accuracy	44.3	244
Semantic segmentation	Pascal Context	mIoU	46.13	217

Showing 10 of 22 rows
