TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

About

Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.

Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem • 2025
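
The abstract describes region tokens only at a high level. As a rough illustration of the general idea, and not the authors' actual procedure, one simple way to turn patch features and segmentation masks into text-comparable region tokens is to average the patch features inside each mask and match the result against text embeddings by cosine similarity. The function names, the flattened patch layout, and the mean-pooling step below are all assumptions made for this sketch; in practice the patch features would come from a frozen image-text model and the masks from SAM2.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool_region_tokens(patch_feats, masks):
    """Average patch features inside each mask (hypothetical helper).

    patch_feats: list of N feature vectors, one per image patch (flattened grid).
    masks: list of R masks, each a list of N booleans marking member patches.
    Returns R unit-length region tokens.
    """
    dim = len(patch_feats[0])
    tokens = []
    for mask in masks:
        members = [f for f, inside in zip(patch_feats, mask) if inside]
        if not members:
            # Empty mask: emit a zero token rather than dividing by zero.
            tokens.append([0.0] * dim)
            continue
        mean = [sum(column) / len(members) for column in zip(*members)]
        tokens.append(l2_normalize(mean))
    return tokens


def label_regions(tokens, text_embeds):
    """Assign each region the index of the most similar text embedding."""
    texts = [l2_normalize(t) for t in text_embeds]
    labels = []
    for token in tokens:
        sims = [sum(a * b for a, b in zip(token, text)) for text in texts]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels
```

Because the pooled tokens live in the same embedding space as the text features, the class vocabulary can be changed at inference time simply by passing different text embeddings, which is the open-vocabulary property the abstract refers to.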

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K | mIoU | 27.3 | 559 |
| Semantic segmentation | Cityscapes | mIoU | 47.35 | 494 |
| Semantic segmentation | COCO Stuff | mIoU | 31.22 | 399 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 53.6 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 48.7 | 348 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 0.564 | 346 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 54.6 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 55.8 | 300 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 44.3 | 244 |
| Semantic segmentation | Pascal Context | mIoU | 46.13 | 217 |
Showing 10 of 22 rows
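
For readers unfamiliar with the segmentation metric, the sketch below computes mean intersection-over-union (mIoU) from flat lists of predicted and ground-truth class indices using the standard definition: per-class IoU is TP / (TP + FP + FN), averaged over classes that occur. The function name, the `ignore_index` default of 255, and the choice to skip absent classes are assumptions of this sketch; the exact evaluation protocol behind the numbers listed here is not stated on this page.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class indices.
    Pixels whose ground-truth label equals ignore_index are skipped.
    Returns a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # True positive: counts toward both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # False negative for the true class, false positive for the predicted one.
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

As a small worked case, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)` gives class IoUs of 1/2 and 2/3, so the mean is 7/12.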
