
TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

About

Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.
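The idea of turning segmentation masks into text-aligned region tokens can be illustrated with a minimal sketch. It assumes that patch-level features from an image-text model and binary masks (for example from SAM2, resized to the patch grid) are already available as arrays, and it uses plain mask-average pooling, which is an illustrative stand-in and not necessarily the pooling used in the paper. The function names `region_tokens` and `label_regions` are hypothetical, not part of the released code.

```python
import numpy as np

def region_tokens(patch_feats, masks):
    """Pool patch features inside each mask into one token per region.

    patch_feats: (H, W, D) patch embeddings from an image-text model (assumed given).
    masks:       (R, H, W) binary masks at patch resolution (assumed given).
    Returns (R, D) L2-normalized region tokens.
    """
    num_regions = masks.shape[0]
    dim = patch_feats.shape[-1]
    flat_feats = patch_feats.reshape(-1, dim)                     # (H*W, D)
    flat_masks = masks.reshape(num_regions, -1).astype(np.float64)  # (R, H*W)
    area = flat_masks.sum(axis=1, keepdims=True)
    pooled = flat_masks @ flat_feats / np.maximum(area, 1.0)      # mask-average pooling
    norm = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norm, 1e-12)

def label_regions(tokens, text_embeds):
    """Assign each region the index of its most similar text embedding (cosine)."""
    norm = np.linalg.norm(text_embeds, axis=1, keepdims=True)
    text = text_embeds / np.maximum(norm, 1e-12)
    return (tokens @ text.T).argmax(axis=1)

# Toy data: a 2x2 patch grid with 2-d features; left column and right column differ.
patch_feats = np.zeros((2, 2, 2))
patch_feats[:, 0] = [1.0, 0.0]
patch_feats[:, 1] = [0.0, 1.0]
masks = np.array([[[1, 0], [1, 0]],    # region 0 covers the left column
                  [[0, 1], [0, 1]]])   # region 1 covers the right column
text_embeds = np.array([[0.0, 2.0],    # hypothetical embedding for prompt 0
                        [3.0, 0.0]])   # hypothetical embedding for prompt 1

tokens = region_tokens(patch_feats, masks)
labels = label_regions(tokens, text_embeds)
```

Because the tokens live in the same space as the text embeddings, the same similarity scores could in principle drive segmentation (label every mask) or referring expression comprehension (pick the mask that best matches one query).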

Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 53.6 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 48.7 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.564 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 54.6 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 55.8 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 44.3 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy 60.8 | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy 40.8 | 196 |
| Multiple Object Grounding | Reasoning Segmentation Short query | gIoU 28.5 | 10 |
| Multiple Object Grounding | Reasoning Segmentation Long query | gIoU 47.2 | 10 |
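For readers unfamiliar with the two metrics in the benchmark results, the sketch below shows how they are commonly computed. It assumes the usual conventions: a referring expression prediction counts as correct when its box IoU with the ground truth is at least 0.5, and gIoU is the mean of per-sample mask IoUs. The listing does not state which exact protocol produced its numbers, so the threshold, the box format, and the helper names here are assumptions.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(pred_boxes, gt_boxes, thresh=0.5):
    """Percentage of predictions whose IoU with the ground truth is >= thresh (assumed 0.5)."""
    hits = sum(box_iou(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)

def giou(pred_masks, gt_masks):
    """Mean of per-sample mask IoUs, in percent.

    A sample where both masks are empty is scored as 1.0 here; that is an
    assumption of this sketch, not a documented benchmark rule.
    """
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        p, g = np.asarray(p, dtype=bool), np.asarray(g, dtype=bool)
        union = np.logical_or(p, g).sum()
        ious.append(np.logical_and(p, g).sum() / union if union else 1.0)
    return 100.0 * float(np.mean(ious))
```

Both functions return percentages, matching the scale of most entries in the results.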
