ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding

About

Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP$_r$ on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale, while remaining lightweight and inference-efficient.
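
The attention-based soft MIL pooling mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's Expectation Alignment Head: the function names (`expectation_pool`, `token_region_score`), the temperature values, and the choice to pool over regions first and then over tokens are all assumptions made for illustration. The idea shown is that each similarity is weighted by a softmax over the similarities themselves, so the pooled score is an expectation that sits between the mean and the max of the bag.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expectation_pool(similarities, temperature=1.0):
    """Soft MIL pooling: expected similarity under softmax attention weights.

    Low temperature approaches max pooling; high temperature approaches
    mean pooling. The temperature parameter is a hypothetical knob.
    """
    weights = softmax(similarities, temperature)
    return sum(w * s for w, s in zip(weights, similarities))


def token_region_score(sim_matrix, temperature=0.1):
    """Pool a token-by-region similarity matrix into one alignment score.

    sim_matrix[t][r] is the similarity between text token t and image
    region r. Pooling regions first, then tokens, is an assumed order.
    """
    per_token = [expectation_pool(row, temperature) for row in sim_matrix]
    return expectation_pool(per_token, temperature)


if __name__ == "__main__":
    # Toy similarities: 2 text tokens x 3 image regions.
    sims = [[0.1, 0.9, 0.2],
            [0.3, 0.4, 0.8]]
    print(token_region_score(sims))
```

Because the weights come from the similarities themselves, no token-level or region-level labels are needed to decide which instances dominate the score, which is the sense in which the selection is implicit.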

Junyi Hu, Tian Bai, Fengyi Wu, Wenyan Li, Zhenming Peng, Yi Zhang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Detection | COCO 2017 (val) | -- | 2454 |
| Instance Segmentation | COCO 2017 (val) | APm 0.429 | 1144 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 48.9 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | -- | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | -- | 333 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 65.6 | 291 |
| Referring Expression Comprehension | RefCOCOg (test) | -- | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 45.5 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | -- | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | -- | 196 |
Showing 10 of 15 rows
