ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding
About
Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP$_r$ on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale, while remaining lightweight and inference-efficient.
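The abstract describes attention-based soft MIL pooling over token-region similarities only at a high level. The sketch below is one possible reading and not the authors' implementation. The function name `soft_mil_pool`, the softmax-over-tokens weighting, the cosine similarity, and the `temperature` parameter are all assumptions made for illustration. It treats each region as a bag of token similarities and pools them by their softmax-weighted expectation, so informative tokens are emphasized without token-level labels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_mil_pool(region_feats, token_feats, temperature=0.1):
    """Hypothetical soft MIL pooling over token-region similarities.

    region_feats: (R, D) array of region embeddings.
    token_feats:  (T, D) array of text-token embeddings.
    Returns per-region scores (R,) and attention weights (R, T).
    """
    # Cosine similarity between every region and every token (assumed metric).
    r = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    t = token_feats / np.linalg.norm(token_feats, axis=-1, keepdims=True)
    sim = r @ t.T  # shape (R, T)

    # Attention over tokens: higher-similarity tokens receive more weight.
    attn = softmax(sim / temperature, axis=-1)

    # Expected similarity under the attention distribution (soft MIL pooling).
    scores = (attn * sim).sum(axis=-1)
    return scores, attn

# Toy usage with random features (illustrative values only).
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 16))
tokens = rng.normal(size=(7, 16))
scores, attn = soft_mil_pool(regions, tokens)
```

Under these assumptions the pooled score interpolates between mean pooling (large `temperature`) and max pooling (`temperature` near zero), which is the usual motivation for a soft MIL operator.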
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | 2454 |
| Instance Segmentation | COCO 2017 (val) | APm 0.429 | 1144 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 48.9 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | -- | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | -- | 333 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 65.6 | 291 |
| Referring Expression Comprehension | RefCOCOg (test) | -- | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 45.5 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | -- | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | -- | 196 |