LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
About
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These limitations stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate them, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; and (2) Interleaved Local Visual Coupling (ILVC), which extracts local features based on segmentation masks and then autoregressively generates local descriptions, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent semantics associated with the <seg> token. To quantify this relationship and the model's potential semantic inference ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.
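To make the ILVC idea of "extracting local features based on segmentation masks" more concrete, the following is a minimal sketch of one common way to do this, masked average pooling. It is an illustration under stated assumptions, not the LIRA implementation: the function name `masked_average_pool`, the nested-list feature layout (height x width x channels), and the choice of averaging are all hypothetical, and the actual method may differ.

```python
from typing import List


def masked_average_pool(
    features: List[List[List[float]]],  # assumed layout: H x W x C
    mask: List[List[int]],              # binary mask: H x W
) -> List[float]:
    """Average the feature vectors at positions where the mask is nonzero.

    Hypothetical helper for illustration; not taken from the LIRA code.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        # Empty mask: return a zero vector (an arbitrary choice for this sketch).
        return total
    return [t / count for t in total]


# A 2 x 2 feature map with 2 channels; the mask selects the diagonal.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
mask = [
    [1, 0],
    [0, 1],
]
local_feature = masked_average_pool(features, mask)
```

In a pipeline of this kind, the pooled vector would be passed to the language model as the local visual input from which a region description is generated.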
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Generalized Referring Expression Segmentation | gRefCOCO (val) | cIoU | 40.9 | 165 |
| Generalized Referring Expression Segmentation | gRefCOCO (testA) | cIoU | 52.4 | 159 |
| Generalized Referring Expression Segmentation | gRefCOCO (testB) | cIoU | 44.9 | 141 |
| Referring Segmentation | RefCOCO (val) | cIoU | 81.8 | 84 |
| Referring Segmentation | RefCOCO (testA) | cIoU | 83.4 | 83 |
| Referring Segmentation | RefCOCOg (val) | cIoU | 78.4 | 72 |
| Reasoning Segmentation | gRefCOCO (testA) | gIoU | 50.4 | 22 |
| Reasoning Segmentation | gRefCOCO (testB) | gIoU | 0.424 | 22 |
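
The two metrics in the table differ in how they aggregate over a dataset. As commonly defined in the referring and reasoning segmentation literature, cIoU is the cumulative intersection over the cumulative union across all samples, while gIoU is the mean of the per-sample IoUs. The sketch below illustrates both on tiny binary masks. The function names are hypothetical, and scoring a sample with an empty union as 1.0 in `giou` is an assumption of this sketch; benchmark implementations may treat that case differently.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # binary mask: H x W


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = 0
    union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def ciou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Cumulative IoU: sum of intersections divided by sum of unions."""
    total_inter = 0
    total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def giou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Mean of per-sample IoUs (not the box-regression 'generalized IoU')."""
    ious = []
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        # Assumption: both masks empty counts as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious) if ious else 0.0


pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # IoU = 1/2
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),  # IoU = 4/4
]
ciou_score = ciou(pairs)  # (1 + 4) / (2 + 4)
giou_score = giou(pairs)  # (0.5 + 1.0) / 2
```

Because cIoU pools pixels across samples, large objects weigh more heavily in it than in gIoU, which weights every sample equally.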