F-LMM: Grounding Frozen Large Multimodal Models

About

Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AIs' understanding of the visual world and their interaction with humans. However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit grounding and segmentation datasets. Such a design would inevitably cause a catastrophic diminution in the indispensable conversational capability of general AI assistants. In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. To address this issue, we present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations -- a straightforward yet effective design based on the fact that word-pixel correspondences conducive to visual grounding inherently exist in the attention mechanism of well-trained LMMs. Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits, which a SAM-based mask refiner can further optimise. Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data, but achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while completely preserving LMMs' original conversational ability. Additionally, with instruction-following ability preserved and grounding ability obtained, F-LMM can be directly applied to complex tasks like reasoning segmentation, grounded conversation generation and visual chain-of-thought reasoning. Our code can be found at https://github.com/wusize/F-LMM.
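To make the core idea more concrete, the sketch below illustrates how word-to-image-patch attention weights from a frozen LMM could be turned into coarse mask logits by a small trainable CNN, as described above. This is a minimal illustration, not the authors' implementation (see the repository for that); the class name, layer sizes, and tensor layout are assumptions, and the SAM-based mask refiner is omitted.

```python
import torch
import torch.nn as nn

class AttentionMaskHead(nn.Module):
    """Minimal sketch (hypothetical names/sizes): translate word-to-patch
    attention weights from a frozen LMM into coarse mask logits using a
    few trainable convolutional layers, following the high-level idea in
    the abstract. The LMM itself stays frozen; only this head is trained."""

    def __init__(self, num_heads: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_heads, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),  # per-pixel mask logit
        )

    def forward(self, attn: torch.Tensor, grid_hw: tuple[int, int]) -> torch.Tensor:
        # attn: (num_heads, num_word_tokens, num_patch_tokens) attention
        # weights for the phrase to be grounded, read out of the frozen LMM.
        h, w = grid_hw
        # Average over the phrase's word tokens, then reshape the per-patch
        # scores of each attention head into a 2D map.
        maps = attn.mean(dim=1).reshape(1, -1, h, w)  # (1, num_heads, h, w)
        return self.net(maps)                          # (1, 1, h, w) mask logits


if __name__ == "__main__":
    # Toy usage: 32 attention heads, a 3-word phrase, a 24x24 patch grid.
    head = AttentionMaskHead(num_heads=32)
    attn = torch.rand(32, 3, 24 * 24).softmax(dim=-1)
    logits = head(attn, (24, 24))
    print(logits.shape)  # torch.Size([1, 1, 24, 24])
```

In the paper's setting, the resulting coarse logits would then be refined by a SAM-based mask refiner to obtain the final segmentation mask.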

Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy • 2024

Related benchmarks

| Task                              | Dataset            | Result    | Rank |
|-----------------------------------|--------------------|-----------|------|
| Referring Expression Segmentation | RefCOCO+ (val)     | cIoU 66.4 | 201  |
| Referring Image Segmentation      | RefCOCO (val)      | --        | 197  |
| Referring Expression Segmentation | RefCOCO (val)      | cIoU 76.1 | 190  |
| Referring Image Segmentation      | RefCOCO+ (val)     | --        | 117  |
| Referring Expression Segmentation | RefCOCOg (val)     | cIoU 67.1 | 107  |
| Referring Expression Segmentation | RefCOCOg (val (U)) | cIoU 70.1 | 89   |
| Referring Image Segmentation      | RefCOCOg (val)     | oIoU 67.1 | 37   |
