
Language Adaptive Weight Generation for Multi-task Visual Grounding

About

Despite impressive performance in visual grounding, prevailing approaches usually use the visual backbone passively, i.e., the visual backbone extracts features with fixed weights and without expression-related hints. This passive perception may lead to mismatches (e.g., redundant or missing features), limiting further performance improvement. Ideally, the visual backbone should actively extract visual features, since the expression already provides a blueprint of the desired visual features. Active perception can take the expression as a prior to extract relevant visual features, which can effectively alleviate these mismatches. Inspired by this, we propose an active-perception Visual Grounding framework based on Language Adaptive Weights, called VG-LAW. The visual backbone serves as an expression-specific feature extractor through dynamic weights generated for each expression. Because the language-aware visual backbone extracts specific and relevant visual features, VG-LAW does not require additional modules for cross-modal interaction. Together with a neat multi-task head, VG-LAW can perform referring expression comprehension and segmentation jointly. Extensive experiments on four representative datasets, i.e., RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, validate the effectiveness of the proposed framework and demonstrate state-of-the-art performance.
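
The idea of expression-specific dynamic weights can be illustrated with a toy example. The sketch below is not the VG-LAW implementation: it assumes a single linear projection whose weight matrix is a fixed base matrix plus a linear function of a text embedding, and every name in it (`generate_weights`, `language_aware_projection`, `base`, `generator`) is hypothetical. It only shows how the same visual token can be projected differently depending on the expression.

```python
# Toy sketch of language-conditioned weights (assumption: W = base + generator . text_emb).
# This is an illustration only, not the architecture described in the paper.

def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def generate_weights(base, generator, text_emb):
    """Build an expression-specific weight matrix.

    base:      [out][in] fixed weights shared by all expressions
    generator: [out][in][text_dim] maps the text embedding to a per-entry offset
    text_emb:  [text_dim] embedding of the referring expression
    """
    return [
        [
            b + sum(g * t for g, t in zip(gen_entry, text_emb))
            for b, gen_entry in zip(base_row, gen_row)
        ]
        for base_row, gen_row in zip(base, generator)
    ]


def language_aware_projection(visual_tokens, base, generator, text_emb):
    """Project every visual token with weights generated for this expression."""
    weights = generate_weights(base, generator, text_emb)
    return [matvec(weights, token) for token in visual_tokens]


# Hypothetical 2-d example: the base projection is the identity, and each
# text dimension amplifies one visual channel.
base = [[1.0, 0.0], [0.0, 1.0]]
generator = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
tokens = [[3.0, 4.0]]

out_a = language_aware_projection(tokens, base, generator, [1.0, 0.0])
out_b = language_aware_projection(tokens, base, generator, [0.0, 1.0])
print(out_a)  # [[6.0, 4.0]]
print(out_b)  # [[3.0, 8.0]]
```

With a zero text embedding the projection falls back to the fixed base weights, which corresponds to the passive setting the abstract contrasts against.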

Wei Su, Peihan Miao, Huanzhang Dou, Gaoang Wang, Liang Qiao, Zheyang Li, Xi Li • 2023

Related benchmarks

Task                                  Dataset          Result          Rank
Multivariate Forecasting              ETTh1            MSE 0.479       686
Multivariate Time-series Forecasting  ETTm1            MSE 0.364       466
Multivariate Time-series Forecasting  ETTm2            MSE 0.207       389
Referring Expression Comprehension    RefCOCO+ (val)   Accuracy 76.4   354
Multivariate Forecasting              ETTh2            MSE 0.4         350
Referring Expression Comprehension    RefCOCO (val)    Accuracy 86.6   344
Referring Expression Comprehension    RefCOCO (testA)  Accuracy 0.893  342
Multivariate Time-series Forecasting  Weather          MSE 0.202       340
Referring Expression Comprehension    RefCOCOg (test)  Accuracy 77     300
Referring Expression Comprehension    RefCOCOg (val)   Accuracy 76.9   300
Showing 10 of 45 rows
