Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport

About

Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) the local semantics of CLIP are disrupted by its global pre-training objectives, resulting in unreliable regional predictions; (2) the matching property between image regions and candidate labels has been neglected, with existing methods relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that addresses both issues. To tackle the first problem, we propose the Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT), which formulates the task as an optimal transport problem to suppress meaningless matching to non-GT labels. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost existing methods. Code: https://github.com/EricTan7/RAM.
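To make the contrast between average pooling and transport-based matching concrete, here is a minimal sketch of generic entropic optimal transport (Sinkhorn iterations) between region features and label embeddings. It is not the paper's KCOT: the uniform marginals, the cosine cost, the `eps` value, the toy 2-D features, and all function names are assumptions chosen for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sinkhorn(cost, row_mass, col_mass, eps=0.1, n_iters=200):
    """Entropy-regularized OT via Sinkhorn scaling.

    Returns a plan T (list of rows) whose row sums approach row_mass and
    whose column sums equal col_mass up to float error (v is updated last).
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Made-up toy data: 3 region features and 2 label embeddings in 2-D.
regions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [[1.0, 0.0], [0.0, 1.0]]

sim = [[cosine(r, l) for l in labels] for r in regions]
cost = [[1.0 - s for s in row] for row in sim]  # low cost = high similarity

# Assumption: uniform mass on regions and on labels.
row_mass = [1.0 / len(regions)] * len(regions)
col_mass = [1.0 / len(labels)] * len(labels)
plan = sinkhorn(cost, row_mass, col_mass)

# Per-label score: similarities weighted by the transport plan,
# normalized by the label's mass so scores are comparable to a mean.
ot_scores = [
    sum(plan[i][j] * sim[i][j] for i in range(len(regions))) / col_mass[j]
    for j in range(len(labels))
]
# Baseline: plain average pooling over all regions.
avg_scores = [
    sum(sim[i][j] for i in range(len(regions))) / len(regions)
    for j in range(len(labels))
]
print("OT-weighted:", ot_scores)
print("Average-pooled:", avg_scores)
```

In this toy setup the plan concentrates each label's mass on the regions that resemble it, so irrelevant regions contribute little to the score, whereas average pooling weights every region equally.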

Hao Tan, Zichang Tan, Jun Li, Ajian Liu, Jun Wan, Zhen Lei • 2025

Related benchmarks

Task	Dataset	Result
Multi-Label Classification	NUS-WIDE (test)	mAP: 53.33	124
Multi-Label Classification	VOC 07	mAP: 87.89	73
Multi-label recognition	PASCAL VOC 2007 (test)	Avg. mAP: 85.67	44
Multi-Label Classification	NUS-WIDE 925/81 (unseen)	mAP (Mean Average Precision): 53.2	43
Multi-Label Classification	NUS-WIDE	mAP: 50.52	36
Multi-label image recognition	COCO 2014	mAP: 13.84	15
Multi-label recognition	COCO 2014 (test)	mAP: 61.08	12
Generalized Zero-Shot Learning	NUS-WIDE	mAP: 50.72	11
Generalized Zero-Shot Learning	COCO 2014	mAP: 50.72	11
Multi-label recognition	NUS-WIDE seen & unseen	F1 Score @ 3: 23.5	10

Showing 10 of 18 rows
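
Most results in the table are reported as mAP. As a rough illustration of how that metric is typically computed for multi-label recognition, the sketch below averages per-class average precision over classes. The exact protocol behind each benchmark row is not stated here, so the AP variant (non-interpolated), the handling of classes without positives, the function names, and the toy scores are all assumptions.

```python
def average_precision(scores, targets):
    """Non-interpolated AP for one class.

    Ranks images by descending score and averages precision@k over the
    ranks k at which a positive image appears.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, idx in enumerate(order, start=1):
        if targets[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a class with no positives contributes AP = 0.
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(score_matrix, target_matrix):
    """mAP in percent; rows are images, columns are classes."""
    n_classes = len(score_matrix[0])
    aps = []
    for c in range(n_classes):
        class_scores = [row[c] for row in score_matrix]
        class_targets = [row[c] for row in target_matrix]
        aps.append(average_precision(class_scores, class_targets))
    return 100.0 * sum(aps) / len(aps)

# Made-up toy predictions: 4 images, 2 classes.
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
targets = [[1, 0], [0, 1], [1, 1], [0, 0]]

map_percent = mean_average_precision(scores, targets)
print(f"mAP = {map_percent:.2f}")
```

For the toy data, class 0 has positives at ranks 1 and 3 (AP = (1 + 2/3) / 2) and class 1 has positives at ranks 1 and 2 (AP = 1), so the mean is about 91.67.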
