
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

About

Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities: for example, the state-of-the-art model Qwen2-VL achieves only a 43.9 recall rate on the COCO dataset, which limits many tasks that require the combination of perception and understanding. In this work, we aim to bridge this perception gap from both the model-design and data-development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM and let it output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that LLMs handle more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which possesses multiple granularities to support the joint training of perception and understanding. After a three-stage training approach, ChatRex demonstrates strong perception and understanding performance, and the combination of these two capabilities also unlocks many attractive applications, demonstrating their complementary roles in an MLLM. Code is available at https://github.com/IDEA-Research/ChatRex.
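The core of the decoupled design is that detection becomes a retrieval problem: the LLM selects indices from a fixed set of proposal boxes rather than regressing coordinates. A minimal sketch of how ground-truth boxes could be mapped to proposal indices (e.g. to build retrieval-style training targets) is shown below. All function names and the IoU-matching threshold are illustrative assumptions, not taken from the ChatRex codebase.

```python
# Hedged sketch: turning box regression into index retrieval, assuming
# a universal proposal network has already produced candidate boxes.
# Names and the 0.5 IoU threshold are illustrative, not from ChatRex.

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def boxes_to_indices(gt_boxes, proposals, iou_thresh=0.5):
    """Replace each ground-truth box with the index of its best-matching
    proposal, so a model can answer with indices instead of coordinates."""
    indices = []
    for gt in gt_boxes:
        scores = [iou(gt, p) for p in proposals]
        best = max(range(len(proposals)), key=lambda i: scores[i])
        if scores[best] >= iou_thresh:  # drop boxes no proposal covers
            indices.append(best)
    return indices
```

Under this framing, the model's output vocabulary only needs index tokens, which sidesteps the coordinate-regression task that LLMs handle poorly.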

Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang • 2024

Related benchmarks

Task | Dataset | Result | Rank
Referring Expression Comprehension | RefCOCO+ (val) | Accuracy: 89.8 | 345
Referring Expression Comprehension | RefCOCO (val) | -- | 335
Referring Expression Comprehension | RefCOCO (testA) | -- | 333
Referring Expression Comprehension | RefCOCOg (val) | Accuracy: 89.8 | 291
Referring Expression Comprehension | RefCOCOg (test) | -- | 291
Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy: 79.3 | 235
Referring Expression Comprehension | RefCOCO+ (testA) | -- | 207
Referring Expression Comprehension | RefCOCO (testB) | -- | 196
Object Detection | LVIS (val) | -- | 141
Referring Expression Comprehension | RefCOCO v1 (val) | Top-1 Accuracy: 91 | 49

Showing 10 of 14 rows.
