
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

About

Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities: for example, the state-of-the-art model Qwen2-VL achieves only a 43.9 recall rate on the COCO dataset, which limits many tasks that require the combination of perception and understanding. In this work, we aim to bridge this perception gap from both the model-design and data-development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM and let it output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that LLMs handle more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which possesses multiple granularities to support the joint training of perception and understanding. After a three-stage training approach, ChatRex demonstrates strong perception and understanding performance, and the combination of these two capabilities also unlocks many attractive applications, demonstrating their complementary roles in an MLLM. Code is available at https://github.com/IDEA-Research/ChatRex.
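The core of the decoupled design is that detection becomes a retrieval problem: the LLM selects indices from a fixed set of proposal boxes rather than regressing coordinates. A minimal sketch of how ground-truth boxes could be mapped to proposal indices (e.g. to build retrieval-style training targets) is shown below. All function names and the IoU-matching threshold are illustrative assumptions, not taken from the ChatRex codebase.

```python
# Hedged sketch: turning box regression into index retrieval, assuming
# a universal proposal network has already produced candidate boxes.
# Names and the 0.5 IoU threshold are illustrative, not from ChatRex.

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def boxes_to_indices(gt_boxes, proposals, iou_thresh=0.5):
    """Replace each ground-truth box with the index of its best-matching
    proposal, so a model can answer with indices instead of coordinates."""
    indices = []
    for gt in gt_boxes:
        scores = [iou(gt, p) for p in proposals]
        best = max(range(len(proposals)), key=lambda i: scores[i])
        if scores[best] >= iou_thresh:  # drop boxes no proposal covers
            indices.append(best)
    return indices
```

Under this framing, the model's output vocabulary only needs index tokens, which sidesteps the coordinate-regression task that LLMs handle poorly.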

Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang • 2024

Related benchmarks

Task | Dataset | Result | Rank
Referring Expression Comprehension | RefCOCO+ (val) | Accuracy: 89.8 | 345
Referring Expression Comprehension | RefCOCO (val) | -- | 335
Referring Expression Comprehension | RefCOCO (testA) | -- | 333
Referring Expression Comprehension | RefCOCOg (val) | Accuracy: 89.8 | 291
Referring Expression Comprehension | RefCOCOg (test) | -- | 291
Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy: 79.3 | 235
Referring Expression Comprehension | RefCOCO+ (testA) | -- | 207
Referring Expression Comprehension | RefCOCO (testB) | -- | 196
Object Detection | LVIS (val) | -- | 141
Referring Expression Comprehension | RefCOCO v1 (val) | Top-1 Accuracy: 91 | 49

Showing 10 of 14 rows.
