
Region-based Cluster Discrimination for Visual Representation Learning

About

Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations limits their effectiveness on dense prediction tasks such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.

Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, Ismail Elezi, Jiankang Deng • 2025

Related benchmarks

Task | Dataset | Result | Rank
Referring Expression Segmentation | RefCOCO (testA) | cIoU 85.3 | 315
Referring Expression Segmentation | RefCOCO+ (testA) | cIoU 82.8 | 288
Referring Expression Segmentation | RefCOCO+ (val) | cIoU 79.4 | 272
Referring Expression Segmentation | RefCOCO (val) | cIoU 83.5 | 261
Referring Expression Segmentation | RefCOCO (testB) | cIoU 81.7 | 259
Referring Expression Segmentation | RefCOCO+ (testB) | cIoU 75.4 | 256
Referring Expression Segmentation | RefCOCOg (val) | cIoU 79.8 | 172
Referring Expression Segmentation | RefCOCOg (test) | cIoU 80.4 | 166
Referring Expression Segmentation | RefCOCO UMD (val) | cIoU 83.5 | 50
Multimodal Understanding | LLaVA-NeXT Multimodal Understanding Suite | ChartQA Accuracy 82.6 | 9
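
The segmentation rows above report cIoU. Assuming this denotes cumulative IoU as commonly used for referring expression segmentation (total intersection divided by total union, accumulated over the whole split rather than averaged per image), a minimal sketch of the computation might look like the following. The function name `cumulative_iou` and the toy masks are illustrative and are not taken from the paper or its released code.

```python
def cumulative_iou(pred_masks, gt_masks):
    """Cumulative IoU: summed intersection over summed union across all samples.

    Each mask is a 2D list of 0/1 values. This is an illustrative helper,
    not the evaluation script used by the authors.
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                total_inter += int(bool(p) and bool(g))  # pixel is in both masks
                total_union += int(bool(p) or bool(g))   # pixel is in either mask
    # Guard against an empty union (no foreground anywhere).
    return total_inter / total_union if total_union else 0.0


# Two toy 2x2 samples (hypothetical data).
preds = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
]
gts = [
    [[1, 0], [0, 0]],
    [[1, 0], [1, 0]],
]
score = cumulative_iou(preds, gts)
print(f"cIoU = {score:.3f}")  # 2 intersecting pixels / 4 union pixels = 0.500
```

Because intersections and unions are summed before dividing, larger objects contribute more to the score than they would under a per-image mean IoU.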
