Region-based Cluster Discrimination for Visual Representation Learning
About
Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.
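To make the "single classification framework" idea concrete, the sketch below treats each cluster center as a class weight and scores one region embedding against all centers with a temperature-scaled softmax cross-entropy. This is a minimal illustrative toy, not the released implementation: the function name, the plain-Python vectors, and the temperature value 0.07 are all assumptions for exposition.

```python
import math


def region_cluster_loss(region_feat, centers, target_idx, temperature=0.07):
    """Toy region-cluster discrimination loss (illustrative sketch).

    Treats each cluster center as a class prototype: the loss is the
    softmax cross-entropy over cosine similarities between one region
    embedding and all centers, scaled by a temperature.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    f = normalize(region_feat)
    # Cosine-similarity logits against every cluster center.
    logits = [dot(f, normalize(c)) / temperature for c in centers]
    # Numerically stable log-softmax for the target cluster.
    m = max(logits)
    log_sum = math.log(sum(math.exp(l - m) for l in logits))
    return -((logits[target_idx] - m) - log_sum)
```

A region embedding aligned with its assigned center yields a near-zero loss, while assigning it to a distant center yields a large one; at scale, this classification view is what allows the centers to be sharded across devices for distributed training.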
Related benchmarks
| Task | Dataset | Result (cIoU) | Rank |
|---|---|---|---|
| Referring Expression Segmentation | RefCOCO (testA) | 85.3 | 217 |
| Referring Expression Segmentation | RefCOCO+ (val) | 79.4 | 201 |
| Referring Expression Segmentation | RefCOCO (testB) | 81.7 | 191 |
| Referring Expression Segmentation | RefCOCO+ (testA) | 82.8 | 190 |
| Referring Expression Segmentation | RefCOCO (val) | 83.5 | 190 |
| Referring Expression Segmentation | RefCOCO+ (testB) | 75.4 | 188 |
| Referring Expression Segmentation | RefCOCOg (val) | 79.8 | 107 |
| Referring Expression Segmentation | RefCOCOg (test) | 80.4 | 78 |
| Referring Expression Segmentation | RefCOCO UMD (val) | 83.5 | 50 |