FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions
About
While Multimodal Large Language Models (MLLMs) have advanced rapidly, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks because visual details are lost to low-resolution pretraining and to the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm. First, the encoder is trained from scratch at a high native resolution on billions of global recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, using our curated FineCap-450M dataset, which comprises over 450 million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT can serve as a powerful new baseline for fine-grained visual perception.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet V2 | Top-1 Acc | 75.5 | 611 |
| Text-to-Image Retrieval | Flickr30K | -- | -- | 531 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 55.05 | 431 |
| Image Classification | ImageNet-ReaL | Precision@1 | 88.7 | 211 |
| Text-to-Image Retrieval | MS-COCO | -- | -- | 151 |
| Multimodal Reasoning | MMMU (val) | Accuracy | 44.33 | 144 |
| Image-to-Text Retrieval | MS-COCO | -- | -- | 132 |
| Multimodal Mathematical Reasoning | MathVista mini | Accuracy | 0.677 | 90 |
| Image-to-Text Retrieval | DCI | -- | -- | 79 |
| Text-to-Image Retrieval | DCI | -- | -- | 79 |