Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

About

While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by its fixed pre-trained visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any resolution grounding and referring: A flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: By integrating the additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: Besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.

Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang • 2024

Related benchmarks

Task | Dataset | Result | Rank
Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 87.4 | 345
Referring Expression Comprehension | RefCOCO (val) | Accuracy 92.6 | 335
Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.95 | 333
Referring Expression Comprehension | RefCOCOg (test) | Accuracy 90 | 291
Referring Expression Comprehension | RefCOCOg (val) | Accuracy 89.4 | 291
Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 81.4 | 235
Referring Expression Comprehension | RefCOCO+ (testA) | -- | 207
Referring Expression Comprehension | RefCOCO (testB) | -- | 196
Referring Expression Comprehension | RefCOCO+ (test-A) | Accuracy 92.1 | 172
Referring Expression Comprehension | RefCOCO+ (test-B) | Accuracy 81.4 | 167
Showing 10 of 26 rows
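
This page does not define how the accuracy figures are computed. A common convention for referring expression comprehension on RefCOCO-style datasets counts a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box reaches 0.5. The sketch below illustrates that convention only. The 0.5 threshold, the (x1, y1, x2, y2) box format, and the helper names `iou` and `rec_accuracy` are assumptions made for illustration, not details taken from this page or from the Ferret-v2 evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Sequence[Box],
    targets: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, see the note above
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
    ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(rec_accuracy(predictions, ground_truth))
```

The function returns a percentage, which matches the scale of most rows in the table. One row reports 0.95, which may be on a 0 to 1 scale instead.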
