
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

About

Previous open-source large multimodal models (LMMs) have faced several limitations: (1) they often lack native integration, requiring adapters to align visual representations with pre-trained large language models (LLMs); (2) many are restricted to single-modal generation; (3) while some support multimodal generation, they rely on separate diffusion models for visual modeling and generation. To mitigate these limitations, we present Anole, an open, autoregressive, native large multimodal model for interleaved image-text generation. We build Anole from Meta AI's Chameleon, adopting an innovative fine-tuning strategy that is both data-efficient and parameter-efficient. Anole demonstrates high-quality, coherent multimodal generation capabilities. We have open-sourced our model, training framework, and instruction tuning data.
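The parameter-efficient fine-tuning described above can be illustrated with a toy sketch: freeze the entire pre-trained model and update only the output-head rows that correspond to image tokens, so the model learns to emit visual tokens without disturbing its text abilities. This is a minimal illustration under assumed details — the model, vocabulary split, and sizes here are invented for the example, not Anole's actual architecture or configuration.

```python
# Sketch of freezing a model and training only the output-head rows
# for image tokens (all names and sizes here are illustrative).
import torch
import torch.nn as nn

VOCAB = 32                                # toy vocab: text ids 0..23, image ids 24..31
IMAGE_TOKEN_IDS = torch.arange(24, 32)

class ToyLM(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d)
        self.body = nn.Linear(d, d)
        self.lm_head = nn.Linear(d, VOCAB, bias=False)

    def forward(self, ids):
        return self.lm_head(torch.tanh(self.body(self.embed(ids))))

model = ToyLM()

# Freeze every parameter...
for p in model.parameters():
    p.requires_grad = False
# ...then re-enable the output head, with a gradient hook that zeroes
# the rows of non-image tokens so only image-token logits are trained.
model.lm_head.weight.requires_grad = True
mask = torch.zeros(VOCAB, 1)
mask[IMAGE_TOKEN_IDS] = 1.0
model.lm_head.weight.register_hook(lambda g: g * mask)

before = model.lm_head.weight.detach().clone()
opt = torch.optim.SGD([model.lm_head.weight], lr=0.1)
ids = torch.randint(0, VOCAB, (4, 8))
targets = torch.randint(24, 32, (4, 8))   # pretend image-token targets
loss = nn.functional.cross_entropy(model(ids).view(-1, VOCAB), targets.view(-1))
loss.backward()
opt.step()

after = model.lm_head.weight.detach()
# Text-token rows are bit-identical; only image-token rows moved.
```

Masking gradients in the output head this way keeps the trainable footprint tiny relative to the full model, which is one plausible reading of "parameter-efficient" here.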

Ethan Chern, Jiadi Su, Yan Ma, Pengfei Liu • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Captioning | MS COCO Karpathy (test) | CIDEr | 0.1507 | 682
Text-to-Image Generation | MS-COCO | - | - | 131
Image Generation Evaluation | TCC | A@1 | 97.7 | 60
Image Generation Evaluation | Image Generation Evaluation (ITS) | A@1 | 93.6 | 60
Image Generation Evaluation | ICC | A@1 | 93.7 | 60
Image Generation Evaluation | Image Generation Evaluation (IQ) | A@1 | 92.9 | 60
Interleaved Image-Text Generation | InterSyn (test) | TCC | 3.09 | 28
Text-to-Image In-Context Learning | T2IFMIT (Text-to-Image Fast Mini-ImageNet) | Accuracy | 11 | 18
Interleaved Image-Text Generation | OpenING | FDT | 58.38 | 15
Interleaved Image-Text Generation | WeaverBench | FDT | 49.37 | 15

Showing 10 of 23 rows
