
Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid

About

Recently, scaling images to high resolution has received much attention in multimodal large language models (MLLMs). Most existing practices adopt a sliding-window-style cropping strategy to adapt to the increased resolution. Such a cropping strategy, however, can easily cut off objects and connected regions, which introduces semantic discontinuity and therefore impedes MLLMs from recognizing small or irregularly shaped objects or text, leading to a phenomenon we call the semantic sawtooth effect. This effect is particularly evident in lightweight MLLMs. To address this issue, we introduce a Complementary Image Pyramid (CIP), a simple, effective, and plug-and-play solution designed to mitigate semantic discontinuity during high-resolution image processing. In particular, CIP dynamically constructs an image pyramid that provides complementary semantic information for cropping-based MLLMs, enabling them to acquire rich semantics at all levels. Furthermore, we introduce a Scale Compression Mechanism (SCM) to reduce the additional computational overhead by compressing redundant visual tokens. Our experiments demonstrate that CIP consistently enhances performance across diverse architectures (e.g., MiniCPM-V-2, InternVL2, and LLaVA-OneVision), model capacities (1B$\rightarrow$8B), and usage configurations (training-free and fine-tuning). Leveraging the proposed CIP and SCM, we introduce a lightweight MLLM, Mini-Monkey, which achieves remarkable performance in both general multimodal understanding and document understanding. On OCRBench, the 2B-version Mini-Monkey even surpasses the 8B model InternVL2-8B by 12 points. Additionally, training Mini-Monkey is cheap, requiring only eight RTX 3090 GPUs. The code is available at https://github.com/Yuliang-Liu/Monkey.
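To make the two ideas concrete, here is a minimal, self-contained sketch. The tile size of 448, the 1/2/4 grid levels, and the score-based top-k token selection are illustrative assumptions for exposition, not the paper's exact recipe for CIP or SCM: the pyramid function shows how coarse levels keep objects whole that fine-grained sliding-window crops may cut apart, and the compression function shows one common way to drop redundant visual tokens.

```python
def pyramid_tiles(width, height, levels=(1, 2, 4)):
    """Compute crop boxes for a simple complementary image pyramid.

    For each level n, the image is split into an n x n grid of tiles.
    The coarse levels (n=1, n=2) provide global/intermediate views that
    preserve objects the finest grid may slice through.
    Returns tuples of (level, left, top, right, bottom).
    """
    boxes = []
    for n in levels:
        tile_w, tile_h = width / n, height / n
        for row in range(n):
            for col in range(n):
                boxes.append((n,
                              round(col * tile_w), round(row * tile_h),
                              round((col + 1) * tile_w), round((row + 1) * tile_h)))
    return boxes


def compress_tokens(tokens, scores, keep_ratio=0.5):
    """Illustrative token compression: keep the top-k tokens by a saliency
    score (hypothetical; the paper's SCM may score tokens differently),
    preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top)]


# Example: an 896x896 image yields 1 + 4 + 16 = 21 complementary crops.
print(len(pyramid_tiles(896, 896)))  # 21
```

Halving the token count with `compress_tokens` roughly halves the LLM-side cost of the extra pyramid crops, which is the trade-off SCM targets.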

Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MathVista | Score | 45.3 | 322 |
| OCR Evaluation | OCRBench | Score | 794 | 296 |
| Multimodal Capability Evaluation | MM-Vet | Score | 39.8 | 282 |
| Chart Question Answering | ChartQA | Accuracy | 76.5 | 229 |
| Multi-discipline Multimodal Understanding | MMMU (val) | -- | -- | 167 |
| Document Visual Question Answering | DocVQA | ANLS | 87.4 | 164 |
| Diagram Understanding | AI2D (test) | Accuracy | 73.7 | 107 |
| Multimodal Reasoning | MMStar | -- | -- | 81 |
| Hallucination and Visual Reasoning Evaluation | HallusionBench | Score | 30.9 | 37 |
| Multimodal Optical Character Recognition | OCRBench | Recognition Score | 250 | 34 |
(10 of 14 benchmark rows shown.)
