
Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid

About

Recently, scaling images to high resolution has received much attention in multimodal large language models (MLLMs). Most existing practices adopt a sliding-window-style cropping strategy to adapt to the increased resolution. Such a cropping strategy, however, can easily cut off objects and connected regions, which introduces semantic discontinuity and therefore impedes MLLMs from recognizing small or irregularly shaped objects or text, leading to a phenomenon we call the semantic sawtooth effect. This effect is particularly evident in lightweight MLLMs. To address this issue, we introduce a Complementary Image Pyramid (CIP), a simple, effective, and plug-and-play solution designed to mitigate semantic discontinuity during high-resolution image processing. In particular, CIP dynamically constructs an image pyramid that provides complementary semantic information for cropping-based MLLMs, enabling them to acquire rich semantics at all levels. Furthermore, we introduce a Scale Compression Mechanism (SCM) to reduce the additional computational overhead by compressing redundant visual tokens. Our experiments demonstrate that CIP consistently enhances performance across diverse architectures (e.g., MiniCPM-V-2, InternVL2, and LLaVA-OneVision), model capacities (1B$\rightarrow$8B), and usage configurations (training-free and fine-tuning). Leveraging the proposed CIP and SCM, we introduce a lightweight MLLM, Mini-Monkey, which achieves remarkable performance in both general multimodal understanding and document understanding. On OCRBench, the 2B-version Mini-Monkey even surpasses the 8B model InternVL2-8B by 12 points. Additionally, training Mini-Monkey is cheap, requiring only eight RTX 3090 GPUs. The code is available at https://github.com/Yuliang-Liu/Monkey.
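To make the two ideas concrete, here is a minimal, self-contained sketch. The tile size of 448, the 1/2/4 grid levels, and the score-based top-k token selection are illustrative assumptions for exposition, not the paper's exact recipe for CIP or SCM: the pyramid function shows how coarse levels keep objects whole that fine-grained sliding-window crops may cut apart, and the compression function shows one common way to drop redundant visual tokens.

```python
def pyramid_tiles(width, height, levels=(1, 2, 4)):
    """Compute crop boxes for a simple complementary image pyramid.

    For each level n, the image is split into an n x n grid of tiles.
    The coarse levels (n=1, n=2) provide global/intermediate views that
    preserve objects the finest grid may slice through.
    Returns tuples of (level, left, top, right, bottom).
    """
    boxes = []
    for n in levels:
        tile_w, tile_h = width / n, height / n
        for row in range(n):
            for col in range(n):
                boxes.append((n,
                              round(col * tile_w), round(row * tile_h),
                              round((col + 1) * tile_w), round((row + 1) * tile_h)))
    return boxes


def compress_tokens(tokens, scores, keep_ratio=0.5):
    """Illustrative token compression: keep the top-k tokens by a saliency
    score (hypothetical; the paper's SCM may score tokens differently),
    preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top)]


# Example: an 896x896 image yields 1 + 4 + 16 = 21 complementary crops.
print(len(pyramid_tiles(896, 896)))  # 21
```

Halving the token count with `compress_tokens` roughly halves the LLM-side cost of the extra pyramid crops, which is the trade-off SCM targets.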

Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MathVista | Score | 45.3 | 322 |
| OCR Evaluation | OCRBench | Score | 794 | 296 |
| Multimodal Capability Evaluation | MM-Vet | Score | 39.8 | 282 |
| Chart Question Answering | ChartQA | Accuracy | 76.5 | 229 |
| Multi-discipline Multimodal Understanding | MMMU (val) | -- | -- | 167 |
| Document Visual Question Answering | DocVQA | ANLS | 87.4 | 164 |
| Diagram Understanding | AI2D (test) | Accuracy | 73.7 | 107 |
| Multimodal Reasoning | MMStar | -- | -- | 81 |
| Hallucination and Visual Reasoning Evaluation | HallusionBench | Score | 30.9 | 37 |
| Multimodal Optical Character Recognition | OCRBench | Recognition Score | 250 | 34 |
(10 of 14 benchmark rows shown.)
