
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

About

We introduce FLARE, a family of vision-language models (VLMs) built on a full vision-language alignment and integration paradigm. Unlike existing approaches that rely on a single MLP projector for modality alignment and defer cross-modal interaction to LLM decoding, FLARE achieves deep, dynamic integration throughout the pipeline. Our key contributions are: (1) Text-Guided Vision Encoding, which incorporates textual information during vision encoding to achieve pixel-level alignment; (2) Context-Aware Alignment Decoding, which aggregates visual features conditioned on textual context during decoding for query-level integration; (3) a Dual-Semantic Mapping Loss, which supervises feature mapping from both modalities and enables modality-level bridging; and (4) Text-Driven VQA Synthesis, which uses high-quality text to generate VQA pairs and synthesize corresponding images, enabling data-level optimization. We train FLARE at 3B and 8B scales under both fixed and dynamic resolution settings, demonstrating that our full-modality alignment significantly outperforms existing methods while maintaining strong generalizability. FLARE 3B surpasses Cambrian-1 8B and Florence-VL 8B using only 630 vision tokens. Ablation studies show that FLARE outperforms existing methods with minimal computational cost. Even without dynamic resolution, FLARE outperforms LLaVA-NeXT, validating the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FLARE.
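To make the idea of aggregating visual features conditioned on textual context more concrete, the following is a minimal sketch of text-conditioned attention pooling. It is not the authors' implementation: the function names `softmax` and `text_conditioned_pool`, the single pooled text query, and the use of plain scaled dot-product attention are all assumptions chosen for illustration; the released code should be consulted for the actual method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_conditioned_pool(text_query, visual_feats):
    """Aggregate visual features using weights derived from a text query.

    Hypothetical helper for illustration only.
    text_query:   list of d floats (an assumed pooled text-context vector)
    visual_feats: list of n vectors, each a list of d floats
    Returns (pooled, weights): a length-d vector and the n attention weights.
    """
    d = len(text_query)
    scale = 1.0 / math.sqrt(d)
    # Scaled dot-product score between the text query and each visual feature.
    scores = [
        scale * sum(q * v for q, v in zip(text_query, feat))
        for feat in visual_feats
    ]
    weights = softmax(scores)
    # Weighted sum of the visual features, one output value per dimension.
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, visual_feats))
        for i in range(d)
    ]
    return pooled, weights


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    pooled, weights = text_conditioned_pool([2.0, 0.0], visual)
    print("weights:", weights)
    print("pooled:", pooled)
```

In this toy setup, visual features that align with the text query receive larger weights, so the pooled vector changes as the query changes. This is the general sense in which the aggregation is "conditioned on textual context"; the real model presumably uses learned projections and multiple query tokens.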

Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Multimodal Understanding | MMBench | -- | 847 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 62.8 | 631 |
| Visual Question Answering | ChartQA | Accuracy: 83.4 | 519 |
| Mathematical Reasoning | MathVista | Score: 61.1 | 474 |
| Multimodal Understanding | MMStar | -- | 407 |
| OCR Evaluation | OCRBench | -- | 350 |
| Visual Question Answering | AI2D | Accuracy: 83.6 | 317 |
| Multimodal Understanding | MMMU | MMMU Score: 46.3 | 232 |
| Visual Question Answering | TextVQA | TextVQA Accuracy: 79.7 | 210 |
Showing 10 of 17 rows
