Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

A Token-level Text Image Foundation Model for Document Understanding

About

In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding and reasoning with images containing small and dense texts. To bridge this gap, we develop TokenOCR, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be available at https://github.com/Token-family/TokenFD.

Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang• 2025

Related benchmarks

TaskDatasetResultRank
Text-based Visual Question AnsweringTextVQA
Accuracy79.3
807
Chart Question AnsweringChartQA
Accuracy86.5
356
Document Visual Question AnsweringDocVQA
ANLS93.8
263
Infographic Question AnsweringInfoVQA
ANLS75.3
90
OCR Performance EvaluationOCRBench
Score86
63
Showing 5 of 5 rows

Other info

Follow for update