A Token-level Text Image Foundation Model for Document Understanding

About

In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding and reasoning with images containing small and dense texts. To bridge this gap, we develop TokenOCR, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenOCR and TokenVL. Code, datasets, and weights will be available at https://github.com/Token-family/TokenFD.

Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang • 2025

Related benchmarks

Task	Dataset	Result
Text-based Visual Question Answering	TextVQA	Accuracy 79.3	962
Chart Question Answering	ChartQA	Accuracy 86.5	371
Document Visual Question Answering	DocVQA	ANLS 93.8	301
Infographic Question Answering	InfoVQA	ANLS 75.3	117
OCR Performance Evaluation	OCRBench	Score 86	68
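
The DocVQA and InfoVQA rows report ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of how that metric is commonly defined, and not the evaluation code behind the numbers above, here is a minimal Python sketch. It assumes the usual convention: answers are lowercased and stripped, each prediction is scored against its best-matching reference, and any match whose normalized edit distance reaches a 0.5 threshold scores zero. The function names `levenshtein` and `anls` are illustrative, and the result is a fraction in [0, 1]; the table presumably shows it scaled to a percentage.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best per-reference similarity.

    predictions: list of answer strings, one per question.
    references:  list of lists of acceptable answers, one list per question.
    threshold:   assumed cutoff; distances at or above it score 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

For example, `anls(["hello", "abc"], [["hello"], ["xyz"]])` averages one exact match and one complete miss. The threshold is what separates ANLS from plain edit similarity: small OCR slips still earn partial credit, while answers that are mostly wrong earn none.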

