DiT: Self-supervised Pre-training for Document Image Transformer

About

Image Transformer has recently achieved significant progress for natural image understanding, either using supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose \textbf{DiT}, a self-supervised pre-trained \textbf{D}ocument \textbf{I}mage \textbf{T}ransformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts ever exist due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection as well as text detection for OCR. Experiment results have illustrated that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 $\rightarrow$ 92.69), document layout analysis (91.0 $\rightarrow$ 94.9), table detection (94.23 $\rightarrow$ 96.55) and text detection for OCR (93.07 $\rightarrow$ 94.29). The code and pre-trained models are publicly available at \url{https://aka.ms/msdit}.

Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei• 2022

Related benchmarks

Task	Dataset	Result
Visual Question Answering	TextVQA	Accuracy10	1453
Visual Question Answering	ChartQA	--	519
Optical Character Recognition	OCRBench	--	433
Visual Question Answering	AI2D	Accuracy49.9	317
Document Classification	RVL-CDIP (test)	Accuracy92.69	306
Document Visual Question Answering	DocVQA	ANLS11.3	301
Infographic Question Answering	InfoVQA	ANLS19.2	117
Vision Understanding	MMMU	Accuracy28.8	65
Web agent tasks	Mind2Web Cross-Task	Step Success Rate4.1	64
General Visual Understanding	RealworldQA	Accuracy43.9	62

Showing 10 of 39 rows

Other info

Code

Follow for update

@wizwand_team Discord