UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

About

In the era of Large Language Models (LLMs), tremendous strides have been made in multimodal understanding. However, existing advanced algorithms fall short of effectively utilizing the immense representation capabilities and rich world knowledge inherent to these large pre-trained models, and the beneficial connections among tasks in text-rich scenarios have not been sufficiently explored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities, which existing approaches lack. Moreover, UniDoc capitalizes on the beneficial interactions among tasks to enhance the performance of each individual task. To implement UniDoc, we perform unified multimodal instruction tuning on the contributed large-scale instruction-following datasets. Quantitative and qualitative experimental results show that UniDoc achieves state-of-the-art scores across multiple challenging benchmarks. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.

Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, Can Huang • 2023

Related benchmarks

Task	Dataset	Result
Text-based Visual Question Answering	TextVQA	Accuracy 46.2	984
Chart Question Answering	ChartQA	Accuracy 10.9	404
Document Visual Question Answering	InfoVQA	Accuracy 0.147	85
Document-oriented Visual Question Answering	DocVQA	Accuracy 7.7	84
Gait Recognition	CASIA-B LT (74 subjects) NM#5-6 (probe)	Rank-1 Accuracy (0°) 95.1	24
Scene Text-Centric Visual Question Answering	STVQA	Accuracy 0.352	20
Scene Text-Centric Visual Question Answering	OCRVQA	Accuracy 36.8	14

