Hierarchical multimodal transformers for Multi-Page DocVQA

About

Document Visual Question Answering (DocVQA) is the task of answering questions about document images. Existing work on DocVQA considers only single-page documents. In real scenarios, however, documents mostly consist of multiple pages that should be processed together. In this work, we extend DocVQA to the multi-page scenario. To that end, we first create a new dataset, MP-DocVQA, in which questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods in processing long multi-page documents. The proposed method uses a hierarchical transformer architecture in which the encoder summarizes the most relevant information of every page, and the decoder then takes this summarized information to generate the final answer. Through extensive experimentation, we demonstrate that our method can, in a single stage, answer the questions and identify the page that contains the information needed to find the answer, which can serve as a form of explainability measure.
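The hierarchical data flow described in the abstract (summarize each page on its own, then hand only the summaries to the decoder) can be illustrated with a toy sketch. This is not the authors' implementation: `summarize_page` is a hypothetical stand-in that keeps the `k` token vectors most similar to the question, whereas the actual model uses a learned T5-based encoder. The function names, the dot-product scoring, and the value of `k` are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product, used here as a toy relevance score."""
    return sum(x * y for x, y in zip(u, v))


def summarize_page(question: Vector, page_tokens: List[Vector], k: int) -> List[Vector]:
    """Hypothetical stand-in for the page encoder.

    Keeps the k token vectors that score highest against the question.
    A real encoder would produce learned summary representations instead.
    """
    ranked = sorted(page_tokens, key=lambda tok: dot(question, tok), reverse=True)
    return ranked[:k]


def build_decoder_input(question: Vector, pages: List[List[Vector]], k: int = 2) -> List[Vector]:
    """Summarize every page independently, then concatenate the summaries in page order."""
    summaries: List[Vector] = []
    for page_tokens in pages:
        summaries.extend(summarize_page(question, page_tokens, k))
    return summaries


question = [1.0, 0.0]
pages = [
    [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],              # page 0: 3 token vectors
    [[0.2, 0.8], [0.7, 0.3], [0.0, 1.0], [0.4, 0.6]],  # page 1: 4 token vectors
]
decoder_input = build_decoder_input(question, pages, k=2)
print(len(decoder_input))  # number of summary vectors passed to the decoder
```

Under these assumptions, the decoder sees `k` vectors per page rather than every token, so its input length grows with the number of pages instead of the total document length.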

Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny • 2022

Related benchmarks

Task	Dataset	Result
Document Visual Question Answering	DocVQA (test)	ANLS 89.3	292
Multi-page Document Question Answering	MP-DocVQA	ANLS 62	38
Multi-page Document Question Answering	MP-DocVQA (test)	ANLS 0.6201	30
Multi-page Document Question Answering	DUDE	ANLS 35.7	23
Multi-page Document Understanding	DUDE	ANLS 23.1	21
Document Understanding	MPDocVQA	ANLS 62	15
Long PDF Understanding	PaperPDF English 1.0	ANLS 13.5	14
Document Question Answering	DUDE	ANLS 0.3574	12
Comprehensive ESG Report Analysis	Chinese ESG Reports	Precision 45.77	11
Hierarchy Alignment	Chinese ESG Reports 50 full	TBTA 3.79	11
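Most rows above report ANLS (Average Normalized Levenshtein Similarity). As a reading aid, here is a minimal sketch of the metric as it is commonly defined for DocVQA-style benchmarks: for each question, take the best similarity between the prediction and any reference answer, set it to zero when the normalized edit distance reaches a threshold (0.5 is the usual choice), and average over questions. The lowercasing and whitespace stripping are assumptions about normalization, and official evaluation scripts may differ in detail. The table appears to mix a 0 to 1 scale (e.g. 0.6201) with a 0 to 100 scale (e.g. 62); this sketch returns values on the 0 to 1 scale.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: List[str], references: List[List[str]], threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity to any reference."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        pred_norm = pred.strip().lower()
        best = 0.0
        for gold in golds:
            gold_norm = gold.strip().lower()
            longest = max(len(pred_norm), len(gold_norm))
            nl = levenshtein(pred_norm, gold_norm) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


# One exact match (after normalization) and one complete miss.
print(anls(["Paris ", "abc"], [["paris"], ["xyz"]]))
```

The threshold means a prediction that is mostly wrong earns no partial credit, while small OCR-style deviations are only lightly penalized.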

