
Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs

About

Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using "[mask]" tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.

Xinzhong Wang, Ya Guo, Jing Li, Huan Chen, Yi Tu, Yijie Hong, Gongshen Liu, Huijia Zhu • 2026
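
The core idea is easiest to see in code. Below is a minimal sketch of mask-based parallel decoding, assuming a hypothetical masked-LM-style interface (`model`, `tokenizer`, and a "[mask]" token registered in the vocabulary); the paper's actual MLLM architecture, mask pre-training strategy, and handling of multi-token field values are not reproduced here.

```python
# Sketch only: `model` and `tokenizer` are assumed to follow a
# HuggingFace-style masked-LM interface; the paper's real method
# (MLLM backbone, visual inputs, multi-token decoding) is not shown.
import torch


def parallel_extract(model, tokenizer, prompt_with_masks: str) -> dict[int, str]:
    """Fill every [mask] slot in one forward pass, instead of
    generating each field value token-by-token autoregressively."""
    inputs = tokenizer(prompt_with_masks, return_tensors="pt")
    mask_id = tokenizer.convert_tokens_to_ids("[mask]")
    mask_positions = (inputs["input_ids"][0] == mask_id).nonzero(as_tuple=True)[0]

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    # All target positions are predicted simultaneously from one pass,
    # which is where the 5-36x speedup over autoregressive decoding comes from.
    predicted_ids = logits[0, mask_positions].argmax(dim=-1)
    return {
        int(pos): tokenizer.decode(tok_id)
        for pos, tok_id in zip(mask_positions.tolist(), predicted_ids.tolist())
    }


# Usage: each semantically independent KIE field gets its own placeholder, e.g.
#   "company: [mask] | date: [mask] | total: [mask]"
# so a single forward pass yields predictions for every field at once.
```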

Related benchmarks

Task                         Dataset           Metric   Result   Rank
Key Information Extraction   SROIE v1 (test)   ANLS     97       10
Key Information Extraction   CORD v1 (test)    ANLS     97.3     10
Key Information Extraction   FUNSD v1 (test)   ANLS     79.3     10
