HSD: Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding
About
Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must autoregressively generate long, full-page sequences when processing long-form documents. While recent hybrid methods mitigate this issue via region-level parallel decoding with VLMs, independent region decoding loses full-page context and might weaken global coherence. To address this issue, we propose Hierarchical Speculative Decoding (HSD), a two-stage local-to-global framework for document parsing. HSD first employs a lightweight pipeline drafter to predict region partitions and generate coarse drafts for each region. The first stage verifies the generated region-level drafts in parallel for efficiency, while the second stage further performs page-level verification on these refined outputs to preserve full-page coherence. Experimental results show that our HSD achieves a 2.78x near-lossless speedup with HunyuanOCR on OmniDocBench v1.5 and up to 7.04x speedup on long-document parsing tasks, demonstrating the effectiveness of our proposed method. We will release our code to facilitate reproducibility.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Document Parsing | OmniDocBench v1.5 | AAL1.96 | 6 | |
| Document Parsing | Ocean-OCR-Bench v1 (test) | AAL (Overall)7.88 | 3 | |
| Document Processing Acceleration | olmOCR-Bench long documents filtered > 8192 tokens | AAL4.08 | 2 | |
| Speculative Decoding Acceleration | OmniDocBench > 8192 tokens v1.5 | AAL (Overall)4.98 | 2 |