VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models
About
Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we identify a fundamental issue in OLLMs: Visual Interference, where models are biased towards visible text over auditory signals, causing them to hallucinate slide content that was never spoken. To address this, we propose Visually-Anchored Policy Optimization (VAPO), which reshapes the model's inference process to follow a human-like "Look-then-Listen" inference chain. Specifically, we design a temporally decoupled policy: the model first extracts visual priors in a <think> block to serve as semantic anchors, then generates the transcription in an <answer> block. The policy is optimized via multi-objective reinforcement learning. Furthermore, we introduce SlideASR-Bench, a comprehensive benchmark designed to address the scarcity of entity-rich data, comprising a large-scale synthetic corpus for training and a challenging real-world test set for evaluation. Extensive evaluations demonstrate that VAPO effectively eliminates visual interference and achieves state-of-the-art performance on SlideASR-Bench and public datasets, significantly reducing entity recognition errors in specialized domains.
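The two-block output format described above can be illustrated with a small parser. This is a minimal sketch, not the authors' implementation: the closing tags `</think>` and `</answer>`, the names `parse_output` and `format_reward`, and the idea of a binary format score are all assumptions made for illustration; the abstract only states that the model emits a <think> block followed by an <answer> block.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: exactly one <think>...</think> block followed by
# exactly one <answer>...</answer> block, with nothing else around them.
_PATTERN = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)


class ParsedOutput(NamedTuple):
    visual_priors: str   # text the model extracted from the slide
    transcription: str   # the final transcript


def parse_output(text: str) -> Optional[ParsedOutput]:
    """Split a model response into its two blocks, or return None if malformed."""
    # Reject responses that repeat either opening tag.
    if text.count("<think>") != 1 or text.count("<answer>") != 1:
        return None
    match = _PATTERN.match(text)
    if match is None:
        return None
    return ParsedOutput(match.group("think").strip(), match.group("answer").strip())


def format_reward(text: str) -> float:
    """Hypothetical binary score: 1.0 for a well-formed response, else 0.0."""
    return 1.0 if parse_output(text) is not None else 0.0
```

A score like `format_reward` could plausibly serve as one term among several in a multi-objective reward, but how the actual objectives are defined and weighted is not stated in this abstract.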
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| SlideASR | SlideSpeech (dev) | WER | 8.62 | 16 |
| SlideASR | SlideSpeech (test) | WER | 10.31 | 16 |
| SlideASR | ChineseLips | CER | 1.298 | 15 |
| Automatic Speech Recognition | SlideASR S (en) 1.0 | WER | 4.6 | 14 |
| Automatic Speech Recognition | SlideASR S (zh) 1.0 | WER | 2.13 | 14 |
| Automatic Speech Recognition | SlideASR 1.0 (R) | NE-WER | 26.48 | 14 |
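
For readers unfamiliar with the metrics in the table, the following sketch shows how word error rate (WER) and character error rate (CER) are conventionally computed from a Levenshtein edit distance. It assumes whitespace tokenization and no text normalization, which may differ from the scoring used for the numbers above; the function names are illustrative. NE-WER is not defined on this page, so it is not implemented here.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction of reference words (multiply by 100 for percent)."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces; typically used for Chinese text."""
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, list(hypothesis.replace(" ", ""))) / len(ref)
```

For example, `wer("the cat sat on mat", "the cat sit on")` counts one substitution and one deletion against five reference words.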