VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models

About

Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we identify a fundamental issue with OLLMs, Visual Interference: models are biased towards visible text over auditory signals, causing them to hallucinate slide content that was never spoken. To address this, we propose Visually-Anchored Policy Optimization (VAPO), which reshapes the model's inference process to follow a human-like "Look-then-Listen" inference chain. Specifically, we design a temporally decoupled policy: the model first extracts visual priors in a <think> block to serve as semantic anchors, then generates the transcription in an <answer> block. The policy is optimized via multi-objective reinforcement learning. Furthermore, we introduce SlideASR-Bench, a comprehensive benchmark designed to address the scarcity of entity-rich data, comprising a large-scale synthetic corpus for training and a challenging real-world test set for evaluation. Extensive evaluations demonstrate that VAPO effectively eliminates visual interference and achieves state-of-the-art performance on SlideASR-Bench and public datasets, significantly reducing entity recognition errors in specialized domains.
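To make the two-stage output format concrete, the following is a minimal sketch of how a response with a <think> block followed by an <answer> block could be split into its parts. The closing tags (`</think>`, `</answer>`), the function name `parse_output`, and the field names are assumptions made for illustration; the abstract does not specify the exact output syntax or how the authors parse it.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: exactly one <think>...</think> block followed by one
# <answer>...</answer> block. The closing tags are hypothetical.
_PATTERN = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)


class ParsedOutput(NamedTuple):
    visual_priors: str   # text extracted from the slide ("Look")
    transcription: str   # final speech transcription ("Listen")


def parse_output(text: str) -> Optional[ParsedOutput]:
    """Return the two blocks, or None if the layout is not respected."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    return ParsedOutput(
        visual_priors=match.group("think").strip(),
        transcription=match.group("answer").strip(),
    )
```

A response that omits the <think> block, or places it after the <answer> block, yields `None` under this sketch, which is one simple way to detect outputs that do not follow the "Look-then-Listen" order.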

Rui Hu, Delai Qiu, Yining Wang, Shengping Liu, Jitao Sang • 2025

Related benchmarks

Task	Dataset	Result
SlideASR	SlideSpeech (dev)	WER 8.62	16
SlideASR	SlideSpeech (test)	WER 10.31	16
SlideASR	ChineseLips	CER 1.298	15
Automatic Speech Recognition	SlideASR S (en) 1.0	WER 4.6	14
Automatic Speech Recognition	SlideASR S (zh) 1.0	WER 2.13	14
Automatic Speech Recognition	SlideASR 1.0 (R)	NE-WER 26.48	14
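For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below shows that standard definition; it is not the evaluation script behind the numbers above, the function name `word_error_rate` is hypothetical, and text normalization (casing, punctuation) is left out. It returns a fraction, whereas the table values are presumably percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Character error rate (CER) follows the same recurrence applied to characters instead of whitespace-separated words.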

