
VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance

About

Visual speech recognition (VSR) aims to transcribe spoken content from silent lip-motion videos and is particularly challenging in Mandarin due to severe viseme ambiguity and pervasive homophones. We propose VALLR-Pin, a two-stage Mandarin VSR framework that extends the VALLR architecture by explicitly incorporating Pinyin as an intermediate representation. In the first stage, a shared visual encoder feeds dual decoders that jointly predict Mandarin characters and their corresponding Pinyin sequences, encouraging more robust visual-linguistic representations. In the second stage, an LLM-based refinement module takes the predicted Pinyin sequence together with an N-best list of character hypotheses to resolve homophone-induced ambiguities. To further adapt the LLM to visual recognition errors, we fine-tune it on synthetic instruction data constructed from model-generated Pinyin-text pairs, enabling error-aware correction. Experiments on public Mandarin VSR benchmarks demonstrate that VALLR-Pin consistently improves transcription accuracy under multi-speaker conditions, highlighting the effectiveness of combining phonetic guidance with lightweight LLM refinement.
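The second stage described above can be illustrated with a toy rescoring sketch: given the Pinyin decoder's prediction and an N-best list of character hypotheses with decoder scores, re-rank hypotheses by phonetic agreement. This is a minimal sketch of the general idea only; the character-to-Pinyin table, the `alpha` weight, and the linear score combination are illustrative assumptions, not the paper's method (which uses an LLM for refinement).

```python
# Toy sketch: use a predicted Pinyin sequence to re-rank an N-best list
# of Mandarin character hypotheses. Illustrative only; the paper's actual
# second stage feeds Pinyin + N-best hypotheses to a fine-tuned LLM.

# Hypothetical character -> Pinyin lookup (tones omitted for simplicity).
PINYIN = {"他": "ta", "打": "da", "去": "qu", "书": "shu"}

def pinyin_of(chars):
    """Map each character to its (toneless) Pinyin syllable."""
    return [PINYIN.get(c, "?") for c in chars]

def rerank(nbest, predicted_pinyin, alpha=0.5):
    """Pick the hypothesis maximizing a mix of decoder score and
    Pinyin agreement.

    nbest: list of (characters, decoder_score) pairs.
    predicted_pinyin: syllables from the Pinyin decoder.
    alpha: weight on phonetic agreement (assumed hyperparameter).
    """
    def agreement(chars):
        hyp = pinyin_of(chars)
        matches = sum(a == b for a, b in zip(hyp, predicted_pinyin))
        return matches / max(len(predicted_pinyin), 1)

    return max(nbest, key=lambda h: (1 - alpha) * h[1] + alpha * agreement(h[0]))

# Example: the decoder slightly prefers a visually confusable but
# phonetically wrong hypothesis; Pinyin guidance flips the choice.
nbest = [("打去书", 0.45), ("他去书", 0.40)]
best = rerank(nbest, ["ta", "qu", "shu"])  # -> ("他去书", 0.40)
```

Here the first hypothesis scores higher visually but disagrees with the predicted Pinyin on the first syllable ("da" vs "ta"), so the phonetically consistent hypothesis wins after rescoring.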

Chang Sun, Dongliang Xie, Wanpeng Xie, Bo Qin, Hong Yang • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Speech Recognition | CNVSRC-Multi Mandarin (dev) | CER | 24.1 | 6
Visual Speech Recognition | Self-Collected Dataset Mandarin (test) | CER | 32.22 | 4
