NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

About

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 1.19 | 1207 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 2.53 | 1206 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER 2.45 | 486 |
| Automatic Speech Recognition | AISHELL-1 (test) | CER 0.57 | 105 |
| Speech Recognition | LibriSpeech clean (dev) | WER 0.0113 | 104 |
| Automatic Speech Recognition | AISHELL-1 (dev) | CER 0.43 | 57 |
| Speech Recognition | VoxPopuli (test) | WER 6.08 | 52 |
| Automatic Speech Recognition | KeSpeech | CER 4.4 | 35 |
| Automatic Speech Recognition | AISHELL-2 (test_ios) | CER 2.43 | 35 |
| Automatic Speech Recognition | WenetSpeech (meeting) | -- | 23 |
Showing 10 of 30 rows
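
The WER and CER figures above are edit-distance error rates. As a reading aid, here is a minimal sketch of the standard definitions: the word-level (or character-level) Levenshtein distance divided by the reference length. The function names are illustrative and not taken from the paper. The sketch also assumes results are expressed as percentages and does no text normalization (casing, punctuation), which real evaluation pipelines usually apply.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (ref token missing in hyp)
                curr[j - 1] + 1,     # insertion (extra hyp token)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, ignoring spaces (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))
```

CER is typically used for Mandarin datasets such as AISHELL, where word boundaries are not marked by whitespace, while WER is used for English sets such as LibriSpeech.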
