NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

About

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu • 2026

Related benchmarks

Task	Dataset	Result	(unlabeled)
Automatic Speech Recognition	LibriSpeech (test-other)	WER 2.53	1447
Automatic Speech Recognition	LibriSpeech clean (test)	WER 1.19	1410
Automatic Speech Recognition	LibriSpeech (dev-other)	WER 2.45	535
Automatic Speech Recognition	AISHELL-1 (test)	CER 0.57	177
Speech Recognition	LibriSpeech clean (dev)	WER 0.0113	125
Automatic Speech Recognition	AISHELL-1 (dev)	CER 0.43	66
Speech Recognition	VoxPopuli (test)	WER 6.08	52
Automatic Speech Recognition	KeSpeech	CER 4.4	35
Automatic Speech Recognition	AISHELL-2 (test_ios)	CER 2.43	35
Automatic Speech Recognition	WenetSpeech (meeting)	--	23

Showing 10 of 30 rows
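
The WER and CER figures in the table are edit-distance error rates: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the evaluation code used for these results. The function names are illustrative, and it skips the text normalization (casing, punctuation, number formatting) that benchmark scoring normally applies, so it will not reproduce the reported numbers exactly. Most rows appear to be percentages, while a value such as 0.0113 looks like a raw fraction; that reading is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, splitting on whitespace."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate in percent, ignoring spaces."""
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    hyp = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.2f}")
    # One substituted character over 6 reference characters.
    print(f"CER: {cer('今天天气很好', '今天天汽很好'):.2f}")
```

CER is typically used for the Mandarin datasets (AISHELL, KeSpeech, WenetSpeech) because word boundaries are not marked by spaces, while WER is used for the English ones.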
