NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
About
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 1.19 | 1207 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 2.53 | 1206 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER 2.45 | 486 |
| Automatic Speech Recognition | AISHELL-1 (test) | CER 0.57 | 105 |
| Speech Recognition | LibriSpeech clean (dev) | WER 0.0113 | 104 |
| Automatic Speech Recognition | AISHELL-1 (dev) | CER 0.43 | 57 |
| Speech Recognition | VoxPopuli (test) | WER 6.08 | 52 |
| Automatic Speech Recognition | KeSpeech | CER 4.4 | 35 |
| Automatic Speech Recognition | AISHELL-2 (test_ios) | CER 2.43 | 35 |
| Automatic Speech Recognition | WenetSpeech (meeting) | -- | 23 |
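
For readers unfamiliar with the metrics in the table: word error rate (WER) and character error rate (CER) are conventionally defined as the edit distance between the reference and the hypothesis (substitutions + deletions + insertions) divided by the reference length, counted in words or characters respectively. The sketch below is a minimal illustration of that standard definition, not the scoring script behind these results. The function names are hypothetical, and text normalization (casing, punctuation, number formatting), which strongly affects reported scores, is assumed to have been applied beforehand. Results may be reported as a fraction or as a percentage; the table does not state which scale each entry uses.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (ref token missing in hyp)
                curr[j - 1] + 1,     # insertion (extra hyp token)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction; tokens are whitespace-separated words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate as a fraction; spaces are ignored (an assumption)."""
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, list(hypothesis.replace(" ", ""))) / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.2%}")  # 33.33%
    # One substituted character over six characters.
    print(f"{cer('今天天气很好', '今天天汽很好'):.2%}")  # 16.67%
```

Because the denominator is the reference length, both rates can exceed 100% when the hypothesis contains many insertions, which is one way hallucinated output shows up in these metrics.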