OSUM-Pangu: An Open-Source Multidimension Speech Understanding Foundation Model Built upon OpenPangu on Ascend NPUs
About
Recent advances in Speech Large Language Models have significantly improved multi-dimensional speech understanding. However, most high-performance frameworks are optimized for GPU-centric ecosystems and proprietary backbones, which leaves a significant gap for deployment on non-CUDA computing infrastructure. In this paper, we present OSUM-Pangu, a fully open-source speech understanding foundation model developed on a completely non-CUDA software and hardware stack. By integrating an audio encoder with the openPangu-7B LLM backbone, we implement the entire training and inference pipeline on the Ascend NPU platform. To support efficient task alignment under non-CUDA resource constraints, we adopt a practical training process that sequentially bridges speech perception and user intent recognition. Experimental results show that OSUM-Pangu achieves task accuracy comparable to mainstream GPU-based models while maintaining robust natural language interaction capabilities. Our work provides a reproducible, non-CUDA baseline for the open-source speech community, promoting the independent evolution of multimodal intelligence.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech Other | WER | 8.36 | 96 |
| Automatic Speech Recognition | LibriSpeech Clean | WER | 3.51 | 80 |
| Emotion Recognition | MELD (test) | -- | -- | 28 |
| Automatic Speech Recognition | WenetSpeech (meeting) | WER | 10.49 | 23 |
| Speech-to-Text Question-Answering | TriviaQA | Accuracy | 28.9 | 23 |
| Speech-to-Text Question-Answering | WebQ | Accuracy | 29.5 | 23 |
| Speech-to-Text Question-Answering | LlamaQ | Accuracy | 44.6 | 23 |
| Automatic Speech Recognition | AISHELL-2 mic | CER | 3.01 | 12 |
| Automatic Speech Recognition | AISHELL-2 i (iOS) | WER | 2.98 | 6 |
| Age Classification | Common Voice (test) | Accuracy | 83.31 | 5 |
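
The WER and CER figures in the table above are edit-distance error rates: the number of substitutions, deletions, and insertions needed to turn the model output into the reference, divided by the reference length and expressed as a percentage. The following is a minimal sketch of that standard definition, not the evaluation script behind these results. The function names `edit_distance` and `error_rate` are hypothetical, and the sketch assumes no text normalization (casing, punctuation), which real benchmarks usually apply before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (ref token missing from hyp)
                curr[j - 1] + 1,     # insertion (extra token in hyp)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Return WER (unit="word") or CER (unit="char") as a percentage."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        # Assumption: whitespace is ignored when scoring at character level.
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(error_rate("the cat sat on the mat", "the cat sat on mat"))
    # One dropped character out of six reference characters.
    print(error_rate("今天天气很好", "今天天气好", unit="char"))
```

Word-level scoring is the usual choice for English sets such as LibriSpeech, while character-level scoring is common for Mandarin sets such as AISHELL-2, since Chinese text has no whitespace word boundaries.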