
SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing

About

The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and offer limited in-depth support for speech, audio, and music. This gap hinders the development of audio-language models and forces researchers to spend considerable effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. It also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have reached or are nearing state-of-the-art performance, and several of the underlying techniques have been published in academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually advancing audio-based MLLMs through this open-source framework, and we call on the community to contribute to LLM-based speech, audio, and music processing.
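The modular design described above, in which an encoder, a projector, an LLM backbone, and an optional parameter-efficient fine-tuning plugin are composed per task, can be illustrated with a small configuration sketch. This is a hypothetical, simplified illustration of the composition pattern only, not SLAM-LLM's actual API; all class, field, and model names below are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical config illustrating the modular encoder / projector / LLM /
# PEFT composition described above. Names are NOT SLAM-LLM's real API.
@dataclass
class MLLMConfig:
    encoder: str = "whisper-large-v2"   # speech/audio/music encoder
    projector: str = "linear"           # maps encoder features into LLM space
    llm: str = "vicuna-7b"              # backbone large language model
    peft: Optional[str] = "lora"        # parameter-efficient fine-tuning plugin
    task: str = "asr"                   # e.g. asr, aac, mc

def describe(cfg: MLLMConfig) -> str:
    """Render the pipeline a given config assembles, stage by stage."""
    stages = [cfg.encoder, f"{cfg.projector}-projector", cfg.llm]
    if cfg.peft:
        stages[-1] += f"+{cfg.peft}"
    return " -> ".join(stages)

# Swapping one field yields a pipeline for a different task, no code changes.
asr = MLLMConfig()
aac = MLLMConfig(encoder="eat-base", task="aac")
print(describe(asr))  # whisper-large-v2 -> linear-projector -> vicuna-7b+lora
print(describe(aac))  # eat-base -> linear-projector -> vicuna-7b+lora
```

The point of the sketch is that each recipe in such a framework is just a different assignment of these slots, so swapping the encoder or the PEFT plugin does not require touching the training loop.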

Ziyang Ma, Guanrou Yang, Wenxi Chen, Zhifu Gao, Yexing Du, Xiquan Li, Zhisheng Zheng, Haina Zhu, Jianheng Zhuo, Zheshu Song, Ruiyang Xu, Tianrui Wang, Yifan Yang, Yanqiao Zhu, Zhikang Niu, Liumeng Xue, Yinghao Ma, Ruibin Yuan, Shiliang Zhang, Kai Yu, Eng Siong Chng, Xie Chen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 2.99 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 1.8 | 833 |
| Visual Speech Recognition | LRS3 (test) | WER | 28.3 | 159 |
| Audio Captioning | AudioCaps (test) | CIDEr | 70.5 | 140 |
| Speech Translation | CoVoST-2 (test) | Avg BLEU (15 Dir) | 35.7 | 46 |
| Speech Translation | MuST-C (test) | BLEU Score | 16.9 | 29 |
| Audio Captioning | Clotho (test) | METEOR | 18.2 | 21 |
| Automatic Speech Recognition | SlideSpeech S95/L95 (dev) | WER | 8.3 | 12 |
| Automatic Speech Recognition | SlideSpeech S95/L95 (test) | WER | 8.46 | 12 |
| Automated Audio Captioning | Clotho (evaluation) | SPIDEr | 33.2 | 10 |

Showing 10 of 18 rows
