SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
About
The recent surge of open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, gives artificial intelligence developers and researchers a convenient starting point. However, most of these frameworks treat vision as the main input modality and offer limited in-depth support for speech, audio, and music. This hinders the development of audio-language models and forces researchers to spend considerable effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. It also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are approaching state-of-the-art performance, and several of the underlying techniques have been published as academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually advancing audio-based MLLMs through this open-source framework, and we call on the community to contribute to LLM-based speech, audio, and music processing.
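The modular encoder/projector/LLM design described above can be pictured with a minimal, self-contained PyTorch sketch. This is not the SLAM-LLM API; all class names, dimensions, and the downsampling scheme below are illustrative assumptions meant only to show how audio features flow from a (typically frozen) encoder through a trainable projector into the embedding space of a decoder-only LLM.

```python
# Hypothetical sketch (not the SLAM-LLM API): encoder -> projector -> LLM composition.
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    """Placeholder for a frozen speech/audio encoder (e.g. a Whisper- or HuBERT-style model)."""
    def __init__(self, feat_dim=80, enc_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, enc_dim), nn.GELU())

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        return self.net(feats)           # -> (batch, frames, enc_dim)


class LinearProjector(nn.Module):
    """Maps encoder features into the LLM embedding space; often the main trainable component."""
    def __init__(self, enc_dim=512, llm_dim=4096, downsample=5):
        super().__init__()
        self.downsample = downsample     # stack adjacent frames to shorten the sequence
        self.proj = nn.Linear(enc_dim * downsample, llm_dim)

    def forward(self, x):                # x: (batch, frames, enc_dim)
        b, t, d = x.shape
        t = t - t % self.downsample      # drop trailing frames that do not fill a group
        x = x[:, :t, :].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(x)              # -> (batch, frames / downsample, llm_dim)


class ToyLLM(nn.Module):
    """Stand-in for a decoder-only LLM consuming projected audio embeddings (plus text embeddings)."""
    def __init__(self, llm_dim=4096, vocab=32000):
        super().__init__()
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, inputs_embeds):
        return self.lm_head(self.decoder(inputs_embeds))


if __name__ == "__main__":
    encoder, projector, llm = SpeechEncoder(), LinearProjector(), ToyLLM()
    for p in encoder.parameters():       # encoder (and usually the LLM) stays frozen;
        p.requires_grad = False          # only the projector / PEFT adapters are trained
    feats = torch.randn(2, 100, 80)      # fbank-like features: (batch, frames, mel bins)
    audio_embeds = projector(encoder(feats))
    logits = llm(audio_embeds)           # in practice, text prompt embeddings are concatenated too
    print(logits.shape)                  # torch.Size([2, 20, 32000])
```

In an actual recipe, the placeholder modules would be swapped for a pretrained speech/audio encoder and a pretrained LLM (optionally with LoRA-style adapters), while the projector remains the lightweight bridge between the two.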
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 2.99 | 966 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 1.8 | 833 |
| Visual Speech Recognition | LRS3 (test) | WER | 28.3 | 159 |
| Audio Captioning | AudioCaps (test) | CIDEr | 70.5 | 140 |
| Speech Translation | CoVoST-2 (test) | Avg. BLEU (15 directions) | 35.7 | 46 |
| Speech Translation | MuST-C (test) | BLEU | 16.9 | 29 |
| Audio Captioning | Clotho (test) | METEOR | 18.2 | 21 |
| Automatic Speech Recognition | SlideSpeech S95/L95 (dev) | WER | 8.3 | 12 |
| Automatic Speech Recognition | SlideSpeech S95/L95 (test) | WER | 8.46 | 12 |
| Automated Audio Captioning | Clotho (evaluation) | SPIDEr | 33.2 | 10 |