Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

About

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all three modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro • 2025

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech (test-other)	WER 3.13	1447
Automatic Speech Recognition	LibriSpeech clean (test)	WER 1.57	1410
Audio Captioning	AudioCaps (test)	CIDEr 0.79	222
Speaker Verification	VoxCeleb1 (Vox1-O)	--	160
Emotion Recognition in Conversation	MELD (test)	Weighted F1 45.14	152
Automatic Speech Recognition	LibriSpeech Other	WER 3.13	140
Automatic Speech Recognition	LibriSpeech Clean	WER 1.57	124
Multimodal Understanding	MMMU	MMMU Score 72.42	110
Emotion Recognition	MELD (test)	--	89
Audio-Visual Question Answering	AVQA	Accuracy 64.3	85

Showing 10 of 222 rows
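
For readers unfamiliar with the WER metric in the rows above: word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model transcript, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the evaluation code behind these results; the function name `word_error_rate` is hypothetical, text is split on whitespace with no normalization, and it is an assumption (not stated on this page) that the table reports WER as a percentage.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: real ASR evaluations usually normalize text
    (casing, punctuation) before scoring; this sketch does not.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"{100 * wer:.2f}")  # prints 16.67
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.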
