Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
About
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all three modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to perform chain-of-thought-style reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) for inputs of up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained only on open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Captioning | AudioCaps (test) | CIDEr | 0.79 | 140 |
| Automatic Speech Recognition | LibriSpeech Other | WER | 3.13 | 75 |
| Audio Understanding | MMAU v05.15.25 (test-mini) | Sound Score | 79.58 | 28 |
| Audio Understanding | MMAU v05.15.25 (test) | Sound Score | 79.58 | 28 |
| Audio Understanding | MMAU (test) | Speech Score | 66.37 | 25 |
| Audio Understanding | MMAR (test) | Performance | 58.5 | 20 |
| Long-Form Speech Understanding | AudioMarathon 1.0 (test) | Average Score | 60.6 | 16 |
| Audio Reasoning | MMAU mini 1.0 (test) | Sound Score | 79.58 | 15 |
| Audio Understanding | MMSU (test) | Overall Score | 61.4 | 15 |
| Speech Content Extraction | AudioMarathon 1.0 (test) | SER | 21.7 | 15 |
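
For readers unfamiliar with the WER metric in the LibriSpeech row, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance between a reference transcript and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, the result is returned as a percentage on the assumption that the table reports WER that way, and no text normalization is applied. This is not the evaluation code used for AF3.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: word-level edit distance / reference length.

    Illustrative only: real evaluations usually normalize casing and
    punctuation before scoring, which this sketch does not do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 100.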