| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Audio Classification | AudioSet 20K | mAP 47.8 | 128 |
| Audio Classification | AudioSet 2M | mAP 50.5 | 79 |
| 1D audio reconstruction | AudioSet | NMSE 0.006 | 63 |
| Classification | AudioSet (test) | mAP 49.6 | 57 |
| Sound Classification | AudioSet (evaluation) | mAP 47.1 | 39 |
| Audio Reconstruction | AudioSet (eval) | Mel Distance 0.382 | 35 |
| Acoustic event detection | AudioSet (test) | mAP 0.462 | 34 |
| Audio Event Tagging | AudioSet AS-2M (full) | mAP 50.2 | 33 |
| Audio Classification | AudioSet-2M (full) | mAP 48.6 | 32 |
| Audio Classification | AudioSet | mAP 48.5 | 25 |
| Audio Event Tagging | AudioSet (AS-20K) | mAP 46.7 | 24 |
| Audio Classification | AudioSet Full (test) | mAP 45.9 | 23 |
| Classification | AudioSet AS-2M | mAP (%) 50.2 | 21 |
| Generalized Zero-Shot Retrieval (Text-to-Audio) | AudioSet ZSL (test) | mAP (S) 72.25 | 19 |
| Sound Event Detection | AudioSet Strongly-labeled (test) | PSDS1 (w/o var-pen) 0.374 | 18 |
| Audio-visual event classification | AudioSet 2M | mAP (Audio-only) 49.1 | 16 |
| Generalized Zero-Shot Classification | AudioSet ZSL (test) | mAcc (Seen) 50.96 | 16 |
| Audio Reconstruction | AudioSet (test) | Mel Distance (44kHz) 0.417 | 15 |
| Audio Tagging | AudioSet (test) | mAP 50 | 14 |
| Audio Classification | AudioSet-20K (test) | mAP 37.4 | 13 |
| Audio Classification | AudioSet (balanced) | mAP 37.8 | 13 |
| Sound Event Detection | AudioSet Strong (407 classes) | PSDS1A 0.496 | 12 |
| Audio-visual classification | AudioSet | Top-1 Accuracy 55.85 | 12 |
| Audio Classification | AudioSet 20K v1 | mAP 41.9 | 11 |
| Audio-visual event classification | AudioSet 20K | mAP (Audio-only) 42.4 | 11 |