| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Audio-Visual Classification | CREMA-D (test) | Accuracy | 79.7 | 60 |
| Multimodal Classification | CREMA-D | Accuracy | 77.92 | 28 |
| Multimodal Classification | CREMA-D (test) | Multi Accuracy | 80.21 | 25 |
| Speech Emotion Recognition | CREMA-D (test) | Accuracy | 68.12 | 24 |
| Emotion Recognition | CREMA-D | Accuracy (6) | 68.4 | 23 |
| Discrete Emotion Recognition | CREMA-D 18 (test) | Accuracy | 55.01 | 19 |
| Emotion Classification | CREMA-D | F1 (Macro) | 77.9 | 18 |
| Emotion recognition | CREMA-D (test) | Accuracy | 89.15 | 17 |
| Emotion Recognition | CREMA-D 6-class | WAR | 79.36 | 17 |
| Emotional Text-to-Speech | CREMA-D | Angry Accuracy | 89.2 | 15 |
| Audio Classification | CREMA-D 6 | Top-1 Accuracy | 43.3 | 15 |
| Mixed-emotion Text-to-Speech | CREMA-D (in-distribution) | Embedding Similarity (E-SIM) | 0.795 | 15 |
| Audio Classification | Crema-D | Accuracy | 73 | 15 |
| Speech Emotion Recognition | CREMA-D Subject Dependent (train test) | Macro Accuracy | 68.27 | 14 |
| Speech Emotion Recognition | CREMA-D (subject-independent) | Mean Macro Accuracy | 68.57 | 14 |
| Categorical Emotion Recognition | CREMA-D | UAR | 85.71 | 14 |
| Neural Audio Compression | CREMA-D | ViSQOL Score | 4.32 | 13 |
| Emotion Preservation | CREMA-D | MEDR | 1.19 | 13 |
| Speech Emotion Recognition | CREMA-D 6 classes (test) | Weighted Accuracy (WA) | 75.2 | 12 |
| Emotion Recognition | CREMA-D | WA (Weighted Average) | 56 | 12 |
| Speech Emotion Recognition | CREMA-D | Weighted Accuracy | 95.24 | 12 |
| Audio emotion recognition | CREMA-D | Accuracy | 70.47 | 11 |
| Classification | CREMA-D (test) | Accuracy | 75.17 | 10 |
| Audio Classification | CREMA-D (test) | Accuracy | 45.06 | 9 |
| Talking Face Generation | CREMA-D | FID | 5.29 | 9 |