| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Person Re-identification | GRID | Rank-1 Acc 56.9 | 44 |
| Person Re-Identification | GRID (test) | Rank-1 Acc 57.2 | 24 |
| Person Re-identification | GRID (target) | mAP 60.1 | 20 |
| Lip-reading | GRID (test) | WER 1.09 | 18 |
| Person Re-identification | GRID Protocol-1 | mAP 68.1 | 16 |
| Multi-speaker Dubbing | GRID Dub 1.0 (test) | SPK-SIM (%) 100 | 12 |
| Person Re-identification | GRID G (test) | R1 56.4 | 12 |
| Video-to-Speech Synthesis | GRID (test) | Sim-O 0.87 | 11 |
| Link Prediction | Grid probe (test) | AUC 0.639 | 11 |
| Movie Dubbing | GRID Dubbing Setting 2.0 | LSE-C 7.134 | 10 |
| Movie Dubbing | GRID Dubbing Setting 1.0 | LSE-C 7.13 | 10 |
| Constrained Reinforcement Learning | Grid | Episodic Reward 276.3 | 8 |
| Speech Reconstruction | GRID (speaker-dependent) | STOI 0.738 | 7 |
| Person Re-identification | GRID P=900 (test) | Rank-1 16.56 | 7 |
| Dubbing | GRID | DD 0 | 6 |
| Movie Dubbing | GRID2V2C | DD (Sync Error) 0 | 6 |
| Graph Generation | Grid (test) | Train Time (s) 0.28 | 6 |
| Video-Driven Text-to-Speech | GRID standard (test) | LSE-C 7.68 | 6 |
| Lipreading | GRID | WER 2.9 | 6 |
| LockedRoom | MiniGrid | Mean Return 0.01 | 5 |
| Speech Separation | GRID (test) | SDR 1.46 | 5 |
| Anomaly Detection | Grid Texture (test) | AUC 0.983 | 5 |
| Speech Enhancement (3 Speakers) | GRID speaker-independent (test) | SDR 4.02 | 5 |
| Speech Enhancement (2 Speakers) | GRID speaker-independent (test) | SDR 8.05 | 5 |
| Lip to Speech | GRID unseen (test) | STOI 0.731 | 5 |