| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Visual-only Speech Recognition | LRS2 (test) | WER 12.6 | 63 |
| Speech Recognition | LRS2 (test) | WER 1.3 | 49 |
| Visual Speech Recognition | LRS2 | Mean WER 14.6 | 45 |
| Audio-visual Speech Recognition | LRS2 (test) | WER 1.3 | 34 |
| Audio-visual speech separation | LRS2-2Mix (test) | SI-SNRi 16 | 33 |
| Lip Reading | LRS2 (test) | WER 22.6 | 28 |
| Automatic Speech Recognition | LRS2-2Mix (test) | WER 17.74 | 18 |
| Speech Enhancement | LRS2 mixed with VGGSound noises (test) | PESQ 3.22 | 18 |
| Talking Face Generation | LRS2 (test) | SSIM 1 | 18 |
| Audio-Visual Speech Separation | LRS2 (test) | SDRi 12.46 | 14 |
| Visual Speech Recognition | LRS2 v0.4 (test) | WER 3.7 | 14 |
| English Transcription | LRS2 clean (test) | ASR WER 1.3 | 12 |
| Audio-visual speech separation | LRS2 2Mix | SDRi 15.9 | 12 |
| Audio-Visual Speech Recognition | LRS2 (clean) | WER 2.2 | 12 |
| Automatic Visual Speech Recognition | LRS2 clean (test) | WER 2.2 | 12 |
| Lip-syncing | LRS2 1 (test) | LSE-D 6.386 | 12 |
| Audio-Visual Speech Recognition | LRS2 50% visual occlusion (test) | WER (Overall) 6.4 | 10 |
| Speech Separation | LRS2-2Mix (test) | GPU RTF (s) (Forward) 0.0118 | 10 |
| Talking Face Generation | LRS2 | ID-SIM 1 | 8 |
| Audio-visual speech separation | LRS2-3Mix (test) | SI-SNRi 13.7 | 8 |
| ASR Error Correction | LRS2 (test) | WER 2.6 | 8 |
| speaker separation | LRS2 synthetic (test) | SDR 14.2 | 7 |
| Audio Speech Recognition | LRS2 v0.4 (test) | WER 3.9 | 7 |
| Talking Head Generation | LRS2 35 | LSE-C 7.287 | 6 |
| Lip synchronisation | LRS2 3 (test) | Acc (5 frames) 88.1 | 6 |