| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Watermarking | LFQA | TPR (FPR < 10^-4) | 100 | 40 |
| Pairwise Ranking | LFQA | Pairwise Preference Accuracy | 77.24 | 13 |
| Sycophancy | LFQA | Sycophancy (PD, L) | 0.276 | 6 |
| Answer quality evaluation | LFQA | GPT-4o Score | 4.115 | 4 |
| Multi-bit Watermarking | LFQA | Perplexity | 2.636 | 4 |
| Long-form Question Answering | LFQA | AIS (Decomposition) | 90.9 | 4 |
| Long-Form Question Answering | LFQA (test) | R-L | 38.2 | 3 |
| Machine-Generated Text Detection | LFQA (10% Editing) | TPR | 99.9 | 3 |
| Machine-Generated Text Detection | LFQA No Editing | TPR | 100 | 3 |