| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| AI-generated text detection | Long-form QA 3K generations corpus | Detection Accuracy (1% FPR): 100 | 42 |
| AI-generated text detection | Long-form QA 46K ShareGPT-augmented corpus | Detection Accuracy (1% FPR): 100 | 18 |
| AI-generated text detection | Long-form QA 9K pooled generations corpus | Detection Accuracy (at 1% FPR): 100 | 18 |
| Long-form QA | Long-form QA Short Q, Long A (test) | GPT4 Score: 6.182 | 15 |
| Long-form Question Answering | Long-form QA (test) | Win Rate vs. Holistic Reward: 61.7 | 13 |
| Faithfulness Evaluation | Long-Form QA | Correlation (Human Judgment): 0.795 | 2 |
| AI-generated text detection | Long-form QA 46K ShareGPT-augmented corpus 1.0 (test) | Detection Accuracy (1% FPR): - | 0 |
| AI-generated text detection | Long-form QA 9K pooled generations corpus 1.0 (test) | Accuracy (1% FPR): - | 0 |
| AI-generated text detection | Long-form QA 3K generations corpus 1.0 (test) | Detection Acc (1% FPR): - | 0 |