| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| AI-generated text detection | Long-form QA 3K generations corpus | Detection Accuracy (1% FPR): 100 | 42 |
| AI-generated text detection | Long-form QA 46K ShareGPT-augmented corpus | Detection Accuracy (1% FPR): 100 | 18 |
| AI-generated text detection | Long-form QA 9K pooled generations corpus | Detection Accuracy (at 1% FPR): 100 | 18 |
| Long-form QA | Long-form QA Short Q, Long A (test) | GPT4 Score: 6.182 | 15 |
| Long-form Question Answering | Long-form QA (test) | Win Rate vs. Holistic Reward: 61.7 | 13 |
| Conformal Prediction | Long-form QA events | Coverage: 82.62 | 2 |
| Conformal Prediction | Long-form QA landmarks | Coverage: 80.39 | 2 |
| Conformal Prediction | Long-form QA movies | Coverage: 87.71 | 2 |
| Conformal Prediction | Long-form QA cities | Coverage: 86.35 | 2 |
| Conformal Prediction | Long-form QA books | Coverage: 85.73 | 2 |
| Conformal Prediction | Long-form QA artworks | Coverage: 76.9 | 2 |
| Conformal Prediction | Long-form QA persons | Coverage: 79.82 | 2 |
| Conformal Prediction | Long-form QA inventions | Coverage: 85.12 | 2 |
| Faithfulness Evaluation | Long-Form QA | Correlation (Human Judgment): 0.795 | 2 |
| AI-generated text detection | Long-form QA 46K ShareGPT-augmented corpus 1.0 (test) | Detection Accuracy (1% FPR): - | 0 |
| AI-generated text detection | Long-form QA 9K pooled generations corpus 1.0 (test) | Accuracy (1% FPR): - | 0 |
| AI-generated text detection | Long-form QA 3K generations corpus 1.0 (test) | Detection Acc (1% FPR): - | 0 |