Long-form QA

Benchmarks

Task Name	Dataset Name	Metric	SOTA Result	
Long-form QA Factuality Detection	Long-form QA benchmark factuality target	PR-AUC	17.3	48
AI-generated text detection	Long-form QA 3K generations corpus	Detection Accuracy (1% FPR)	100	42
AI-generated text detection	Long-form QA 46K ShareGPT-augmented corpus	Detection Accuracy (1% FPR)	100	18
AI-generated text detection	Long-form QA 9K pooled generations corpus	Detection Accuracy (at 1% FPR)	100	18
Long-form QA	Long-form QA Short Q, Long A (test)	GPT4 Score	6.182	15
Long-form Question Answering	Long-form QA (test)	Win Rate vs. Holistic Reward	61.7	13
Conformal Prediction	Long-form QA events	Coverage	82.62	2
Conformal Prediction	Long-form QA landmarks	Coverage	80.39	2
Conformal Prediction	Long-form QA movies	Coverage	87.71	2
Conformal Prediction	Long-form QA cities	Coverage	86.35	2
Conformal Prediction	Long-form QA books	Coverage	85.73	2
Conformal Prediction	Long-form QA artworks	Coverage	76.9	2
Conformal Prediction	Long-form QA persons	Coverage	79.82	2
Conformal Prediction	Long-form QA inventions	Coverage	85.12	2
Faithfulness Evaluation	Long-Form QA	Correlation (Human Judgment)	0.795	2
AI-generated text detection	Long-form QA 46K ShareGPT-augmented corpus 1.0 (test)	Detection Accuracy (1% FPR)	-	0
AI-generated text detection	Long-form QA 9K pooled generations corpus 1.0 (test)	Accuracy (1% FPR)	-	0
AI-generated text detection	Long-form QA 3K generations corpus 1.0 (test)	Detection Acc (1% FPR)	-	0