
Long-form QA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| AI-generated text detection | Long-form QA 3K generations corpus | Detection Accuracy (1% FPR): 100 | 42 |
| AI-generated text detection | Long-form QA 46K ShareGPT-augmented corpus | Detection Accuracy (1% FPR): 100 | 18 |
| AI-generated text detection | Long-form QA 9K pooled generations corpus | Detection Accuracy (at 1% FPR): 100 | 18 |
| Long-form QA | Long-form QA Short Q, Long A (test) | GPT4 Score: 6.182 | 15 |
| Long-form Question Answering | Long-form QA (test) | Win Rate vs. Holistic Reward: 61.7 | 13 |
| Conformal Prediction | Long-form QA events | Coverage: 82.62 | 2 |
| Conformal Prediction | Long-form QA landmarks | Coverage: 80.39 | 2 |
| Conformal Prediction | Long-form QA movies | Coverage: 87.71 | 2 |
| Conformal Prediction | Long-form QA cities | Coverage: 86.35 | 2 |
| Conformal Prediction | Long-form QA books | Coverage: 85.73 | 2 |
| Conformal Prediction | Long-form QA artworks | Coverage: 76.9 | 2 |
| Conformal Prediction | Long-form QA persons | Coverage: 79.82 | 2 |
| Conformal Prediction | Long-form QA inventions | Coverage: 85.12 | 2 |
| Faithfulness Evaluation | Long-Form QA | Correlation (Human Judgment): 0.795 | 2 |
| AI-generated text detection | Long-form QA 46K ShareGPT-augmented corpus 1.0 (test) | Detection Accuracy (1% FPR): - | 0 |
| AI-generated text detection | Long-form QA 9K pooled generations corpus 1.0 (test) | Accuracy (1% FPR): - | 0 |
| AI-generated text detection | Long-form QA 3K generations corpus 1.0 (test) | Detection Acc (1% FPR): - | 0 |