ILDAE: Instance-Level Difficulty Analysis of Evaluation Data

About

Knowing the difficulty level of questions helps a teacher in several ways, such as quickly estimating students' potential by asking carefully selected questions and improving the quality of examinations by modifying trivial and hard questions. Can we extract such benefits of instance difficulty in NLP? To this end, we conduct Instance-Level Difficulty Analysis of Evaluation data (ILDAE) in a large-scale setup of 23 datasets and demonstrate its five novel applications: 1) conducting efficient-yet-accurate evaluations with fewer instances, saving computational cost and time, 2) improving the quality of existing evaluation datasets by repairing erroneous and trivial instances, 3) selecting the best model based on application requirements, 4) analyzing dataset characteristics to guide future data creation, 5) estimating Out-of-Domain performance reliably. Comprehensive experiments for these applications yield several interesting findings: evaluation using just 5% of instances (selected via ILDAE) achieves a Kendall correlation as high as 0.93 with evaluation using the complete dataset, and computing weighted accuracy using difficulty scores leads to 5.2% higher correlation with Out-of-Domain performance. We release the difficulty scores and hope our analyses and findings will bring more attention to this important yet understudied field of leveraging instance difficulty in evaluations.
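To make the idea of difficulty-weighted accuracy concrete, here is a minimal sketch. The abstract does not specify the exact weighting formula, so the scheme below (each instance counts in proportion to its difficulty score) is an assumption, and the function name `weighted_accuracy` and the toy data are hypothetical rather than taken from the released code or scores.

```python
def weighted_accuracy(correct, difficulty):
    """Accuracy in which each instance counts in proportion to its difficulty.

    Assumed scheme (not confirmed by the abstract): the weight of an
    instance is its difficulty score, normalized by the total difficulty.
    """
    if len(correct) != len(difficulty):
        raise ValueError("correct and difficulty must have the same length")
    total = sum(difficulty)
    if total <= 0:
        raise ValueError("difficulty scores must sum to a positive value")
    # Sum the difficulty of the instances the model answered correctly.
    earned = sum(d for c, d in zip(correct, difficulty) if c)
    return earned / total


# Toy data: the model gets the two easy instances and one hard one right.
correct = [True, True, False, True]
difficulty = [0.1, 0.2, 0.9, 0.8]

plain = sum(correct) / len(correct)
weighted = weighted_accuracy(correct, difficulty)
print(f"plain accuracy:    {plain:.2f}")
print(f"weighted accuracy: {weighted:.2f}")
```

In this toy case the weighted score is lower than the plain accuracy because the one missed instance is the hardest, which is the kind of signal a difficulty-aware metric is meant to surface.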

Neeraj Varshney, Swaroop Mishra, Chitta Baral • 2022

Related benchmarks

Task                                               Dataset      Result                      Rank
Ranking correlation with full dataset evaluation   SNLI         Kendall Correlation 0.93    10
Ranking correlation with full dataset evaluation   PAWS Wiki    Kendall Correlation 0.96    10
Ranking correlation with full dataset evaluation   AGNews       Kendall Correlation 0.89    10
Ranking correlation with full dataset evaluation   QNLI         Kendall Correlation 0.91    10
Ranking correlation with full dataset evaluation   MRPC         Kendall Correlation 0.65    10
Ranking correlation with full dataset evaluation   SocialIQA    Kendall Correlation 0.81    10
Ranking correlation with full dataset evaluation   QQP          Kendall Correlation 0.95    10
Ranking correlation with full dataset evaluation   DNLI         Kendall Correlation 0.95    10
Ranking correlation with full dataset evaluation   SWAG         Kendall Correlation 0.93    10
Ranking correlation with full dataset evaluation   MNLI         Kendall Correlation 0.95    10
Showing 10 of 24 rows
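
The Kendall correlations above compare how a set of models is ranked on a reduced evaluation set against how they are ranked on the full dataset. The sketch below shows one way such a number can be computed. It uses a simple tau-a (no tie correction) written in plain Python; the helper name `kendall_tau_a` and the model scores are invented for illustration and are not results from the paper, and the paper may use a different tau variant.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a between two equally long score lists (no tie correction)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both lists order models i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies of five models on the full set and on a small subset.
full_scores = [0.91, 0.88, 0.85, 0.80, 0.72]
subset_scores = [0.70, 0.66, 0.68, 0.55, 0.41]

tau = kendall_tau_a(full_scores, subset_scores)
print(f"Kendall tau-a: {tau:.2f}")
```

Here the subset swaps the order of only the second and third models, so nine of the ten model pairs agree and the correlation stays high. A value close to 1 means the reduced evaluation ranks models almost the same way as the full one.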
