Correctness Prediction on Model-Query Evaluation (112 language models, 10 public benchmarks) (test)

70.12 Accuracy (Prediction)

IRT-NET

Updated 1mo ago

Evaluation Results

Method	Links	Accuracy (Prediction)
IRT-NET 2026.01		70.12
LOCUS 2026.01		70.03
EMBEDLLM 2026.01		69.47
IRT-NET 2026.01		69.07
LOCUS 2026.01		68.33
LOCUS 2026.01		68.31
EMBEDLLM 2026.01		68.12
IRT-NET 2026.01		67.38
EMBEDLLM 2026.01		67.33