
Question Answering on HotpotQA (test) (ACC, AUROC, Brier, ECE Metrics)

Accuracy: 68.6

GPT-4.1

[Chart: Accuracy, axis ticks 27.416 to 59.492; date marker Nov 18, 2025]
Updated 16d ago

Evaluation Results

Accuracy  AUROC  Brier  ECE   Date
68.6      -      -      -     2025.11
68.2      77.2   20.8   17    2025.11
68.2      84     15.5   8.4   2025.11
67.9      72.8   29.1   29    2025.11
67.3      75.5   23.2   20.6  2025.11
66.9      -      -      -     2025.11
66.8      86.2   15.1   8.2   2025.11
65.9      87     17.4   13.7  2025.11
59.4      77.7   20.3   11.6  2025.11
59.1      75     29.3   28.7  2025.11
58.3      -      -      -     2025.11
57.2      69.8   27.6   24.1  2025.11
43.9      68.4   46.3   47.4  2025.11
43.8      68.9   31.8   29.5  2025.11
43.7      -      -      -     2025.11
43.7      59.6   48.5   48.6  2025.11
43.7      70.2   46     46.2  2025.11
42.6      66.4   51.8   52.6  2025.11
33.6      71.4   33.6   32.2  2025.11
33        77.6   37.6   43.2  2025.11
31.9      65.6   55     57.2  2025.11
31.7      76.5   33.3   38.5  2025.11
31.7      78.1   20.3   16.9
30.3      -      -      -     2025.11
30.3      58.6   63.9   64.7  2025.11
30.3      73.3   48.7   49    2025.11
29.9      -      -      -     2025.11
29        64.2   63.2   64.7  2025.11
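
The benchmark reports four metrics: accuracy, AUROC, Brier score, and ECE. As a rough illustration of how such numbers are commonly computed from per-question confidences and correctness labels, here is a minimal Python sketch. It is not this leaderboard's evaluation code: the function names, the 10-bin equal-width ECE, and the tie handling in AUROC are assumptions made for the example. The scores listed above appear to be on a 0-100 scale, whereas these functions return values in [0, 1].

```python
from typing import List, Sequence


def accuracy(correct: Sequence[int]) -> float:
    """Fraction of answers marked correct (labels are 0 or 1)."""
    return sum(correct) / len(correct)


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer gets higher confidence than an
    incorrect one, counting ties as one half (pairwise definition)."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """ECE with equal-width confidence bins (the bin count is an assumption)."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(conf)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            bin_acc = sum(y for _, y in b) / len(b)
            ece += len(b) / total * abs(avg_conf - bin_acc)
    return ece


if __name__ == "__main__":
    # Hypothetical toy data: four answers with confidences and 0/1 correctness.
    conf = [0.9, 0.8, 0.3, 0.6]
    correct = [1, 1, 0, 0]
    print("accuracy:", accuracy(correct))
    print("AUROC:   ", auroc(conf, correct))
    print("Brier:   ", brier_score(conf, correct))
    print("ECE:     ", expected_calibration_error(conf, correct))
```

Higher is better for accuracy and AUROC, while lower is better for Brier score and ECE, which is why a row can rank high on accuracy yet still show weak calibration.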