Multi-hop Reasoning on HotpotQA (EM/F1)

90.4 Accuracy

CoT2-Meta

[Chart: Accuracy by date, Mar 18, 2026 to Apr 1, 2026]
Updated 16d ago

Evaluation Results

Date     Accuracy  EM     F1
2026.03  90.4      -      -
2026.03  87.4      -      -
2026.03  85.1      -      -
2026.03  82.5      -      -
2026.03  78.5      -      -
2026.04  60.2      -      -
2026.04  59.1      -      -
2026.04  57.3      -      -
-        53.5      40     55
-        51.5      43.5   55
-        51        42     55
2026.04  44.8      -      -
2026.03  44        34.5   48.9
2026.04  43.9      -      -
2026.03  42.5      35.5   46.7
2026.03  40        31     44.4
2026.04  39.8      -      -
2026.04  38.8      -      -
2026.04  37.5      -      -
2026.03  30.8      27.5   36.2
2025.11  -         21.09  24.25
2025.11  -         33.59  40.07
2025.11  -         23.44  31.09
2025.11  -         18.75  22.98
2025.11  -         42.97  49.7
2025.11  -         27.34  32.27
2025.11  -         34.38  46.83
2025.11  -         27.34  37.1
2025.11  -         38.28  47.03
2025.11  -         44.53  52.31