
Knowledge-Intensive Reasoning on HotpotQA (F1 score)

0.654 F1 Score

Llama3.1-8B + ARPO
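For readers unfamiliar with the metric, here is a minimal sketch of answer-level F1 as it is commonly computed for extractive QA benchmarks: token overlap between the predicted and gold answer after light normalization. The page does not state which scoring script produced these numbers, so the normalization rules and the helper names `normalize_answer`, `token_f1`, and `mean_f1` are illustrative assumptions, not the leaderboard's own implementation.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the leaderboard may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall for one answer."""
    pred_tokens = normalize_answer(prediction)
    gold_tokens = normalize_answer(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs: list[tuple[str, str]]) -> float:
    """Average token F1 over (prediction, gold) pairs; 0.0 for an empty list."""
    if not pairs:
        return 0.0
    return sum(token_f1(p, g) for p, g in pairs) / len(pairs)
```

Under this definition a dataset-level score such as 0.654 would be the mean of per-question F1 values, so partially correct answers contribute fractional credit rather than counting as all-or-nothing.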

[Chart: best F1 score over time, Sep 28, 2025 to Jun 1, 2026]

Evaluation Results

Date      F1 score
2025.12   0.654     -      -
2026.06   0.624     46.7   1.98
2026.06   0.592     43.9   1.97
2025.12   0.59      -      -
2025.12   0.588     -      -
2026.06   0.588     44.3   2.89
2025.12   0.585     -      -
2025.12   0.578     -      -
2025.12   0.577     -      -
2025.12   0.571     -      -
2025.12   0.566     -      -
2025.12   0.565     -      -
2025.12   0.559     -      -
2025.12   0.551     -      -
2025.12   0.548     -      -
2026.06   0.544     42.5   2.62
2026.06   0.541     42.5   1.99
2026.06   0.506     40     1.81
2025.12   0.485     -      -
2026.06   0.477     36     4.26
2026.06   0.465     38.5   1.84
2026.06   0.457     34.5   1.93
2026.06   0.424     32.5   3.82
2026.06   0.423     37.5   2.99
2026.06   0.412     32.5   2.28
2026.06   0.407     31.8   2.7
2026.06   0.385     30     2.29
2026.06   0.307     29.8   3.12
2025.12   0.243     -      -
2025.09   0.2375    -      -
2026.06   0.223     7.2    -
          0.2146    -      -
2025.12   0.154     -      -
2026.06   0.154     10.4   1.44
2025.12   0.148     -      -
2026.06   0.138     6.7    2.37
2025.09   0.1286    -      -
2025.12   0.122     -      -
2026.06   0.106     5.6    -
2025.12   0.097     -      -
2026.06   0.097     7.2    -