Multi-hop Reasoning on MuSiQue (F1)
Top F1 Score: 73.3 (TOTAL)
[Chart: F1 Score over time; data point dated Oct 8, 2025]
Updated 1mo ago
Evaluation Results
Method      Configuration               Date      F1 Score
TOTAL       Model=Claude, Promptin...   2025.10   73.3
TOTAL       Model=Gemini, Promptin...   2025.10   72.86
CIC + COT   Model=Gemini, Promptin...   2025.10   67.17
CIC         Model=Gemini, Promptin...   2025.10   66.54
TOTAL       Model=GPT, Prompting M...   2025.10   66.38
CIC + COT   Model=GPT, Prompting M...   2025.10   65.11
CIC + COT   Model=Claude, Promptin...   2025.10   65.07
CIC         Model=Claude, Promptin...   2025.10   63.87
CIC         Model=GPT, Prompting M...   2025.10   63.79
NAÏVE       Model=GPT, Prompting M...   2025.10   32.43
COT         Model=GPT, Prompting M...   2025.10   32.39
COT         Model=Claude, Promptin...   2025.10   28.1
NAÏVE       Model=Claude, Promptin...   2025.10   27.57
NAÏVE       Model=Gemini, Promptin...   2025.10   25.48
COT         Model=Gemini, Promptin...   2025.10   23.03
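
This page does not state how the F1 Score is computed. As a rough illustration only, the sketch below assumes the token-overlap F1 commonly used for question-answering benchmarks: answers are normalized, then precision and recall are taken over shared tokens. The function names (normalize, token_f1, corpus_f1) and the normalization rules are hypothetical choices for this example, not the leaderboard's actual scoring code.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation and English articles, then tokenize.

    Assumption: this mirrors a common QA normalization; the leaderboard's
    real preprocessing is not documented on the page.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one predicted answer and one reference."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_f1(pairs: list[tuple[str, str]]) -> float:
    """Mean token F1 over (prediction, reference) pairs, on a 0-100 scale."""
    scores = [token_f1(pred, ref) for pred, ref in pairs]
    return 100.0 * sum(scores) / len(scores)
```

Under these assumptions, a score such as 73.3 would be the average per-question F1 scaled to 0-100, so partially correct answers earn partial credit rather than counting as outright misses.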