Multi-hop Reasoning on Housing QA
Best accuracy: 82.67 (TOTAL), Oct 8, 2025
Evaluation Results
Method      Details                     Date      Accuracy
TOTAL       Model=Claude, Promptin...   2025.10   82.67
CIC + COT   Model=Claude, Promptin...   2025.10   75
TOTAL       Model=Gemini, Promptin...   2025.10   74.33
CIC         Model=Claude, Promptin...   2025.10   71.67
CIC + COT   Model=Gemini, Promptin...   2025.10   70.33
TOTAL       Model=GPT, Prompting M...   2025.10   70
CIC         Model=Gemini, Promptin...   2025.10   68.33
CIC + COT   Model=GPT, Prompting M...   2025.10   66
CIC         Model=GPT, Prompting M...   2025.10   64.33
COT         Model=GPT, Prompting M...   2025.10   61.33
NAÏVE       Model=Claude, Promptin...   2025.10   60.33
NAÏVE       Model=GPT, Prompting M...   2025.10   60.33
COT         Model=Gemini, Promptin...   2025.10   58.67
NAÏVE       Model=Gemini, Promptin...   2025.10   58
COT         Model=Claude, Promptin...   2025.10   57.67