Multi-hop QA on DROP (test)
[Chart: F1 Score over time. Best result: DenoiseFlow, 87.9 F1 Score (Feb 28, 2026). Updated 1mo ago.]
Evaluation Results
| Method | Details | Date | F1 Score |
|---|---|---|---|
| DenoiseFlow | Backbone LLM=GPT-4o-mi... | 2026.02 | 87.9 |
| JudgeFlow | Backbone LLM=GPT-4o-mi... | 2026.02 | 86.1 |
| MermaidFlow | Backbone LLM=GPT-4o-mi... | 2026.02 | 85.5 |
| MaAS | Backbone LLM=GPT-4o-mi... | 2026.02 | 83.1 |
| DyLAN | Backbone LLM=GPT-4o-mi... | 2026.02 | 82.2 |
| GPTSwarm | Backbone LLM=GPT-4o-mi... | 2026.02 | 81 |
| AFlow | Backbone LLM=GPT-4o-mi... | 2026.02 | 80.6 |
| LLM-Blender | Backbone LLM=GPT-4o-mi... | 2026.02 | 80.4 |
| CoT SC | Backbone LLM=GPT-4o-mi... | 2026.02 | 78.8 |
| CoT | Backbone LLM=GPT-4o-mi... | 2026.02 | 78.5 |
| LLM-Debate | Backbone LLM=GPT-4o-mi... | 2026.02 | 78.1 |
| ADAS | Backbone LLM=GPT-4o-mi... | 2026.02 | 76.6 |
| Self-Refine | Backbone LLM=GPT-4o-mi... | 2026.02 | 70.2 |
| IO | Backbone LLM=GPT-4o-mi... | 2026.02 | 68.3 |