Agent Interaction on FTWP (test)
[Chart: Success Rate by date. Top entry: GPT-4o, 51.08 Success Rate, Apr 13, 2026]
Updated 4d ago
Evaluation Results
| Method | Backbone | Date | Success Rate |
|---|---|---|---|
| GPT-4o | GPT-4o | 2026.04 | 51.08 |
| Gemma2-9B + MISE | Gemma2-9B-Ins... | 2026.04 | 32.94 |
| GPT-4o-mini | GPT-4o-mini | 2026.04 | 31.63 |
| Gemma2-9B + PRM | Gemma2-9B-Ins... | 2026.04 | 30.5 |
| Gemma2-9B + PPO | Gemma2-9B-Ins... | 2026.04 | 29.83 |
| LLaMA3-8B + MISE | LLaMA3-8B-Ins... | 2026.04 | 28.9 |
| LLaMA3-8B + PRM | LLaMA3-8B-Ins... | 2026.04 | 25.61 |
| Qwen2-7B + MISE | Qwen2-7B-Inst... | 2026.04 | 24.9 |
| LLaMA3-8B + PPO | LLaMA3-8B-Ins... | 2026.04 | 24.48 |
| Qwen2-7B + PRM | Qwen2-7B-Inst... | 2026.04 | 22.81 |
| LLaMA3-8B + RFT | LLaMA3-8B-Ins... | 2026.04 | 22.32 |
| Qwen2-7B + PPO | Qwen2-7B-Inst... | 2026.04 | 20.85 |
| LLaMA3-8B + online DPO | LLaMA3-8B-Ins... | 2026.04 | 20.09 |
| LLaMA3-8B + ReAct | LLaMA3-8B-Ins... | 2026.04 | 17.8 |
| Gemma2-9B | Gemma2-9B-Ins... | 2026.04 | 17.3 |
| LLaMA3-8B | LLaMA3-8B-Ins... | 2026.04 | 17.24 |
| Qwen2-7B | Qwen2-7B-Inst... | 2026.04 | 13.11 |