Agent Interaction on FTWP (val)
Top result: GPT-4o, Success Rate 43.66
[Chart: Success Rate; date shown: Apr 13, 2026]
Updated 4d ago
Evaluation Results
| Method | Backbone | Date | Success Rate |
|---|---|---|---|
| GPT-4o | GPT-4o | 2026.04 | 43.66 |
| Gemma2-9B + MISE | Gemma2-9B-Ins... | 2026.04 | 40.14 |
| Gemma2-9B + PRM | Gemma2-9B-Ins... | 2026.04 | 37.77 |
| Gemma2-9B + PPO | Gemma2-9B-Ins... | 2026.04 | 36.95 |
| LLaMA3-8B + MISE | LLaMA3-8B-Ins... | 2026.04 | 34.71 |
| Qwen2-7B + MISE | Qwen2-7B-Inst... | 2026.04 | 33.8 |
| Qwen2-7B + PRM | Qwen2-7B-Inst... | 2026.04 | 31.8 |
| Qwen2-7B + PPO | Qwen2-7B-Inst... | 2026.04 | 30.69 |
| LLaMA3-8B + PRM | LLaMA3-8B-Ins... | 2026.04 | 28.45 |
| LLaMA3-8B + PPO | LLaMA3-8B-Ins... | 2026.04 | 27.87 |
| GPT-4o-mini | GPT-4o-mini | 2026.04 | 27.26 |
| LLaMA3-8B + RFT | LLaMA3-8B-Ins... | 2026.04 | 24.75 |
| LLaMA3-8B + online DPO | LLaMA3-8B-Ins... | 2026.04 | 22.54 |
| LLaMA3-8B | LLaMA3-8B-Ins... | 2026.04 | 21.73 |
| LLaMA3-8B + ReAct | LLaMA3-8B-Ins... | 2026.04 | 18.81 |
| Gemma2-9B | Gemma2-9B-Ins... | 2026.04 | 16.3 |
| Qwen2-7B | Qwen2-7B-Inst... | 2026.04 | 16.2 |