Agentic Task Evaluation on τ2-Bench Retail (avg@4, pass@4)
[Chart: Avg@4 and Pass@4 by date (May 12, 2026). Top Avg@4: 69.7 (Snapshot).]
Evaluation Results
| Method | Backbone | Date | Avg@4 | Pass@4 |
|---|---|---|---|---|
| Snapshot | Qwen3-30B-A3B | 2026.05 | 69.7 | 92.1 |
| PPO-EWMA | Qwen3-30B-A3B | 2026.05 | 67.82 | 92.1 |
| Snapshot | Qwen3-4B | 2026.05 | 66.23 | 89.47 |
| Linear_prox | Qwen3-30B-A3B | 2026.05 | 65.8 | 87.7 |
| PPO-EWMA | Qwen3-4B | 2026.05 | 65.72 | 90.35 |
| Decoupled PPO | Qwen3-30B-A3B | 2026.05 | 65.43 | 89.47 |
| Linear_prox | Qwen3-4B | 2026.05 | 64.4 | 86.84 |
| Decoupled PPO | Qwen3-4B | 2026.05 | 63.96 | 88.6 |
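The page does not define the two metrics, so the following is only a minimal sketch of one common reading: Avg@4 as the mean per-task success rate over 4 trials, and Pass@4 as the share of tasks solved in at least one of 4 trials. Both definitions, the function names `avg_at_k` and `pass_at_k`, and the toy `results` data are assumptions for illustration, not taken from the benchmark's code.

```python
from typing import Dict, List


def avg_at_k(results: Dict[str, List[bool]]) -> float:
    """Assumed Avg@k: mean per-task success rate over the k trials, in percent."""
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return 100.0 * sum(per_task) / len(per_task)


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Assumed Pass@k: percent of tasks with at least one successful trial."""
    solved = [any(trials) for trials in results.values()]
    return 100.0 * sum(solved) / len(solved)


# Hypothetical outcomes: 3 tasks, 4 trials each (True = task completed).
results = {
    "task_a": [True, True, False, True],
    "task_b": [False, False, False, False],
    "task_c": [True, False, False, False],
}

print(f"Avg@4:  {avg_at_k(results):.2f}")
print(f"Pass@4: {pass_at_k(results):.2f}")
```

Under these assumed definitions Pass@4 can never fall below Avg@4, which matches the pattern in the table, where every Pass@4 value exceeds the corresponding Avg@4.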