Multi-turn conversation performance on Actions
Top result: 93.7 Average Performance (Full)
[Chart: Average Performance and Reliability; data point dated Feb 7, 2026]
Evaluation Results
Method                      Model, Setting              Date     Average Performance  Reliability
Full                        Model=GPT-4o-mini, Set...   2026.02  93.7                 92.4
Full                        Model=DeepSeek-v3.2-Th...   2026.02  92.2                 88.6
Full                        Model=GPT-5.2, Setting...   2026.02  90.2                 93.2
Experience-Driven Mediator  Model=DeepSeek-v3.2-Th...   2026.02  88                   71.6
Experience-Driven Mediator  Model=GPT-4o-mini, Set...   2026.02  85.7                 81.2
Experience-Driven Mediator  Model=GPT-5.2, Setting...   2026.02  76.2                 65.2
Sharded                     Model=GPT-4o-mini, Set...   2026.02  45.5                 60
Sharded                     Model=DeepSeek-v3.2-Th...   2026.02  42.3                 48.6
Sharded                     Model=GPT-5.2, Setting...   2026.02  35.6                 46.6