Multi-turn conversation performance on Average
[Chart: Avg Performance by date. Highlighted point: 94.7 (Full), Feb 7, 2026. Metrics shown: Avg Performance, Reliability.]
Updated 1mo ago
Evaluation Results
Method                      Configuration              Date     Avg Performance  Reliability
Full                        Model=DeepSeek-v3.2-Th...  2026.02  94.7             88.7
Full                        Model=GPT-5.2, Setting...  2026.02  92.7             85.4
Full                        Model=GPT-4o-mini, Set...  2026.02  86.9             83.2
Experience-Driven Mediator  Model=DeepSeek-v3.2-Th...  2026.02  81.9             69.9
Experience-Driven Mediator  Model=GPT-4o-mini, Set...  2026.02  73.9             68.8
Experience-Driven Mediator  Model=GPT-5.2, Setting...  2026.02  72.6             63.5
Sharded                     Model=DeepSeek-v3.2-Th...  2026.02  60.8             56.2
Sharded                     Model=GPT-4o-mini, Set...  2026.02  53.6             54.2
Sharded                     Model=GPT-5.2, Setting...  2026.02  48.5             46.8