Multi-turn conversation performance on Database
[Chart: Avg Performance and Reliability. Highlighted point: 96.3 Avg Performance, Full, Feb 7, 2026.]
Updated 1mo ago
Evaluation Results
Method                      Details                    Date     Avg Performance  Reliability
Full                        Model=GPT-5.2, Setting...  2026.02  96.3             95.9
Full                        Model=DeepSeek-v3.2-Th...  2026.02  94.4             88.8
Full                        Model=GPT-4o-mini, Set...  2026.02  92.5             93.5
Experience-Driven Mediator  Model=DeepSeek-v3.2-Th...  2026.02  67.3             55.9
Experience-Driven Mediator  Model=GPT-4o-mini, Set...  2026.02  65.3             59.8
Experience-Driven Mediator  Model=GPT-5.2, Setting...  2026.02  64.5             56.7
Sharded                     Model=GPT-4o-mini, Set...  2026.02  52.5             54.2
Sharded                     Model=GPT-5.2, Setting...  2026.02  49.4             48
Sharded                     Model=DeepSeek-v3.2-Th...  2026.02  43.6             54.2