Multi-turn conversation performance on Code
Top result: 98.3 Avg Performance (Full), Feb 7, 2026
[Chart: Avg Performance and Reliability]
Updated 1mo ago
Evaluation Results
| Method | Configuration | Date | Avg Performance | Reliability |
|---|---|---|---|---|
| Full | Model=DeepSeek-v3.2-Th... | 2026.02 | 98.3 | 95.9 |
| Experience-Driven Mediator | Model=DeepSeek-v3.2-Th... | 2026.02 | 86.1 | 84.8 |
| Full | Model=GPT-5.2, Setting... | 2026.02 | 83.2 | 84 |
| Sharded | Model=DeepSeek-v3.2-Th... | 2026.02 | 78.4 | 65.7 |
| Full | Model=GPT-4o-mini, Set... | 2026.02 | 74.2 | 76 |
| Experience-Driven Mediator | Model=GPT-5.2, Setting... | 2026.02 | 69.1 | 70.1 |
| Experience-Driven Mediator | Model=GPT-4o-mini, Set... | 2026.02 | 66.9 | 63.6 |
| Sharded | Model=GPT-4o-mini, Set... | 2026.02 | 51.4 | 57 |
| Sharded | Model=GPT-5.2, Setting... | 2026.02 | 39.4 | 44 |