Code Generation on HumanEval+, MBPP+, and BigCodeBench Aggregate
[Chart: Average Score over time. Best result: Code-A1, 70.72 (Mar 16, 2026).]
Evaluation Results
| Method | Setting | Date | Average Score |
|---|---|---|---|
| Code-A1 | Code LLM=Qwen2.5-Coder... | 2026.03 | 70.72 |
| Self-Play | Code LLM=Qwen2.5-Coder... | 2026.03 | 70.39 |
| Golden Tests | Code LLM=Qwen2.5-Coder... | 2026.03 | 70.37 |
| / | Code LLM=Qwen2.5-Coder... | 2026.03 | 68.35 |
| Code-A1 | Code LLM=Qwen2.5-Coder... | 2026.03 | 66.15 |
| Golden Tests | Code LLM=Qwen2.5-Coder... | 2026.03 | 65.14 |
| Self-Play | Code LLM=Qwen2.5-Coder... | 2026.03 | 64.67 |
| / | Code LLM=Qwen2.5-Coder... | 2026.03 | 60.84 |
| Code-A1 | Code LLM=Qwen2.5-Coder... | 2026.03 | 56.95 |
| Golden Tests | Code LLM=Qwen2.5-Coder... | 2026.03 | 56.23 |
| Self-Play | Code LLM=Qwen2.5-Coder... | 2026.03 | 55.88 |
| / | Code LLM=Qwen2.5-Coder... | 2026.03 | 51.21 |