
Coding

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Code Generation | Coding Eval+ LiveCode (test) | Eval+ Score: 87.2 | 32 |
| Response correctness and completeness evaluation | Coding | F1 Score: 85 | 32 |
| Multi-Agent System Performance | Coding | TS Score: 65 | 16 |
| Coding | Coding (val) | Pass@16: 100 | 16 |
| Reasoning | Coding | Normalized Score: 102.1 | 9 |
| Prompt Injection Detection | Coding Direct Prompt Injection | FPR: 0 | 7 |
| Code Generation | Coding Gender (test) | Cor (%): 40 | 5 |
| Code Generation | Coding Race (test) | Correctness Rate: 57 | 5 |
| Agentic Coding | Coding unseen tasks (test) | Pass@1: 29.2 | 3 |
| Coding | Coding Hard | Baseline Score: 36.67 | 1 |
| Coding | Coding Medium | Baseline Score: 31.96 | 1 |
| Coding | Coding Easy | Baseline: 51.14 | 1 |