LiveBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Reasoning | LiveBench Reasoning | Accuracy: 92 | 80 |
| General Reasoning | LiveBench | Accuracy: 53.47 | 50 |
| Ensemble Committee Selection | LiveBench (test) | Mean θtest: 99.07 | 34 |
| Code Generation | LiveBench (test) | Sig. Score: 56.5 | 26 |
| Reasoning | LiveBench | Accuracy: 22.3 | 25 |
| General LLM Benchmarking | LiveBench | Official Score: 49.6 | 24 |
| Code Generation | LiveBench | Avg@8: 42.9 | 22 |
| Code Generation | LiveBench | Signal: 58.7 | 21 |
| Mathematical Reasoning | LiveBench Math | Initial Task Score: 58.1 | 16 |
| Reasoning | LiveBench | Accuracy: 33 | 16 |
| General Evaluation | LiveBench | Accuracy: 46.83 | 15 |
| Coding | LiveBench | Accuracy: 40.23 | 15 |
| Mathematical Reasoning | LiveBench | Accuracy: 53.6 | 12 |
| Single-event Scene Revisit (Different Pose) | LiveBench | DINO Feature Similarity (FG): 0.691 | 8 |
| Single-event Scene Revisit (Same Pose) | LiveBench | PSNR (Background): 20.132 | 8 |
| Instruct Following | LiveBench | Average Instruction Following Score: 55.39 | 6 |
| General Evaluation | LiveBench 1125 | Score: 52.1 | 6 |
| General Tasks | LiveBench 2024-11-25 | Accuracy: 75.9 | 5 |
| Mathematical Reasoning | LiveBench Math (test) | Score: 51.95 | 5 |
| Examination | LiveBench 2024-11-25 | Score: 70.79 | 5 |
| General Tasks | LiveBench 0831 | Accuracy: 0.57 | 5 |
| Reasoning | LiveBench (test) | Accuracy: 18.15 | 3 |