LiveBench

Benchmarks

Task Name	Dataset Name	SOTA Result
Reasoning	LiveBench Reasoning	Accuracy 92	80
General Reasoning	LiveBench	Accuracy 53.47	55
Reasoning	LiveBench	Accuracy 31.1	40
Ensemble Committee Selection	LiveBench (test)	Mean θtest 99.07	34
Code Generation	LiveBench (test)	Sig. Score 56.5	26
General LLM Benchmarking	LiveBench	Official Score 49.6	24
Code Generation	LiveBench	Avg@8 42.9	22
Code Generation	LiveBench	Signal 58.7	21
General Language Modeling	LiveBench	Accuracy 31.1	17
Mathematical Reasoning	LiveBench Math	Initial Task Score 58.1	16
Reasoning	LiveBench	Accuracy 33	16
General Evaluation	LiveBench	Accuracy 46.83	15
Coding	LiveBench	Accuracy 40.23	15
Language Model Evaluation	LiveBench	Average Score 72.8	12
Code generation	LiveBench	Accuracy 31.1	12
Mathematical Reasoning	LiveBench	Accuracy 53.6	12
Code generation	LiveBench	Pass@10 20.39	8
Single-event Scene Revisit (Different Pose)	LiveBench	DINO Feature Similarity (FG) 0.691	8
Single-event Scene Revisit (Same Pose)	LiveBench	PSNR (Background) 20.132	8
Instruction Following	LiveBench IFEval	Quality 100	6
Instruct Following	LiveBench	Average Instruction Following Score 55.39	6
General Evaluation	LiveBench 1125	Score 52.1	6
General Tasks	LiveBench 2024-11-25	Accuracy 75.9	5
Mathematical Reasoning	LiveBench Math (test)	Score 51.95	5
Examination	LiveBench 2024-11-25	Score 70.79	5

Showing 25 of 35 rows