
Mean

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| General speculative decoding performance | Mean (MT-Bench, HumanEval, GSM8K) | Average Acceptance Length (τ): 6.52 | 112 |
| General Reasoning and Coding | Mean GSM8K, HumanEval, MBPP | Speed: 4 | 26 |
| Pixel-level manipulation detection | MEAN Across datasets | F1 Score: 72.8 | 20 |
| Code Generation | Mean Across MBPP, CodeAlpacaPy, HumanEval, LiveCodeBench | Speedup: 4.04 | 14 |
| Medical Image Classification | Mean | Accuracy: 73 | 13 |
| Visual Place Recognition | Mean Across Datasets | R@1: 80.9 | 12 |
| Pick-and-place | Mean Across T1, T2, T3 | Mean Grasp Success Rate: 99 | 10 |
| AI-generated video detection | Mean Across Frontier Commercial Generators | Accuracy: 87.25 | 7 |
| Mathematical Reasoning and Code Generation | Mean (GSM8K, MATH, HumanEval, MBPP) | Accuracy: 52.06 | 7 |
| Offline Reinforcement Learning | Mean Medium-Replay | Normalized Return: 76.45 | 7 |
| Offline Reinforcement Learning | Mean Medium | Normalized Return: 71.33 | 7 |
| Offline Reinforcement Learning | Mean Medium-Expert | Normalized Return: 98.5 | 7 |
| Physically-based rendering | Mean All scenes | PSNR: 31.8 | 4 |
| Mathematical Reasoning | Mean across benchmarks | Speedup: 2.12 | 2 |