
held-out

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multi-path Speculative Decoding | Held-out (test) | Average Block Efficiency: 6.84 | 24 |
| Bargaining | Held-Out (test) | Reward: 0.7664 | 16 |
| Query routing and tool-calling accuracy evaluation | Held-out 12,282 examples (test) | Accuracy: 89.39 | 15 |
| Tone Mapping | Held-out (test) | PSNR: 40.59 | 6 |
| Clinical case generation | Held-out (test) | BLEU-4: 18.98 | 6 |
| Selective Classification | Held-out (test) | Coverage: 100 | 5 |
| Pairwise preference ranking | Held-out | ELO Score: 1,187 | 5 |
| License Plate Recognition | held-out (test) | Plate Accuracy: 92.3 | 5 |
| Event-level market-impact prediction | Held-out 2021-2023 (test) | Non-neutral F1: 35.6 | 4 |
| Binary-level classification | held-out (test) | Accuracy: 98.4 | 4 |
| binary classification | held-out n=2,332 (test) | Accuracy: 99.61 | 4 |
| Supply chain disruption forecasting | Held-out (test) | Brier Score: 0.0791 | 4 |
| Joint Attention detection | Held-out (test) | Accuracy: 77.6 | 3 |
| Knowledge Conflict Resolution | Held-out 30 q | Accuracy: 76.7 | 3 |