
Hard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Jailbreak Defense | Hard (H) | FPR: 0 | 12 |
| Classification | HARD (test) | Accuracy: 97.77 | 8 |
| Online Learning | HARD | Latency (s): 0.2516 | 8 |
| RO reformulation | Hard (Out-of-Distribution) | Accuracy: 94.8 | 6 |
| Speech Separation | Hard (test) | SI-SDR (dB): 9.31 | 4 |
| Reasoning over Large Structured Context | Hard | ReasoningJudge Score: 5 | 4 |
| Joint Audio-Video Generation | Hard (test) | Sync-C: 6.12 | 4 |
| Online Bin Packing | Hard28-R | Gap Percentage: 8.06 | 4 |
| First Integral Discovery | Hard (test) | Accuracy: 63.7 | 2 |