
Hard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Jailbreak Defense | Hard (H) | FPR: 0 | 12 |
| Classification | HARD (test) | Accuracy: 97.77 | 8 |
| Online Learning | HARD | Latency (s): 0.2516 | 8 |
| Reasoning over Large Structured Context | Hard | ReasoningJudge Score: 5 | 4 |
| Joint Audio-Video Generation | Hard (test) | Sync-C: 6.12 | 4 |
| Online Bin Packing | Hard28-R | Gap Percentage: 8.06 | 4 |