
Multiple

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Video Understanding | Multiple Aggregate | Average Score: 69.8 | 18 |
| Generalist Multi-task Evaluation | Multiple (ImageNet-1K, COCO) | Mean Delta: -11.8 | 13 |
| Factuality Detection | Multiple TriviaQA, HotpotQA, CSQA | Average AUROC: 72.9 | 4 |
| Code Generation | Multiple | Score: 78.51 | 3 |
| Controllable Language Generation | Multiple Distributional Constraint | Ctrl: 0.95 | 3 |