
Evaluation dataset

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
| --- | --- | --- | --- | --- |
| Tool-use | Evaluation dataset | Accuracy | 51.98 | 20 |
| Compositional Generalization | Evaluation Dataset (Unseen Average) | Score | 42.86 | 18 |
| Compositional Generalization | Evaluation Dataset Seen Average | Score | 62.34 | 18 |
| Compositional Generalization | Evaluation Dataset Unseen (Fold 3) | Score | 0.4022 | 18 |
| Compositional Generalization | Evaluation Dataset (Fold 3 Seen) | Score | 66.69 | 18 |
| Compositional Generalization | Evaluation Dataset Unseen (Fold 2) | Score | 50 | 18 |
| Compositional Generalization | Evaluation Dataset (Fold 2 Seen) | Score | 63.63 | 18 |
| Compositional Generalization | Evaluation Dataset Unseen (Fold 1) | Score | 0.4818 | 18 |
| Compositional Generalization | Evaluation Dataset (Fold 1 Seen) | Score | 0.6191 | 18 |
| Compositional Generalization | Evaluation Dataset (Full) | Score | 0.6379 | 18 |
| Malicious Package Detection | Evaluation Dataset | Accuracy | 99.5 | 11 |
| Correlation analysis with ground truth | Evaluation Dataset 2000 samples | Pearson Correlation Coefficient | 0.754 | 7 |
| Global 3D Editing | Evaluation dataset unseen 3D assets (test) | CLIP Similarity | 0.272 | 6 |
| Local 3D Editing | Evaluation dataset unseen 3D assets (test) | CLIP Similarity | 0.292 | 6 |
| Image-to-3D Generation | Evaluation Dataset | FID | 34.251 | 2 |
| Inconsistency detection | Evaluation dataset Full (4,556 skills) | Total Flagged Count | 487 | 1 |