
Ours

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Tool-use Reasoning | Ours (test) | Solve Precision (P): 52.78 | 27 |
| Causal Discovery | Ours Noisy | AUROC: 82.3 | 9 |
| Causal Discovery | Ours Original | AUROC: 0.821 | 9 |
| Instruction Following Evaluation | Ours hard seed data | Score: 56.73 | 5 |
| Language Detoxification | Ours (test) | Overall Offensiveness Score: 1.145 | 5 |
| Harmful content detection | Ours trolling-oriented synthetic | Accuracy: 19.88 | 4 |
| Harmful content detection | Ours CADD-based synthetic | Accuracy: 65.55 | 4 |
| Makeup Transfer | Ours (test) | FID: 11.67 | 4 |
| Radar Human Pose Estimation | Ours | MPJPE (cm): 6.425 | 1 |
| Differential Diagnosis | Ours Auxiliary | Top-5 Accuracy: 80 | 1 |
| Fine-grained Score Accuracy | Ours | Exact Accuracy: 70.56 | 1 |