
Ours

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Tool-use Reasoning | Ours (test) | Solve Precision (P): 52.78 | 27 |
| Causal Discovery | Ours Noisy | AUROC: 82.3 | 9 |
| Causal Discovery | Ours Original | AUROC: 0.821 | 9 |
| Instruction Following Evaluation | Ours hard seed data | Score: 56.73 | 5 |
| Language Detoxification | Ours (test) | Overall Offensiveness Score: 1.145 | 5 |
| Makeup Transfer | Ours (test) | FID: 11.67 | 4 |
| Fine-grained Score Accuracy | Ours | Exact Accuracy: 70.56 | 1 |