
Rome

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
| --- | --- | --- | --- | --- |
| Agent Safety Judgment | ROME Shortcut Decision-Making 1.0 | F1-Score | 94.53 | 24 |
| Agent Safety Judgment | ROME Contextual Ambiguity 1.0 | F1 Score | 91.92 | 24 |
| Agent Safety Judgment | ROME Implicit Risks 1.0 (IR) | F1-Score | 86.81 | 24 |
| Agent Safety Judgment | ROME Original 1.0 (Seed) | F1-Score | 94.53 | 24 |
| Agent Safety Judgment | ROME IR unsafe | F1 Score | 95.74 | 4 |
| Agent Safety Judgment | ROME CA unsafe subset | F1 Score | 97.35 | 4 |
| Agent Safety Judgment | ROME SDM unsafe | F1 Score | 100 | 4 |
| Land Surface Temperature Estimation | Rome | RMSE | 2.088 | 2 |