R-Judge

Benchmarks

Task Name                            Dataset Name     SOTA Result     Trend
Agent Safety                         R-Judge          Accuracy 97.3   92
Trajectory-level safety evaluation   R-judge (test)   Accuracy 95.2   32
Binary safe/unsafe classification    R-Judge (test)   Accuracy 57.8   4