Trajectory-level safety evaluation on R-judge (test)

95.2 Accuracy

Gemini-3-Flash

[Chart: results over time, Jan 26, 2026 to May 30, 2026]
Updated 1d ago

Evaluation Results

Date      Accuracy  Metric 2  Metric 3  Metric 4
2026.01   95.2      98.6      92.3      95.3
2026.01   94.3      94        95.3      94.7
2026.05   93.1      88.24     97.16     93.75
2026.05   91.95     88.07     97.78     92.63
2026.01   91.8      88        98        92.7
2026.01   91.7      88.2      97.3      92.5
2026.01   90.8      87.1      97        91.8
2026.05   90.8      89.36     93.34     91.31
2026.01   89.5      95.8      83.9      89.5
2026.05   87.36     96.96     88.89     87.91
2026.05   86.21     78.95     96.04     88.24
2026.05   86.05     82.14     93.18     87.23
2026.01   85.1      80.6      94.6      87
2026.05   85.06     88.1      82.42     85.06
2026.01   81        74.1      98.7      84.6
2026.05   80.46     75.93     92.01     82.83
2026.01   78.2      71.6      97.3      82.5
2026.05   72.41     67.21     91.26     77.36
2026.01   68.4      73.8      62.4      67.6
2026.01   68.4      77.5      56.7      65.5
2026.01   63.8      71.9      56.4      63.2
2026.01   61.2      73        46.4      56.7
2026.05   57.12     56.7      77.45     65.47
2026.05   56.76     62.93     42.16     50.49
2026.05   55.17     62.99     26.14     36.95
2026.01   54.4      60.2      40.6      48.5
2026.01   54.3      54.1      87.9      67
2026.01   53.7      53.3      100       69.5
2026.01   52.5      57.1      40.3      47.2
2026.01   47.7      100       1         2
2026.01   47.7      100       1         2
2026.01   40.6      25.4      5.5       9
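
Only the first value in each result row is labeled on the page (Accuracy). The other three values carry no column headers. One plausible reading, offered here as an assumption rather than something the page states, is that they are precision, recall, and F1, since the fourth value often equals the harmonic mean of the second and third. A minimal Python sketch that checks this on five rows copied from the results above (the names `f1`, `rows`, and `max_f1_gap` are illustrative):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (accuracy, value 2, value 3, value 4) copied from five result rows.
# Treating values 2-4 as precision, recall, F1 is an ASSUMPTION;
# the page does not label these columns.
rows = [
    (95.2, 98.6, 92.3, 95.3),
    (91.7, 88.2, 97.3, 92.5),
    (89.5, 95.8, 83.9, 89.5),
    (56.76, 62.93, 42.16, 50.49),
    (55.17, 62.99, 26.14, 36.95),
]


def max_f1_gap(rows):
    """Largest absolute gap between the recomputed F1 and the reported fourth value."""
    return max(abs(f1(p, r) - reported) for _, p, r, reported in rows)


if __name__ == "__main__":
    for acc, p, r, reported in rows:
        print(f"acc={acc:6.2f}  recomputed F1={f1(p, r):6.2f}  reported={reported:6.2f}")
    print(f"max gap: {max_f1_gap(rows):.3f}")
```

On these five rows the recomputed value stays within 0.1 of the reported one. Not every row in the results agrees this closely, so the column reading should be treated as a hypothesis, not a confirmed schema.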