
Trajectory-level safety evaluation on ATBench (test)

Accuracy: 0.928

AgentDoG-Qwen3-4B

Jan 26, 2026
Updated 3mo ago

Evaluation Results

Date     Accuracy  Precision  Recall  F1
2026.01  0.928     0.905      0.956   0.93
2026.01  0.9       0.847      0.976   0.907
2026.01  0.876     0.809      0.984   0.888
2026.01  0.874     0.821      0.956   0.884
2026.01  0.872     0.927      0.808   0.863
2026.01  0.846     0.994      0.696   0.819
2026.01  0.76      0.901      0.584   0.709
2026.01  0.756     0.985      0.52    0.681
2026.01  0.738     0.687      0.876   0.77
2026.01  0.63      0.933      0.28    0.431
2026.01  0.616     0.641      0.528   0.579
2026.01  0.594     0.658      0.392   0.491
2026.01  0.581     1          0.164   0.282
2026.01  0.569     0.626      0.348   0.447
2026.01  0.553     1          0.108   0.195
2026.01  0.533     1          0.068   0.127
2026.01  0.511     0.8        0.032   0.062
2026.01  0.509     0.778      0.028   0.054
2026.01  0.499     0.5        0.412   0.452
2026.01  0.496     0.498      0.992   0.663
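
Each result row carries four scores, and the first matches the headline accuracy. Assuming the three values after accuracy are precision, recall, and F1, in that order (a reading the page itself does not spell out), the last value in each row should equal the harmonic mean of the two before it. The sketch below checks that on three rows copied from the results. The helper name `f1_score`, the `rows` sample, and the rounding tolerance are illustrative choices, not part of the leaderboard.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (accuracy, precision, recall, reported F1), copied from three result rows.
# The column meanings are an assumption, as noted above.
rows = [
    (0.928, 0.905, 0.956, 0.930),
    (0.581, 1.000, 0.164, 0.282),
    (0.496, 0.498, 0.992, 0.663),
]

# Scores are published to three decimals, so allow a small rounding slack.
TOLERANCE = 0.0015

for accuracy, precision, recall, reported_f1 in rows:
    computed = f1_score(precision, recall)
    matches = abs(computed - reported_f1) < TOLERANCE
    print(f"acc={accuracy:.3f}  reported F1={reported_f1:.3f}  "
          f"computed F1={computed:.4f}  consistent={matches}")
```

Read this way, the rows with precision 1 and very low recall describe models that almost never flag a trajectory as unsafe, while the final row (recall 0.992, precision 0.498) describes one that flags nearly everything. Both behaviours leave accuracy close to 0.5.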