Trajectory Auditing on AFTRAJ-2K (D_safe D_unsafe)
[Chart: False Alarm Rate (FAR) over time. Highlighted point: AgentForesight-7B, FAR 2.37, May 9, 2026. Selectable metrics: False Alarm Rate (FAR), Step Accuracy (Step-Acc).]
Updated 23d ago
Evaluation Results
| Method | Model Category | Date | False Alarm Rate (FAR) | Step Accuracy (Step-Acc) |
|---|---|---|---|---|
| AgentForesight-7B | Ours | 2026.05 | 2.37 | 59.51 |
| DeepSeek-V4-Pro | Proprie... | 2026.05 | 43.2 | 53.99 |
| Qwen2.5-7B-Instruct | Open-So... | 2026.05 | 46.15 | 36.2 |
| Qwen3-8B | Open-So... | 2026.05 | 56.8 | 38.04 |
| DeepSeek-V4-Flash | Proprie... | 2026.05 | 59.76 | 47.24 |
| Gemini-3-Flash | Proprie... | 2026.05 | 67.86 | 38.04 |
| Claude-Haiku-4.5 | Proprie... | 2026.05 | 68.64 | 33.13 |
| GPT-4.1 | Proprie... | 2026.05 | 85.8 | 38.04 |
| Llama3.2-3B | Open-So... | 2026.05 | 90.53 | 20.86 |
| Gemma3-4B | Open-So... | 2026.05 | 97.63 | 10.43 |