Policy-trajectory compliance evaluation on POLICYGUARDBENCH
Chart: best F1 Score 88.83 (LLAMA-3.3-70B-INSTRUCT, Oct 3, 2025). Metrics shown: F1 Score, EA-F1 Score. Updated 14d ago.
Evaluation Results
Method                          Details                    Date     F1 Score  EA-F1 Score
LLAMA-3.3-70B-INSTRUCT          SIZE=70B, LATENCY=305....  2025.10  88.83     2.9125
GEMMA-3-12B-IT                  SIZE=12B, LATENCY=51.3...  2025.10  87.73     17.1014
POLICYGUARD-4B                  SIZE=4B, LATENCY=22.5,...  2025.10  87.59     38.9289
QWEN3-235B-A22B-INSTRUCT-2507   SIZE=235B, LATENCY=364...  2025.10  86.9      0.2387
CLAUDE-SONNET-4                 LATENCY=1238.0             2025.10  86.78     0.701
QWEN2.5-72B-INSTRUCT            SIZE=72B, LATENCY=205....  2025.10  86.07     4.1985
GEMMA-3-27B-IT                  SIZE=27B, LATENCY=73.6...  2025.10  85.2      11.5761
GEMINI-1.5-PRO                  LATENCY=596.1              2025.10  85.02     1.4263
DEEPSEEK-V3.1 (NON-THINKING)    SIZE=685B, LATENCY=3270.0  2025.10  84.07     0.2571
LLAMA-4-SCOUT-17B-16E-INSTRUCT  SIZE=109B, LATENCY=265...  2025.10  81.98     3.0936
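
The page does not define EA-F1. For the three entries above whose latency is shown in full (CLAUDE-SONNET-4, GEMINI-1.5-PRO, DEEPSEEK-V3.1), the listed values appear consistent with EA-F1 ≈ 10 × F1 / LATENCY. This is an inference from the numbers, not a definition given by the source, and the latency unit is not stated. A minimal Python sketch of that check follows; the function names `ea_f1_estimate` and `check` are hypothetical.

```python
# Rows copied from the leaderboard: (method, F1 score, latency, listed EA-F1).
# Only entries whose latency is not truncated on the page are included.
ROWS = [
    ("CLAUDE-SONNET-4", 86.78, 1238.0, 0.701),
    ("GEMINI-1.5-PRO", 85.02, 596.1, 1.4263),
    ("DEEPSEEK-V3.1 (NON-THINKING)", 84.07, 3270.0, 0.2571),
]


def ea_f1_estimate(f1_percent, latency):
    """Assumed relation, inferred from the table rather than stated by the source."""
    return 10.0 * f1_percent / latency


def check(rows, tol=1e-4):
    """Map each method to True if the estimate matches the listed EA-F1 within tol."""
    return {
        name: abs(ea_f1_estimate(f1, latency) - listed) <= tol
        for name, f1, latency, listed in rows
    }


if __name__ == "__main__":
    for name, ok in check(ROWS).items():
        print(f"{name}: estimate matches listed EA-F1 = {ok}")
```

If the relation holds in general, it would explain why smaller, faster models such as POLICYGUARD-4B rank far higher on EA-F1 than on F1 alone, but the truncated latency values for those rows prevent confirming it here.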