Monitoring on APPS (test)
[Chart: pAUC. Top result: 81.6, FT-Completions-Randint, Exploit-Finder, High-Recall (May 14, 2026).]
Updated 16d ago
Evaluation Results
| Method | Details | Date | pAUC |
|---|---|---|---|
| FT-Completions-Randint, Exploit-Finder, High-Recall | Rank=1, Monitors=FT-Co... | 2026.05 | 81.6 |
| FT-Completions-Randint, Reference-Compare, Trusted-Debate | Rank=2, Monitors=FT-Co... | 2026.05 | 81.56 |
| FT-Completions-Randint, FT-Completions-GPT-5, Exploit-Finder | Rank=3, Monitors=FT-Co... | 2026.05 | 81.54 |
| Median 3-monitor ensemble | Monitors=Median 3-moni... | 2026.05 | 76.9 |
| 3× Baseline ensemble | Monitors=3× Baseline e... | 2026.05 | 76.21 |
| Baseline monitor | Monitors=Baseline monitor | 2026.05 | 72.26 |