Sabotage detection on BigCodeBench Sabotage (reasoning LLM attacker)

0.87 log-AUROC

Extract-and-Evaluate

Updated 4mo ago

Evaluation Results

Method		log-AUROC	Links
Extract-and-Evaluate 2026.01		0.87	-
Action-only 2026.01		0.73	-
Extract-and-Evaluate 2026.01		0.73	-
Action-only 2026.01		0.61	-
CoT-only 2026.01		0.58	-
CoT+action 2026.01		0.55	-
CoT+action 2026.01		0.48	-
CoT-only 2026.01		0.45	-