LLM behavior monitoring on SHADE-Arena (test)

75.2 Accuracy

Agent-ToM (external memory, Stage 1–2)

Updated 2mo ago

Evaluation Results

Method	Links	Accuracy
Agent-ToM (external memory, Stage 1–2) 2026.05		75.2	78.2	74.8	81.8
Sequential Ensemble 2026.05		72.5	72.7	61.8	88.4
Agent-ToM (external memory, Stage 1) 2026.05		69.3	71.7	65.5	79.2
Agent-ToM (Prompt-level learning) 2026.05		69.0	68.6	57.0	85.9
Agent-Auditor (nearest neighbor retrieval) 2026.05		66.3	69.7	65.3	74.8
ToM (zeroshot) 2026.05		65.9	61.5	45.7	93.7
LLM-Judge (zeroshot) 2026.05		63.5	59.6	45.4	86.9
Async Ensemble (Majority Vote) 2026.05		62.5	76.1	99.0	61.7