Agent Monitoring on CUA-SHADE-Arena (test)
Top result: Agent-ToM, 89.1 Accuracy
[Chart: Accuracy, May 22, 2026. Available metrics: Accuracy, F1 Score, Recall, Precision]
Updated 9d ago
Evaluation Results
Method              | Configuration                | Date    | Accuracy | F1 Score | Recall | Precision
Agent-ToM           | Variant=external memor...    | 2026.05 | 89.1     | 83.1     | 71     | 100
LLM-Judge           | Architecture=Single-Pa...    | 2026.05 | 88.3     | 81.5     | 68.8   | 100
Sequential Ensemble | Architecture=Ensemble-...    | 2026.05 | 86.6     | 80.5     | 73.3   | 89.1
Agent-ToM           | Variant=Prompt-level l...    | 2026.05 | 84       | 74       | 60     | 96
ToM                 | Architecture=Structure...    | 2026.05 | 80.8     | 66.6     | 51     | 95.8
Agent-ToM           | Variant=external memor...    | 2026.05 | 79.2     | 61.5     | 44.4   | 100
Async Ensemble      | Architecture=Ensemble-...    | 2026.05 | 56.3     | 60.6     | 90     | 45.4
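
The F1 Score, Recall, and Precision values above can be cross-checked against each other. The page does not say how F1 is computed, so the sketch below assumes the standard definition (the harmonic mean of precision and recall). The function name `f1_score` and the `rows` list are illustrative only; the numbers are copied from the results above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on a 0-100 scale).

    Assumption: the leaderboard uses the standard F1 definition.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# (method, precision, recall, reported F1), copied from the results above
rows = [
    ("Agent-ToM", 100, 71, 83.1),
    ("LLM-Judge", 100, 68.8, 81.5),
    ("Sequential Ensemble", 89.1, 73.3, 80.5),
    ("Async Ensemble", 45.4, 90, 60.6),
]

for name, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"{name}: computed {computed:.1f}, reported {reported}")
```

Under this assumption the computed values land within a few tenths of a point of the reported ones. One possible explanation for the gaps is that the displayed precision and recall are themselves rounded, though the page does not confirm this.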