Tool-use Agent Robustness on τ-bench
[Chart: Behavioral Uncertainty (BU) by method. Highlighted entry: PI Detector, 6.9 (Oct 6, 2025). Selectable metrics: Behavioral Uncertainty (BU), Unintended Action Rate (UA), Action Success Rate (ASR).]
Evaluation Results
Method         Backbone LLM  Date     BU     UA     ASR
PI Detector    GPT-4o        2025.10  6.9    5.65   0
None           GPT-4o        2025.10  51.73  47.4   56.09
Spotlighting   GPT-4o        2025.10  51.74  46.74  52.6
Repeat prompt  GPT-4o        2025.10  52.17  46.09  52.67
Sanitizer      GPT-4o        2025.10  59.09  63.91  0

BU = Behavioral Uncertainty, UA = Unintended Action Rate, ASR = Action Success Rate.