
Tool-use Agent Robustness on τ-bench

Behavioral Uncertainty (BU): 6.9

PI Detector

Oct 6, 2025
Updated 26d ago

Evaluation Results

Method    Links
2025.10
6.95.650
2025.10
51.7347.456.09
2025.10
51.7446.7452.6
2025.10
52.1746.0952.67
2025.10
59.0963.910