User Simulation Behavioral Alignment on tau2-bench Retail + Airline (test)
[Chart: HL Score. Top result: Humans, 95.8 (May 13, 2026). Selectable metrics: HL Score, Coverage, Overall Score, Dimension 1-4 Scores (D1-D4), User Simulation ID 1-4 Score.]
Updated 20d ago
Evaluation Results
| Method | Method details | Date | HL Score | Coverage | Overall Score | Dimension 1 Score (D1) | Dimension 2 Score (D2) | Dimension 3 Score (D3) | Dimension 4 Score (D4) | User Simulation ID 1-4 Score |
|---|---|---|---|---|---|---|---|---|---|---|
| Humans | User Simulator=DeepSee... | 2026.05 | 95.8 | 62.3 | 79 | 94.9 | 97.8 | 88.6 | 92.2 | 93.4 |
| PPol: Evolved | User Simulator=DeepSee... | 2026.05 | 57 | 65.7 | 61.4 | 65.4 | 93.7 | 84 | 54.8 | 74.5 |
| PPol: Initial | User Simulator=DeepSee... | 2026.05 | 41 | 32.3 | 36.7 | 42.7 | 85.3 | 43.8 | 33.2 | 51.2 |
| DP Personas | User Simulator=DeepSee... | 2026.05 | 29.2 | 34 | 31.6 | 37.1 | 75.9 | 49.8 | 32.9 | 48.9 |
| Base-simulator | User Simulator=DeepSee... | 2026.05 | 12.3 | 14.6 | 13.5 | 32 | 88.7 | 53.4 | 54.4 | 57.1 |