Utterance-level User Simulation on Chinese User Simulation Dataset
[Chart: AI Probability by method; highlighted entry: UserLM, 69.92 (Apr 15, 2026). Selectable metrics: AI Probability, Style Similarity, AVA Score, Context Relevance, Response Fidelity, Goal Control, Linguistic Naturalness.]
Updated 3d ago
Evaluation Results
| Method | Details | Date | AI Probability | Style Similarity | AVA Score | Context Relevance | Response Fidelity | Goal Control | Linguistic Naturalness |
|---|---|---|---|---|---|---|---|---|---|
| UserLM | Model=UserLM | 2026.04 | 69.92 | 58.88 | 55.38 | 51.11 | 52.89 | 47.42 | 56.55 |
| GPT-4o | Model=GPT-4o | 2026.04 | 45.5 | 71.81 | 62.79 | 88.91 | 92.21 | 83.04 | 92.14 |
| Qwen3-8B | Model=Qwen3-8B | 2026.04 | 44.45 | 73.07 | 64.69 | 86.44 | 91.02 | 78.88 | 89.76 |
| USP | Model=USP | 2026.04 | 43.14 | 66.44 | 59.25 | 63.84 | 68.74 | 57.79 | 73.14 |
| Muse (w/o RL) | Reinforcement Learning... | 2026.04 | 37.98 | 76.06 | 64.8 | 90.46 | 95.91 | 85.3 | 93.89 |
| Muse | Reinforcement Learning... | 2026.04 | 31.18 | 75.34 | 64.89 | 91.96 | 97.76 | 87.63 | 96.2 |