Personalized Dialogue Generation on Session-level evaluation dataset 1.0 (test)
[Chart: Personalization Consistency. Top result: Muse, 83.78 (Apr 15, 2026). Other selectable metrics: Goal Effectiveness, Dialogue Coherence, Constraint Compliance, Average Score.]
Evaluation Results
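| Method        | RL Stage | Date    | Personalization Consistency | Goal Effectiveness | Dialogue Coherence | Constraint Compliance | Average Score |
|---------------|----------|---------|-----------------------------|--------------------|--------------------|-----------------------|---------------|
| Muse          | Included | 2026.04 | 83.78                       | 79.67              | 86.85              | 87.39                 | 84.42         |
| GPT-4o        | -        | 2026.04 | 76.25                       | 75.86              | 81.87              | 78.34                 | 78.08         |
| Qwen3-8B      | -        | 2026.04 | 74.15                       | 62.56              | 71.59              | 71.81                 | 70.03         |
| Muse (w/o RL) | Excluded | 2026.04 | 68.13                       | 64.14              | 72.73              | 68.25                 | 68.31         |
| USP           | -        | 2026.04 | 42.66                       | 58.82              | 76.92              | 40.38                 | 54.69         |
| UserLM        | -        | 2026.04 | 32.84                       | 52.08              | 63.36              | 29.42                 | 44.42         |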
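The page does not say how Average Score is derived. The figures listed above are consistent with an unweighted mean of the four metric scores, and the short sketch below checks that assumption. The names `SCORES`, `REPORTED_AVERAGE`, and `recomputed_averages` are hypothetical, introduced only for this illustration; the numbers are copied from the results above.

```python
from statistics import mean

# Metric order: Personalization Consistency, Goal Effectiveness,
# Dialogue Coherence, Constraint Compliance (values copied from the results).
SCORES = {
    "Muse": (83.78, 79.67, 86.85, 87.39),
    "GPT-4o": (76.25, 75.86, 81.87, 78.34),
    "Qwen3-8B": (74.15, 62.56, 71.59, 71.81),
    "Muse (w/o RL)": (68.13, 64.14, 72.73, 68.25),
    "USP": (42.66, 58.82, 76.92, 40.38),
    "UserLM": (32.84, 52.08, 63.36, 29.42),
}

# Average Score column as reported on the page.
REPORTED_AVERAGE = {
    "Muse": 84.42,
    "GPT-4o": 78.08,
    "Qwen3-8B": 70.03,
    "Muse (w/o RL)": 68.31,
    "USP": 54.69,
    "UserLM": 44.42,
}


def recomputed_averages(scores):
    """Unweighted mean of the four metrics per method (assumed formula)."""
    return {method: mean(values) for method, values in scores.items()}


if __name__ == "__main__":
    for method, avg in recomputed_averages(SCORES).items():
        diff = abs(avg - REPORTED_AVERAGE[method])
        print(f"{method:14s} recomputed={avg:.4f} "
              f"reported={REPORTED_AVERAGE[method]:.2f} diff={diff:.4f}")
```

Every recomputed mean lands within 0.01 of the reported value. The small residuals for USP and UserLM are the kind that would arise if the page averaged unrounded metric scores, though that is a guess rather than something the page confirms.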