User Simulation on MirrorBench
[Chart: Realism Score (LLM-judge). Top entry: DITTO, 0.713. Date shown: May 19, 2026. Updated 13d ago.]
Evaluation Results
| Method | Notes | Date | Realism Score (LLM-judge) |
|---|---|---|---|
| DITTO | Backbone=Qwen3-VL-8B-I... | 2026.05 | 0.713 |
| GRPO | Backbone=Qwen3-VL-8B-I... | 2026.05 | 0.683 |
| Qwen3-VL-8B-Instruct | Role=Base | 2026.05 | 0.547 |
| GPT-5.4 | | 2026.05 | 0.536 |
| HumanLM-8B | | 2026.05 | 0.481 |
| GPT-5-nano | | 2026.05 | 0.358 |