User Simulation

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Updated
Synthetic Exposure Overall	Llama-3.2-3B-Instruct +SFT+DPO	Accuracy (%)	55	19	3mo ago
MIND Real Exposure		Accuracy	34.8	19	3mo ago
τ-USI retail airline tasks (out-of-distribution)		Conv Score	87.4	16	1mo ago
UserLLM	DITTO	Primary Metric Score	93	15	1mo ago
Instruments	CoARS	F1 Score	38.12	13	3mo ago
MovieLens	CoARS	F1 Score	29.74	13	3mo ago
LastFM	CoARS	F1 Score	31.45	13	3mo ago
Adversarial User Simulation Dataset Turn-level (test)		Robotics Score	95	11	4mo ago
User Simulation Dataset Session-level (test)	UserLM-R1-32B	Role Score	95.21	11	4mo ago
WildChat (In-Distribution)		Turn Count	8.62	8	2mo ago
MirrorBench	DITTO	Realism Score (LLM-judge)	0.713	6	2mo ago
ConvApparel Efficient Matchmaker Ultra-terse Fast V2 (held-out agent)	Static, Agnostic	Avg. Words per Turn	25.4	5	2mo ago
ConvApparel Domain Expert Academic Verbose V2 (held-out agent)	Dynamic, Aware	Average Words per Turn	12.1	5	2mo ago
HUMANUAL (test)	HUMANLM	News Score	40.58	3	4mo ago
