User Simulation on Instruments
[Chart: F1 Score over time. Top result: CoARS, F1 Score 38.12, Apr 11, 2026.]
Updated 5d ago
Evaluation Results
| Method | Backbone | Date | F1 Score |
|---|---|---|---|
| CoARS | Qwen3-8B | 2026.04 | 38.12 |
| iAgent | GPT-5.4-mini | 2026.04 | 36.84 |
| AFL | GPT-5.4-mini | 2026.04 | 34.84 |
| iAgent | Qwen3-8B | 2026.04 | 31.46 |
| CoARS | Qwen3-4B | 2026.04 | 28.54 |
| RecoWorld | Qwen3-8B | 2026.04 | 26.64 |
| iAgent | Qwen3-4B | 2026.04 | 26.42 |
| AFL | Qwen3-8B | 2026.04 | 23.84 |
| AFL | Qwen3-4B | 2026.04 | 19.84 |
| Reflexion | GPT-5.4-mini | 2026.04 | 14.56 |
| Reflexion | Qwen3-8B | 2026.04 | 12.84 |
| RecoWorld | Qwen3-4B | 2026.04 | 12.56 |
| Reflexion | Qwen3-4B | 2026.04 | 10.43 |