Instruction Following on RecRM-Bench
[Chart: Accuracy and F1 Score views. Highlighted point: Ours, 72.66 Accuracy, May 12, 2026.]
Updated 21d ago
Evaluation Results
| Method | Thinking | Date | Accuracy | F1 Score |
|---|---|---|---|---|
| Ours | - | 2026.05 | 72.66 | 72.4 |
| LongCat-Flash-Chat | false | 2026.05 | 64.84 | 65.19 |
| GPT-4.1 | - | 2026.05 | 55.47 | 58.33 |
| LongCat-Flash-Thinking | true | 2026.05 | 43.75 | 48.16 |
| Qwen3-Max | false | 2026.05 | 36.8 | 40.33 |
| Deepseek-V3.2 | true | 2026.05 | 35.29 | 39.5 |
| Deepseek-V3.2 | false | 2026.05 | 30.47 | 29.94 |
| Qwen3-Max | true | 2026.05 | 26.67 | 26.64 |