Factual Consistency on RecRM-Bench
[Chart: Accuracy (%) on RecRM-Bench. Top result: Ours, 70.71, May 12, 2026. Selectable metrics: Accuracy (%), F1 Score (%).]
Updated 21d ago
Evaluation Results
| Method | Date | Accuracy (%) | F1 Score (%) |
|---|---|---|---|
| Ours | 2026.05 | 70.71 | 82.84 |
| LongCat-Flash-Chat (Thinking=false) | 2026.05 | 67.34 | 80.48 |
| LongCat-Flash-Thinking (Thinking=true) | 2026.05 | 64.98 | 78.78 |
| GPT-4.1 | 2026.05 | 62.96 | 77.27 |
| Qwen3-Max (Thinking=true) | 2026.05 | 56.57 | 72.26 |
| Qwen3-Max (Thinking=false) | 2026.05 | 53.2 | 69.45 |
| Deepseek-V3.2 (Thinking=false) | 2026.05 | 43.43 | 60.56 |
| Deepseek-V3.2 (Thinking=true) | 2026.05 | 41.41 | 58.57 |
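A note on reading the two metrics together: in the results above, every reported F1 score agrees with 2A / (1 + A) to within rounding, where A is the accuracy expressed as a fraction. That is the value F1 takes when recall is 1 and precision equals accuracy. The source does not say how F1 is computed, so this is only an observed arithmetic pattern and not a description of the benchmark's scoring. The sketch below checks it; the helper names `f1_from_accuracy` and `max_gap` are made up for this illustration.

```python
# (accuracy %, F1 %) pairs copied from the results listed above.
rows = {
    "Ours": (70.71, 82.84),
    "LongCat-Flash-Chat (Thinking=false)": (67.34, 80.48),
    "LongCat-Flash-Thinking (Thinking=true)": (64.98, 78.78),
    "GPT-4.1": (62.96, 77.27),
    "Qwen3-Max (Thinking=true)": (56.57, 72.26),
    "Qwen3-Max (Thinking=false)": (53.2, 69.45),
    "Deepseek-V3.2 (Thinking=false)": (43.43, 60.56),
    "Deepseek-V3.2 (Thinking=true)": (41.41, 58.57),
}


def f1_from_accuracy(acc_pct: float) -> float:
    """F1 (%) under the ASSUMPTION recall = 1 and precision = accuracy."""
    a = acc_pct / 100.0
    return 100.0 * 2.0 * a / (1.0 + a)


def max_gap(table: dict) -> float:
    """Largest absolute difference between reported and implied F1."""
    return max(abs(f1_from_accuracy(acc) - f1) for acc, f1 in table.values())


if __name__ == "__main__":
    for name, (acc, f1) in rows.items():
        implied = f1_from_accuracy(acc)
        print(f"{name:40s} reported={f1:6.2f} implied={implied:6.2f}")
    print(f"max gap: {max_gap(rows):.4f}")
```

If the pattern holds by construction, the F1 column ranks methods identically to the accuracy column and carries no independent information, which is worth keeping in mind when comparing rows.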