RO reformulation on Hard (Out-of-Distribution)
[Chart: Accuracy by date, with an alternate Output Tokens view. Top result: AutoREM, 94.8 accuracy, May 12, 2026.]
Updated 21d ago
Evaluation Results
Method          Configuration                Date      Accuracy   Output Tokens
AutoREM         Base LLM=DeepSeek-V4-F...    2026.05   94.8       6,944
Max Thinking    Base LLM=DeepSeek-V4-F...    2026.05   83.3       14,902
Expert Prompt   Base LLM=DeepSeek-V4-F...    2026.05   83.3       7,549
ACE             Base LLM=DeepSeek-V4-F...    2026.05   81.3       5,238
ReasoningBank   Base LLM=DeepSeek-V4-F...    2026.05   80.2       8,089
Base LLM        Base LLM=DeepSeek-V4-F...    2026.05   70.8       9,026