DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process
About
Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B with fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available. The code, model, dataset, and demo have been released at http://ai-researcher.net.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automated Peer Review Evaluation | DeepReview-13K 1.0 (test) | H-Max Technical Accuracy | 5.19 | 30 |
| Scientific question generation | IntelliReward (test) | Effort | 0.00e+0 | 19 |
| AI Peer Review | PaperAudit-Dataset ICML branch | Novelty | 7.9 | 18 |
| Automated Peer Review | DeepReview-13K 2025 (test) | Technical Accuracy Win | 93.2 | 14 |
| Automated Peer Review | DeepReview-13K (test) | Technical Accuracy Win (%) | 1 | 10 |
| Coverage-based Alignment | ICLR 50 submissions 2026 | Str-Cov | 84.4 | 3 |
| Score-based Alignment | ICLR 2026 (50 submissions) | R-MSE | 0.16 | 3 |