
DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process

About

Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B while using fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available. The code, model, dataset, and demo have been released at http://ai-researcher.net.
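The multi-stage process described in the abstract (structured analysis, literature retrieval, evidence-based argumentation) can be sketched as a simple staged pipeline. This is a hypothetical illustration only: the stage names, the `ReviewState` data shape, and the aggregation rule are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a multi-stage review pipeline in the spirit of
# DeepReview. Stage names and data structures are assumptions; a real
# system would call an LLM and a literature index at each stage.
from dataclasses import dataclass, field


@dataclass
class ReviewState:
    paper: str
    analysis: str = ""
    evidence: list = field(default_factory=list)
    arguments: list = field(default_factory=list)
    verdict: str = ""


def structured_analysis(state: ReviewState) -> ReviewState:
    # Stage 1: decompose the paper into claims worth checking.
    state.analysis = f"claims extracted from: {state.paper[:40]}"
    return state


def literature_retrieval(state: ReviewState) -> ReviewState:
    # Stage 2: ground the analysis in retrieved related work
    # (a real system would query a citation index here).
    state.evidence.append("related-work snippet for " + state.analysis)
    return state


def evidence_based_argumentation(state: ReviewState) -> ReviewState:
    # Stage 3: turn each piece of evidence into a grounded review argument.
    for ev in state.evidence:
        state.arguments.append(f"argument grounded in: {ev}")
    return state


def final_decision(state: ReviewState) -> ReviewState:
    # Stage 4: aggregate the arguments into a structured verdict.
    state.verdict = "accept" if state.arguments else "reject"
    return state


def deep_review(paper: str) -> ReviewState:
    # Run the stages in order, threading one state object through.
    state = ReviewState(paper=paper)
    for stage in (structured_analysis, literature_retrieval,
                  evidence_based_argumentation, final_decision):
        state = stage(state)
    return state
```

The point of the staged design is that each step consumes the previous step's structured output rather than a free-form LLM transcript, which is what makes evidence-based argumentation checkable.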

Minjun Zhu, Yixuan Weng, Linyi Yang, Yue Zhang• 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Review Feedback Generation | RMR-75K (val) | Pairwise Win Rate | 64.8 | 72 |
| Automated Peer Review Evaluation | DeepReview-13K 1.0 (test) | H-Max Technical Accuracy | 5.19 | 30 |
| Scientific question generation | IntelliReward (test) | Effort | 0.00e+0 | 19 |
| AI Peer Review | PaperAudit-Dataset ICML branch | Novelty | 7.9 | 18 |
| Paper Acceptance Decision | ICLR 2025 (test) | Accuracy | 68.45 | 15 |
| Paper Quality Evaluation | ICLR 2025 (test) | Jaccard Index | 31.27 | 15 |
| Automated Peer Review | DeepReview-13K 2025 (test) | Technical Accuracy Win | 93.2 | 14 |
| Automated Peer Review | DeepReview-13K (test) | Technical Accuracy Win (%) | 1 | 10 |
| Scientific rebuttal generation | Scientific Rebuttal Evaluation dataset (test) | BLEU@4 | 12.4 | 9 |
| Scientific Review Feedback Generation | ICLR LLM-as-a-Judge 2025 (test) | Actionability Score | 3.23 | 9 |
(Showing 10 of 17 benchmark rows.)
