
TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

About

LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well-structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types. To bridge the gap, we introduce TADDLE, a Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews, together with the first expert-annotated benchmark for this task. Our benchmark comprises 1,800 reviews on 50 ICLR 2025 papers, multi-label-annotated by 18 domain experts against a taxonomy of six defect categories (plus a non-deficient label). TADDLE decomposes detection into four specialized analysis tools -- Verify, Correct, Complete, and Transform -- orchestrated by an agent; an integrator synthesizes their outputs into binary and multi-label classifications via two-stage semi-supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi-label classification task. We release the benchmark and code at https://github.com/AquariusAQ/TADDLE.

Hanqi Duan, Xiang Li • 2026

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Binary deficiency detection | gold-labeled ICLR (test) | Accuracy: 86 | 16
Persona Discrimination | ICML Cross-conference | Persona Separability (Δ): 0.453 | 16
Persona Discrimination | NeurIPS Cross-conference | Persona Separability (Δ): 0.418 | 16
Persona Discrimination | Llama Cross-generator 3.3-70B | Persona Separability (Δ): 0.427 | 16
Persona Discrimination | MiniMax Cross-generator M2.5 | Persona Separability (Δ): 0.281 | 16
Fine-grained multi-label classification | ICLR gold-labeled (test) | Jaccard Similarity: 74.24 | 8
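
For readers unfamiliar with the Jaccard Similarity metric reported for the multi-label task, the following is a minimal sketch of one common way to compute it: the per-review Jaccard index between predicted and gold label sets, averaged over reviews. The averaging scheme, the function names, and the label names (`defect_a`, `non_deficient`, and so on) are assumptions for illustration; the listing does not state how the paper computes the score or what the six defect categories are called.

```python
from typing import Sequence, Set


def jaccard(pred: Set[str], gold: Set[str]) -> float:
    """Jaccard index |pred & gold| / |pred | gold| for one review.

    Two empty label sets are treated as a perfect match (a convention
    assumed here, not taken from the paper).
    """
    union = pred | gold
    if not union:
        return 1.0
    return len(pred & gold) / len(union)


def mean_jaccard(preds: Sequence[Set[str]], golds: Sequence[Set[str]]) -> float:
    """Average the per-review Jaccard index over all reviews."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        raise ValueError("at least one review is required")
    return sum(jaccard(p, g) for p, g in zip(preds, golds)) / len(preds)


# Hypothetical label names; the real taxonomy is defined in the paper.
gold = [{"defect_a", "defect_b"}, {"non_deficient"}, {"defect_c"}]
pred = [{"defect_a"}, {"non_deficient"}, {"defect_b", "defect_c"}]

# Per-review scores are 1/2, 1 and 1/2, so the mean is 2/3.
score = mean_jaccard(pred, gold)
print(f"mean Jaccard: {score:.4f}")
```

This sketch returns a value between 0 and 1; if the 74.24 in the table is a percentage, it would correspond to 0.7424 on this scale.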
