Auto-ARGUE: LLM-Based Report Generation Evaluation

About

Generation of citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, tools designed for report generation are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for report generation evaluation. We present analysis of Auto-ARGUE on the report generation pilot task from the TREC 2024 NeuCLIR track and on two tasks from the TREC 2024 RAG track, showing good system-level correlations with human judgments. Additionally, we release ARGUE-Viz, a web app for visualization and fine-grained analysis of Auto-ARGUE judgments and scores.

William Walden, Marc Mason, Orion Weller, Laura Dietz, John Conroy, Neil Molino, Hannah Recknor, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, Dawn Lawrie, James Mayfield, Eugene Yang • 2025

Related benchmarks

Task | Dataset | Result | Rank
Report Generation Evaluation | RAGTIME | Spearman's Rho 0.748 | 3
Report Generation Evaluation | NeuCLIR | Spearman's Rho 0.804 | 3
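
The results above are Spearman's rho values, and the abstract describes them as system-level correlations with human judgments. As a rough illustration of what such a number measures, the sketch below computes Spearman's rho between two lists of per-system scores using only the Python standard library. It is not the authors' evaluation code: the function names and the example scores are hypothetical, and the page does not say how per-system scores were aggregated.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-system scores (one entry per system), NOT data from the paper.
automatic_scores = [0.62, 0.48, 0.71, 0.55, 0.40]
human_scores = [0.60, 0.52, 0.58, 0.65, 0.35]

rho = spearman_rho(automatic_scores, human_scores)
```

A value near 1 means the automatic metric ranks systems in nearly the same order as the human judgments, regardless of the absolute score values.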
