$T^5Score$: A Methodology for Automatically Assessing the Quality of LLM Generated Multi-Document Topic Sets

About

Using LLMs for multi-document topic extraction has recently gained popularity due to their apparently high-quality outputs, expressiveness, and ease of use. However, most existing evaluation practices are not designed for LLM-generated topics and result in low inter-annotator agreement scores, hindering the reliable use of LLMs for the task. To address this, we introduce $T^5Score$, an evaluation methodology that decomposes the quality of a topic set into quantifiable aspects, measurable through easy-to-perform annotation tasks. This framing enables a convenient evaluation procedure, manual or automatic, that yields strong inter-annotator agreement. To substantiate our methodology and claims, we perform extensive experimentation on multiple datasets and report the results.

Itamar Trainin, Omri Abend • 2024

Related benchmarks

Task	Dataset	Result
Topic Generation	USC SF	--	13
Topic Generation	Multi-News	--	8
Topic Generation	Human Baseline	--	8
