Refining and Reusing Annotation Guidelines for LLM Annotation

About

While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning-optimized models, and (3) the viability of moderation under minimal supervision. In experiments on biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows promise for refining guidelines, our analysis also reveals substantial room for improvement.

Kon Woo Kim, Jin-Dong Kim, Akiko Aizawa • 2026

Related benchmarks

Task	Dataset	Result
Named Entity Recognition	BC5CDR	F1 Score: 86	102
Named Entity Recognition	BioRED	F1 Score (%): 82	19
Entity Annotation	NCBI 791	Precision: 82	9
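
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of entity-level precision, recall, and F1 under exact-match scoring. The exact-match criterion and the `(start, end, type)` span representation are assumptions made for illustration; the listing does not state which matching scheme these benchmarks use, and the example spans are invented.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start_offset, end_offset, entity_type).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match scoring: a prediction counts only if its offsets and type match a gold span."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Invented example: three gold entities, four predicted entities.
gold = {(0, 12, "Disease"), (30, 38, "Chemical"), (50, 61, "Disease")}
predicted = {
    (0, 12, "Disease"),    # correct
    (30, 38, "Disease"),   # right offsets, wrong type: counted as an error
    (50, 61, "Disease"),   # correct
    (70, 75, "Chemical"),  # spurious
}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under exact matching, a span with the right offsets but the wrong type is both a false positive and a false negative, which is one reason benchmark-specific annotation conventions can affect scores.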

