Refining and Reusing Annotation Guidelines for LLM Annotation

About

While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning-optimized models, and (3) the viability of moderation under minimal supervision. Testing on three biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows promise for refining guidelines, our analysis also reveals substantial room for improvement.

Kon Woo Kim, Jin-Dong Kim, Akiko Aizawa • 2026

Related benchmarks

Task                      Dataset   Result            Rank
Named Entity Recognition  BC5CDR    F1 Score: 86      102
Named Entity Recognition  BioRED    F1 Score (%): 82  19
Entity Annotation         NCBI 791  Precision: 82     9
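
For readers unfamiliar with the metrics listed above, the following is a minimal sketch of how span-level precision, recall, and F1 are commonly computed for NER. It assumes exact matching on span boundaries and entity type, which this page does not specify; the function name `prf1`, the `Span` tuple layout, and the example spans are illustrative only and are not taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


def prf1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans.

    Assumption: a prediction counts as correct only if its boundaries
    and its type both match a gold span exactly.
    """
    tp = len(gold & pred)  # true positives: spans present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up spans, for illustration only.
gold = {(0, 12, "Disease"), (20, 27, "Chemical"), (40, 48, "Disease")}
pred = {
    (0, 12, "Disease"),
    (20, 27, "Disease"),   # right boundaries, wrong type: not a match
    (40, 48, "Disease"),
    (60, 66, "Chemical"),  # spurious span
}

p, r, f = prf1(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under these assumptions, two of the four predicted spans are correct, so precision is 2/4 and recall is 2/3. Benchmarks sometimes use relaxed (overlap-based) matching instead, which would change the counts.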