
Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest

About

Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, pre-training data for information extraction (IE), such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction as extraction of tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, from 102.6M extractive data instances converted from LLMs' pre-training and post-training data. In the few-shot setting, Cuckoo adapts effectively to both traditional and complex instruction-following IE, outperforming existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with ongoing advances in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.

Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang • 2025
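
The abstract's idea of turning next-token prediction into extraction can be illustrated with a small sketch. This is only one plausible reading of the abstract, not the authors' data pipeline: it assumes whitespace tokenization, a hypothetical helper `nte_instance`, and a hypothetical `max_span` limit. Given a context and the tokens that follow it, the sketch tags the spans in the context that match the upcoming tokens with B/I labels and everything else with O.

```python
from typing import List, Tuple


def find_occurrences(context: List[str], span: List[str]) -> List[int]:
    """Return every start index where `span` occurs verbatim in `context`."""
    n = len(span)
    return [i for i in range(len(context) - n + 1) if context[i:i + n] == span]


def nte_instance(tokens: List[str], split: int,
                 max_span: int = 4) -> Tuple[List[str], List[str]]:
    """Build one illustrative NTE example (hypothetical helper).

    tokens[:split] is the context. The longest prefix of tokens[split:]
    (at most `max_span` tokens, an assumed limit) that already appears in
    the context is the extraction target and is tagged B/I in the context.
    """
    context, future = tokens[:split], tokens[split:]
    tags = ["O"] * len(context)
    # Try the longest candidate span first, then shorter ones.
    for n in range(min(max_span, len(future)), 0, -1):
        starts = find_occurrences(context, future[:n])
        if not starts:
            continue
        for s in starts:
            # Skip occurrences that overlap an already tagged span.
            if all(tag == "O" for tag in tags[s:s + n]):
                tags[s] = "B"
                for j in range(s + 1, s + n):
                    tags[j] = "I"
        break
    return context, tags


# Toy text: the answer "New York" already appears earlier in the context.
tokens = "Tom lives in New York . Q : where does Tom live ? A : New York".split()
context, tags = nte_instance(tokens, split=15)
extracted = [tok for tok, tag in zip(context, tags) if tag != "O"]
print(extracted)  # ['New', 'York']
```

If the upcoming tokens never appear in the context, every tag stays O, so such positions carry no extraction target in this sketch.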

Related benchmarks

Task                           Dataset                                         Result                    Rank
Named Entity Recognition       CoNLL 03                                        --                        102
Named Entity Recognition       MIT Restaurant                                  Micro-F1 70.3             50
Extractive Question Answering  SQuAD 2.0                                       F1 Score 69.41            34
Relation Extraction            CoNLL 04                                        F1 70.47                  24
Named Entity Recognition       MIT Movie                                       Entity F1 67.23           22
Relation Extraction            ADE                                             Relation Strict F1 76.05  20
Machine Reading Comprehension  Instruction-following IE Preference (test)      F1 Score 70.95            12
Named Entity Recognition       Instruction-following IE Disambiguation (test)  F1 Score 37.75            12
Named Entity Recognition       Instruction-following IE Miscellaneous (test)   F1 Score 51.86            12
Named Entity Recognition       BioNLP 2004                                     F1 Score 58.39            12
Showing 10 of 12 rows

