Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest

About

Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, with 102.6M extractive data converted from LLM's pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
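
The idea of turning next-token prediction into extraction can be illustrated with a small sketch. This is not the authors' conversion pipeline: it assumes whitespace-tokenized input, a hypothetical helper named nte_instances, and an arbitrary cap of three tokens per span. For each prefix of a text, it checks whether the upcoming tokens already occur in that prefix and, if so, BIO-tags those earlier occurrences as the extraction target.

```python
from typing import List, Tuple


def nte_instances(tokens: List[str], max_span: int = 3) -> List[Tuple[List[str], List[str]]]:
    """Illustrative only (hypothetical helper, not from the paper's code).

    For every prefix of `tokens`, look for the longest upcoming span
    (at most `max_span` tokens, an assumed cap) that already appears in
    the prefix, and emit the prefix with BIO tags marking that span.
    """
    instances = []
    for i in range(1, len(tokens)):
        context = tokens[:i]
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_span, len(tokens) - i), 0, -1):
            target = tokens[i:i + n]
            tags = ["O"] * len(context)
            last_end = 0
            found = False
            for s in range(len(context) - n + 1):
                # Skip overlapping matches so each token gets one tag.
                if s >= last_end and context[s:s + n] == target:
                    tags[s] = "B"
                    for k in range(s + 1, s + n):
                        tags[k] = "I"
                    last_end = s + n
                    found = True
            if found:
                instances.append((context, tags))
                break
    return instances


if __name__ == "__main__":
    text = "Tom likes Paris . Tom likes tea".split()
    for context, tags in nte_instances(text):
        print(list(zip(context, tags)))
```

Prefixes whose next tokens do not appear in the context yield no instance in this sketch; how such cases are handled at scale is not specified in the abstract.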

Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang • 2025

Related benchmarks

Task	Dataset	Result
Named Entity Recognition	CoNLL 03	--	135
Named Entity Recognition	MIT Movie	Entity F1 67.23	71
Relation Extraction	CoNLL 04	F1 70.47	59
Named Entity Recognition	MIT Restaurant	Micro-F1 70.3	57
Extractive Question Answering	SQuAD 2.0	F1 Score 69.41	34
Relation Extraction	ADE	Relation Strict F1 76.05	20
Machine Reading Comprehension	Instruction-following IE Preference (test)	F1 Score 70.95	12
Named Entity Recognition	Instruction-following IE Disambiguation (test)	F1 Score 37.75	12
Named Entity Recognition	Instruction-following IE Miscellaneous (test)	F1 Score 51.86	12
Named Entity Recognition	BioNLP 2004	F1 Score 58.39	12

Showing 10 of 12 rows
