Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
About
Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, pre-training data for information extraction (IE), such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction as extraction of tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, from 102.6M extractive data instances converted from LLM pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE, with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
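To make the reframing concrete, here is a toy sketch of how one next-token prediction example might be turned into an extraction example. It is an illustration under stated assumptions, not the authors' conversion pipeline: the function name `next_tokens_extraction`, the whitespace tokenization, the longest-match rule, and the plain `B`/`I`/`O` tag set are all hypothetical choices made for this example. The idea is that when the tokens following a prefix already occur in that prefix, the target can be expressed as BIO tags over the context instead of as generated tokens.

```python
from typing import List, Tuple


def next_tokens_extraction(tokens: List[str], split: int) -> Tuple[List[str], List[str]]:
    """Toy conversion (hypothetical helper, not the paper's code).

    tokens[:split] is the context; tokens[split:] is the continuation.
    Returns the longest prefix of the continuation that also occurs in the
    context, plus BIO tags over the context marking where it occurs.
    """
    context, continuation = tokens[:split], tokens[split:]

    # Longest prefix of the continuation that appears contiguously in the context.
    best_len = 0
    for length in range(len(continuation), 0, -1):
        target = continuation[:length]
        if any(context[i:i + length] == target
               for i in range(len(context) - length + 1)):
            best_len = length
            break

    tags = ["O"] * len(context)
    if best_len == 0:
        # Nothing to extract: the next tokens are not present in the context.
        return [], tags

    target = continuation[:best_len]
    for i in range(len(context) - best_len + 1):
        # Tag each non-overlapping occurrence of the target span.
        if context[i:i + best_len] == target and all(
                t == "O" for t in tags[i:i + best_len]):
            tags[i] = "B"
            for j in range(i + 1, i + best_len):
                tags[j] = "I"
    return target, tags


# Assumed example sentence, tokenized on whitespace.
example = "Marie Curie won the Nobel Prize . Who won ? Marie Curie".split()
span, bio = next_tokens_extraction(example, split=10)
print(span)  # ['Marie', 'Curie']
print(bio)   # ['B', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```

In this sketch the supervision signal is a tag sequence over existing tokens, which is why raw LLM text can double as IE training data whenever a continuation repeats something in its own context.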
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Named Entity Recognition | CoNLL 03 | -- | 102 |
| Named Entity Recognition | MIT Restaurant | Micro-F1 70.3 | 50 |
| Extractive Question Answering | SQuAD 2.0 | F1 Score 69.41 | 34 |
| Relation Extraction | CoNLL 04 | F1 70.47 | 24 |
| Named Entity Recognition | MIT Movie | Entity F1 67.23 | 22 |
| Relation Extraction | ADE | Relation Strict F1 76.05 | 20 |
| Machine Reading Comprehension | Instruction-following IE Preference (test) | F1 Score 70.95 | 12 |
| Named Entity Recognition | Instruction-following IE Disambiguation (test) | F1 Score 37.75 | 12 |
| Named Entity Recognition | Instruction-following IE Miscellaneous (test) | F1 Score 51.86 | 12 |
| Named Entity Recognition | BioNLP 2004 | F1 Score 58.39 | 12 |