On Generalization in Coreference Resolution
About
While coreference resolution is defined independently of dataset domain, most coreference models do not transfer well to unseen domains. We consolidate a set of 8 coreference resolution datasets targeting different domains to evaluate the off-the-shelf performance of trained models. We then mix three of these datasets for training: even though their domains, annotation guidelines, and metadata differ, we propose a method for jointly training a single model on this heterogeneous data mixture, using data augmentation to account for annotation differences and sampling to balance the data quantities. We find that in a zero-shot setting, models trained on a single dataset transfer poorly, while joint training yields improved overall performance and better generalization. This work contributes a new benchmark for robust coreference resolution and multiple new state-of-the-art results.
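One common way to "balance the data quantities" when mixing datasets of very different sizes is to sample uniformly over datasets rather than over examples, so small corpora are not drowned out by large ones. The sketch below illustrates that idea only; the paper's exact sampling strategy may differ, and the corpus names/sizes here are toy stand-ins.

```python
import random

def balanced_samples(datasets, num_samples, seed=0):
    """Draw training examples so each dataset is equally likely to be
    picked on every draw, regardless of its size (one common balancing
    scheme; not necessarily the paper's exact method)."""
    rng = random.Random(seed)
    names = list(datasets)
    samples = []
    for _ in range(num_samples):
        name = rng.choice(names)  # uniform over datasets, not over examples
        samples.append((name, rng.choice(datasets[name])))
    return samples

# Toy stand-ins for three differently sized corpora:
corpora = {
    "OntoNotes": list(range(1000)),
    "LitBank": list(range(100)),
    "PreCo": list(range(5000)),
}
sample = balanced_samples(corpora, 3000)
```

Despite PreCo being 50x larger than LitBank here, each corpus contributes roughly a third of the drawn examples.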
Related benchmarks
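Most rows in the table below report CoNLL F1, which is the unweighted average of the MUC, B³, and CEAFφ4 F1 scores. A minimal sketch of that average (the per-metric scores in the usage line are purely illustrative):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def conll_f1(muc_f1, b_cubed_f1, ceaf_f1):
    """CoNLL F1: unweighted mean of the MUC, B-cubed, and CEAF-phi4 F1 scores."""
    return (muc_f1 + b_cubed_f1 + ceaf_f1) / 3

# Illustrative (made-up) per-metric scores:
print(conll_f1(80.0, 70.0, 66.0))  # → 72.0
```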
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Coreference Resolution | WSC | Accuracy | 62.7 | 96 |
| Coreference Resolution | OntoNotes | MUC | 85.3 | 23 |
| Coreference Resolution | WikiCoref (WC) (test) | Average F1 | 60.1 | 12 |
| Coreference Resolution | LitBank (test) | Avg. F1 | 76.5 | 10 |
| Coreference Resolution | OntoNotes (ON) | CoNLL F1 | 80.6 | 8 |
| Coreference Resolution | LitBank LB₀ | CoNLL F1 | 78.2 | 8 |
| Coreference Resolution | PreCo (PC) | CoNLL F1 | 87.8 | 8 |
| Coreference Resolution | Character Identification (CI) | CoNLL F1 | 59.5 | 8 |
| Coreference Resolution | WikiCoref (WC) | CoNLL F1 | 62.5 | 8 |
| Coreference Resolution | Quizbowl Coreference (QBC) | CoNLL F1 | 50.5 | 8 |