
GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text

About

Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, there are few existing resources providing large amounts of standardized, easily accessible IGT data, limiting its applicability to linguistic research and making it difficult to use in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of labels across languages. Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%. Our pretrained model and dataset are available on Hugging Face.
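To make the format concrete, here is a minimal sketch of what a single IGT example might look like as data, followed by a hedged example of loading the released corpus with the Hugging Face `datasets` library. The example sentence, the field names, and the repository ID `lecslab/glosslm-corpus` are assumptions for illustration; check the authors' Hugging Face page for the actual identifiers and schema.

```python
# A single IGT example as a Python dict (illustrative; the field names
# in the released corpus may differ).
igt_example = {
    "transcription": "los perro-s corr-en",  # morpheme-segmented text
    "glosses": "DET.PL dog-PL run-3PL",      # one gloss label per morpheme
    "translation": "the dogs run",           # free translation
    "language": "spa",                       # ISO 639-3 code
}

# Loading the corpus from Hugging Face. The repository ID below is an
# assumption; substitute the ID published by the authors.
from datasets import load_dataset

corpus = load_dataset("lecslab/glosslm-corpus", split="train")
print(corpus[0])
```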

Michael Ginn, Lindia Tjuatja, Taiqi He, Enora Rice, Graham Neubig, Alexis Palmer, Lori Levin ((1) University of Colorado, (2) Carnegie Mellon University) • 2024

Related benchmarks

Task     | Dataset                      | Result                           | Rank
Glossing | Multilingual held-out (test) | Morpheme Error Rate (arp): 0.161 | 5
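The benchmark's metric, Morpheme Error Rate, scores a predicted gloss line against the gold line label by label. Below is a minimal sketch of one common formulation: edit distance over morpheme-level gloss tokens, normalized by the gold length. The paper's actual evaluation script may differ, so treat this as illustrative.

```python
def morpheme_error_rate(predicted: str, gold: str) -> float:
    """Levenshtein distance over morpheme-level gloss tokens,
    normalized by the number of gold morphemes. A sketch of one
    common formulation; the official scorer may differ."""
    # Split a gloss line into per-morpheme labels: words are separated
    # by spaces, morphemes within a word by hyphens.
    def tokens(line: str) -> list[str]:
        return [m for word in line.split() for m in word.split("-")]

    pred, ref = tokens(predicted), tokens(gold)
    # Standard dynamic-programming edit distance over the token lists.
    dp = list(range(len(ref) + 1))
    for i, p in enumerate(pred, start=1):
        prev, dp[0] = dp[0], i
        for j, r in enumerate(ref, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,         # deletion
                dp[j - 1] + 1,     # insertion
                prev + (p != r),   # substitution (free if labels match)
            )
    return dp[len(ref)] / max(len(ref), 1)

# e.g. one wrong label out of five gold morphemes -> MER of 0.2
print(morpheme_error_rate("DET.PL dog-SG run-3PL", "DET.PL dog-PL run-3PL"))
```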
