Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training

About

Prior work on Data-To-Text Generation, the task of converting knowledge graph (KG) triples into natural text, focused on domain-specific benchmark datasets. In this paper, however, we verbalize the entire English Wikidata KG and discuss the unique challenges associated with broad, open-domain, large-scale verbalization. We further show that verbalizing a comprehensive, encyclopedic KG like Wikidata can be used to integrate structured KGs and natural language corpora. In contrast to the many architectures that have been developed to integrate these two sources, our approach converts the KG into natural text, allowing it to be seamlessly integrated into existing language models. It carries the further advantages of improved factual accuracy and reduced toxicity in the resulting language model. We evaluate this approach by augmenting the retrieval corpus in a retrieval language model and showing significant improvements on the knowledge-intensive tasks of open-domain QA and the LAMA knowledge probe.
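
As a rough illustration of the idea in the abstract (turn KG triples into text, then add that text to a retrieval corpus), here is a minimal sketch. It uses a naive string template as a stand-in and is not the paper's verbalization method. The function names `group_by_subject`, `verbalize`, and `augment_corpus`, and the sample triples, are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def group_by_subject(triples):
    """Group (subject, relation, object) triples by their subject entity."""
    grouped = defaultdict(list)
    for subj, rel, obj in triples:
        grouped[subj].append((rel, obj))
    return dict(grouped)


def verbalize(subject, facts):
    """Naive template stand-in: join a subject's facts into one text line.

    Assumption: a real system would generate fluent sentences here; this
    template only shows where that step sits in the pipeline.
    """
    clauses = [f"{rel} {obj}" for rel, obj in facts]
    return f"{subject}: " + "; ".join(clauses) + "."


def augment_corpus(corpus, triples):
    """Return the original passages followed by one passage per subject."""
    passages = [verbalize(s, f) for s, f in group_by_subject(triples).items()]
    return list(corpus) + passages


# Illustrative triples (hypothetical sample data, not taken from the paper).
triples = [
    ("Douglas Adams", "occupation", "writer"),
    ("Douglas Adams", "place of birth", "Cambridge"),
    ("Cambridge", "country", "United Kingdom"),
]
corpus = augment_corpus(["Existing text passage."], triples)
for passage in corpus:
    print(passage)
```

Because the KG ends up as plain text passages, a retriever can index them alongside ordinary documents without any change to the model architecture, which is the integration benefit the abstract describes.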

Oshin Agarwal, Heming Ge, Siamak Shakeri, Rami Al-Rfou • 2020

Related benchmarks

Task	Dataset	Result
Open Question Answering	Natural Questions (NQ) (test)	Exact Match (EM) 41.5	134
Open-domain Question Answering	WebQuestions (WebQ) (test)	Exact Match (EM) 43.9	55
Roll call vote prediction	Roll call vote prediction (Random)	BAcc 89.13	27
Roll call vote prediction	Roll call vote prediction (Time-Based)	Balanced Accuracy 90.8	26
Misinformation Detection	SLN (test)	Micro F1 84.11	26
Political perspective detection	SemEval	Accuracy 86.4	17
Political perspective detection	Allsides	Accuracy 80.71	17
Misinformation Detection	LUN	Macro F1 57.3	17
Political perspective detection	SemEval (test)	Accuracy 0.864	9
Political perspective detection	Allsides (test)	Accuracy 80.71	9

Showing 10 of 12 rows
