
Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions

About

While Large Language Models (LLMs) acquire vast knowledge during pre-training, they often lack domain-specific, new, or niche information. Continual pre-training (CPT) attempts to address this gap but suffers from catastrophic forgetting and inefficiencies in low-data regimes. We introduce Knowledge-Instruct, a novel approach to efficiently inject knowledge from limited corpora through pure instruction-tuning. By generating information-dense synthetic instruction data, it effectively integrates new knowledge while preserving general reasoning and instruction-following abilities. Knowledge-Instruct demonstrates superior factual memorization, minimizes catastrophic forgetting, and remains scalable by leveraging synthetic data from relatively small language models. Additionally, it enhances contextual understanding, including complex multi-hop reasoning, facilitating integration with retrieval systems. We validate its effectiveness across diverse benchmarks, including Companies, a new dataset that we release to measure knowledge injection capabilities.

Oded Ovadia, Meni Brief, Rachel Lemberg, Eitam Sheetrit • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Question Answering | Pop-QA Cities-20 | BLEU-1: 24.2 | 10 |
| Question Answering | Wikitext-10 | BLEU-1: 0.277 | 10 |
| Question Answering | SQuAD 2.0 | BLEU-1: 3.6 | 10 |
| Question Answering Evaluation | Pop-QA Cities-20 | Factual Accuracy: 1.54 | 10 |
| Question Answering Evaluation | SQuAD 2.0 | Factual Accuracy: 1.4 | 10 |
| Question Answering Evaluation | Wikitext-10 | Factual Accuracy: 1.2 | 10 |
| Legal Reasoning | LegalBench CUAD Cardlytics Buffalo Wild Wings PF Hospitality 2023 | Accuracy (Cardl): 78.6 | 6 |
