X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions
About
Large language models respond well in high-resource languages like English but struggle in low-resource languages. This gap may arise from the lack of high-quality instruction-following data in these languages. Directly translating English samples into these languages is one possible solution, but it is unreliable, leading to responses that contain translation errors and lack language-specific or cultural knowledge. To address this issue, we propose a novel method to construct cross-lingual instruction-following samples with the instruction in English and the response in a low-resource language. Specifically, the language model first learns to generate appropriate English instructions for natural web texts in other languages, which serve as the responses. The candidate cross-lingual instruction tuning samples are then refined and diversified. We have employed this method to build a large-scale cross-lingual instruction tuning dataset covering 10 languages, named X-Instruction. The instruction data built using our method incorporate more language-specific knowledge than data produced by naive translation. Experimental results show that the response quality of the model tuned on X-Instruction greatly exceeds that of the model distilled from a powerful teacher model, reaching or even surpassing that of ChatGPT. In addition, we find that models tuned on cross-lingual instruction-following samples can follow instructions in the output language without further tuning.
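The construction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `CrossLingualSample`, `build_candidates`, `refine`, and `diversify` are hypothetical, the instruction generator and quality scorer are passed in as plain callables standing in for the tuned language model, and the de-duplication step is only a simple stand-in for whatever diversification the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CrossLingualSample:
    """One cross-lingual pair: English instruction, target-language response."""
    instruction_en: str  # English instruction generated for the text
    response: str        # natural web text in the low-resource language
    lang: str            # language code of the response (hypothetical field)


def build_candidates(
    web_texts: Iterable[str],
    lang: str,
    generate_instruction: Callable[[str], str],
) -> List[CrossLingualSample]:
    # Treat each web text as a response and ask the (assumed) generator
    # for an English instruction that this text would answer.
    return [CrossLingualSample(generate_instruction(t), t, lang) for t in web_texts]


def refine(
    samples: Iterable[CrossLingualSample],
    score: Callable[[CrossLingualSample], float],
    threshold: float,
) -> List[CrossLingualSample]:
    # Keep only candidates whose (assumed) quality score reaches the threshold.
    return [s for s in samples if score(s) >= threshold]


def diversify(samples: Iterable[CrossLingualSample]) -> List[CrossLingualSample]:
    # Stand-in for diversification: drop samples whose instruction repeats
    # an earlier one after case and whitespace normalization.
    seen = set()
    kept = []
    for s in samples:
        key = " ".join(s.instruction_en.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept
```

In practice `generate_instruction` and `score` would wrap model calls; keeping them as injected callables lets the data flow be checked with toy functions before any model is involved.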
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Inference | XNLI | -- | -- | 111 |
| Commonsense Reasoning | XStoryCloze | Average Score | 54 | 32 |
| Causal Reasoning | XCOPA (test) | Accuracy (id) | 62 | 13 |
| Instruction Following | Vicuna & WizardLM (Finnish) | Win Rate (vs ChatGPT) | 47 | 9 |
| Instruction Following | Vicuna & WizardLM (Indonesian) | Win Rate (vs ChatGPT) | 50.3 | 9 |
| Instruction Following | Vicuna & WizardLM (Thai) | Win Rate (vs ChatGPT) | 53 | 9 |
| Instruction Following | Vicuna & WizardLM (Turkish) | Win Rate (vs ChatGPT) | 53.7 | 9 |
| Instruction Following | Vicuna & WizardLM (Vietnamese) | Win Rate (vs ChatGPT) | 57 | 9 |
| Instruction Following | Vicuna & WizardLM (Bengali) | Win Rate (vs ChatGPT) | 68.8 | 9 |
| Instruction Following | Vicuna & WizardLM (Hindi) | Win Rate (vs ChatGPT) | 65.8 | 9 |