Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages
About
Large language models (LLMs) continue to struggle with low-resource languages, primarily due to limited training data, translation noise, and unstable cross-lingual alignment. To address these challenges, we propose LiRA (Linguistic Robust Anchoring for LLMs), a plug-and-play framework that requires only lightweight fine-tuning on top of existing pretrained backbones. LiRA jointly optimizes representation stability and cross-lingual semantic consistency by combining two key components: Arca (Anchored Representation Composition Architecture), which aligns low-resource inputs to a shared English semantic space through anchor-based alignment and collaborative encoding; and LaSR (Language-coupled Semantic Reasoner), a lightweight, language-aware head that enforces consistency regularization for unified cross-lingual understanding, retrieval, and reasoning. We show theoretically that, under controlled anchoring error and translation-induced bias and assuming local Lipschitz continuity, LiRA guarantees bounded representation deviation and stable downstream performance. To facilitate research, we release a new multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Extensive experiments across diverse low-resource benchmarks demonstrate consistent improvements in retrieval, ranking, question answering, and reasoning tasks. Code will be publicly available on GitHub, and the dataset will be hosted on Hugging Face.
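The abstract does not state the theoretical guarantee itself, so the following is only a generic sketch of how a bound of this kind typically follows from the triangle inequality and local Lipschitz continuity. All symbols are assumptions introduced here for illustration and are not taken from the paper: $z_x$ is the representation of a low-resource input, $z_t$ the representation of its English translation, $z_e$ the representation of the intended English meaning, $\varepsilon_a$ and $\varepsilon_t$ are bounds on anchoring error and translation-induced bias, and $g$ is a downstream head that is $L$-Lipschitz on a neighborhood containing $z_x$ and $z_e$.

```latex
% Illustrative only: symbols z_x, z_t, z_e, \varepsilon_a, \varepsilon_t, g, L are assumptions.
\begin{align*}
\|z_x - z_t\| \le \varepsilon_a, \quad \|z_t - z_e\| \le \varepsilon_t
  &\;\Longrightarrow\; \|z_x - z_e\| \le \|z_x - z_t\| + \|z_t - z_e\| \le \varepsilon_a + \varepsilon_t, \\
\|g(z_x) - g(z_e)\| &\le L\,\|z_x - z_e\| \le L\,(\varepsilon_a + \varepsilon_t).
\end{align*}
```

Read this way, keeping both error terms small bounds the representation deviation, and the Lipschitz constant converts that deviation into a bound on the change in downstream output. The paper's actual statement and constants may differ.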
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MGSM (test) | Accuracy (ZH) | 70.3 | 80 |
| Retrieval | BelebeleRetrieval | nDCG@10 | 87.03 | 26 |
| Retrieval | LazRetrieval | BD Retrieval Score | 66.3 | 16 |
| Information Retrieval | MLQA Retrieval | nDCG@10 | 82.01 | 14 |
| Semantic Textual Similarity | STS22 | Pearson Correlation | 76.55 | 14 |
| Multilingual Commonsense Reasoning | X-CSQA | Accuracy (SW) | 40.8 | 10 |
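
Two of the retrieval rows report nDCG@10. As a reminder of what that metric measures, here is a minimal sketch of nDCG@k for a single query. It assumes the linear-gain variant (gain equal to the relevance label) with a log2 position discount; the benchmarks above may use a different gain function or tooling, and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain, assumed)."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """nDCG@k: DCG of the ranking as returned, divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0.0:
        # No relevant documents for this query: define the score as 0.
        return 0.0
    return dcg_at_k(relevances, k) / ideal


if __name__ == "__main__":
    # Relevance labels of retrieved documents, in the order the system ranked them.
    ranked_relevances = [0, 1, 0, 1]
    print(ndcg_at_k(ranked_relevances, k=10))
```

The function returns a value in [0, 1]; the table entries appear to be on a 0 to 100 scale, which would correspond to averaging this score over queries and multiplying by 100.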