
From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment

About

Large language models (LLMs) have demonstrated remarkable multilingual capabilities; however, how to evaluate their cross-lingual alignment remains underexplored. Existing alignment benchmarks primarily rely on sentence embeddings, but prior research has shown that neural models tend to induce a non-smooth representation space, which impairs semantic alignment evaluation on low-resource languages. Inspired by neuroscientific findings that similar information activates overlapping neuronal regions, we propose Neuron State-Based Cross-Lingual Alignment (NeuronXA), a more semantically grounded approach to assessing the cross-lingual alignment capabilities of LLMs. We evaluate NeuronXA on several prominent multilingual LLMs (LLaMA, Qwen, Mistral, GLM, and OLMo) across two transfer tasks and three multilingual benchmarks. The results show that with only 100 parallel sentence pairs, NeuronXA achieves a Pearson correlation of 0.9556 with downstream task performance and 0.8514 with transferability. These findings demonstrate NeuronXA's effectiveness in assessing both cross-lingual alignment and transferability, even with a small dataset, and highlight its potential to advance cross-lingual alignment research and improve the semantic understanding of multilingual LLMs.

Chongxuan Huang, Yongshi Ye, Biao Fu, Qifeng Su, Xiaodong Shi • 2025
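To make the idea concrete, here is a minimal sketch of a neuron-state-based alignment score in the spirit of NeuronXA. It binarizes sentence-level activations (a neuron is "active" above a threshold) and averages the cosine similarity of the resulting state vectors over parallel source/target pairs. The function names, the zero threshold, and the random toy activations are illustrative assumptions; the paper's exact neuron-state definition and aggregation may differ.

```python
import math
import random

def neuron_states(activations, threshold=0.0):
    """Binarize activations: a neuron is 'active' if it exceeds the threshold.
    (Assumed binarization rule; the paper's definition may differ.)"""
    return [[1.0 if a > threshold else 0.0 for a in row] for row in activations]

def neuronxa_score(acts_src, acts_tgt, threshold=0.0):
    """Alignment score: mean cosine similarity between the binary
    neuron-state vectors of each parallel source/target sentence pair."""
    states_s = neuron_states(acts_src, threshold)
    states_t = neuron_states(acts_tgt, threshold)
    sims = []
    for s, t in zip(states_s, states_t):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(s)) * math.sqrt(sum(t))
        sims.append(dot / norm if norm else 0.0)
    return sum(sims) / len(sims)

# Toy stand-in for sentence-level LLM activations: 100 parallel pairs and
# 512 hypothetical neurons, with target activations correlated to the source
# to mimic semantically aligned representations.
rng = random.Random(0)
src = [[rng.gauss(0, 1) for _ in range(512)] for _ in range(100)]
tgt = [[0.8 * a + 0.2 * rng.gauss(0, 1) for a in row] for row in src]
score = neuronxa_score(src, tgt)  # close to 1.0 for well-aligned pairs
```

A score near 1.0 indicates that parallel sentences activate largely overlapping neuron sets, which is the property the benchmark correlates with downstream performance and transferability.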

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Cross-lingual Alignment Correlation | m-ARC FLORES (test) | Pearson Correlation | 0.9867 | 81 |
| Cross-lingual Alignment Correlation | m-MMLU FLORES (test) | Pearson Correlation | 0.9859 | 81 |
| Cross-lingual Alignment Correlation | Belebele FLORES (test) | Pearson Correlation | 0.9796 | 81 |
| Zero-Shot Cross-Lingual Transfer | XNLI | Pearson Correlation | 0.9639 | 48 |
| Cross-Lingual Knowledge Alignment | BMLAMA | Pearson Correlation | 0.9062 | 48 |
| Pearson correlation analysis | m-ARC | Pearson Correlation | 0.9847 | 13 |
| Downstream task performance correlation | MARC, MMLU, and Belebele (test) | Avg Pearson Correlation | 0.9621 | 8 |
| Zero-Shot Cross-Lingual Transfer | XNLI (test) | Pearson Correlation | 0.9377 | 8 |
| Cross-lingual transferability | FLORES | Avg Pearson Correlation | 0.8597 | 6 |
| Multilingual performance | FLORES | Avg Pearson Correlation | 0.9541 | 6 |
Showing 10 of 14 rows
