Cross-lingual Similarity of Multilingual Representations Revisited
About
Related work has used indexes such as CKA and variants of CCA to measure the similarity of cross-lingual representations in multilingual language models. In this paper, we argue that the assumptions of CKA/CCA align poorly with one of the motivating goals of cross-lingual learning analysis, namely explaining zero-shot cross-lingual transfer. We highlight the valuable aspects of cross-lingual similarity that these indexes fail to capture and provide a motivating case study *demonstrating the problem empirically*. We then introduce *Average Neuron-Wise Correlation (ANC)* as a straightforward alternative that avoids the difficulties of CKA/CCA and is well suited to the cross-lingual setting. Finally, we use ANC to show that the previously reported "first align, then predict" pattern occurs not only in masked language models (MLMs) but also in multilingual models trained with *causal language modeling* objectives (CLMs). Moreover, we show that the pattern extends to *scaled versions* of the MLMs and CLMs (up to 85x the size of the original mBERT). Our code is publicly available at https://github.com/TartuNLP/xsim.
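The core idea behind ANC can be sketched in a few lines: given representations of parallel sentences in two languages, compute the Pearson correlation of each neuron (dimension) with its counterpart and average across neurons. This is a minimal NumPy sketch under that assumption (matching neuron indices, no absolute value or permutation step); refer to the paper and the linked repository for the exact definition.

```python
import numpy as np

def average_neuronwise_correlation(X, Y):
    """Sketch of Average Neuron-Wise Correlation (ANC).

    X, Y: arrays of shape (n_sentences, n_neurons) holding hidden
    representations of the same parallel sentences in two languages.
    Assumes neuron i in X corresponds to neuron i in Y.
    Returns the mean per-neuron Pearson correlation.
    """
    # Center each neuron (column) over the sentence axis.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Per-neuron Pearson correlation: covariance / (std_x * std_y).
    num = (Xc * Yc).sum(axis=0)
    den = np.sqrt((Xc ** 2).sum(axis=0) * (Yc ** 2).sum(axis=0))
    r = num / den
    # Average across neurons to obtain a single similarity score.
    return float(r.mean())
```

Computing this score per layer, between a source and a target language, yields the layer-wise similarity curves used to observe the "first align, then predict" pattern.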
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Cross-Lingual Knowledge Alignment | BMLAMA | Pearson Correlation | 0.9156 | 48 |
| Zero-Shot Cross-Lingual Transfer | XNLI | Pearson Correlation | 0.9082 | 48 |
| Pearson correlation analysis | m-ARC | Pearson Correlation | 0.9683 | 13 |
| Cross-lingual transferability | FLORES | Avg. Pearson Correlation | 0.8537 | 6 |
| Multilingual performance | FLORES | Avg. Pearson Correlation | 0.9313 | 6 |
| Pearson correlation analysis | m-MMLU | Pearson Correlation (r) | 0.976 | 6 |