Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders
About
Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally, or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via centered kernel alignment (CKA), token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain only loosely connected to either language, and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection. These results show that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
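As a rough illustration of the CKA probe mentioned above, the following is a minimal sketch of linear CKA between two sets of sentence representations. The abstract does not say which CKA variant, layer, or pooling strategy is used, so the linear kernel, the function name `linear_cka`, and the random stand-in embeddings are all assumptions made for illustration only.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representation matrices.

    x: (n_sentences, d1) embeddings for one language (e.g. English).
    y: (n_sentences, d2) embeddings for the parallel sentences in another
       language (e.g. code-mixed). Rows must be aligned sentence by sentence.
    Assumption: linear kernel; the paper may use a different variant.
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must contain the same number of sentences")

    # Center each feature dimension so the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    # Hypothetical stand-ins for pooled encoder outputs on parallel sentences.
    rng = np.random.default_rng(0)
    en = rng.normal(size=(200, 32))
    cm = en @ rng.normal(size=(32, 32)) + 0.5 * rng.normal(size=(200, 32))
    print("EN vs CM linear CKA:", linear_cka(en, cm))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, which is one reason it is a common choice for comparing representations of parallel inputs across languages or across model checkpoints.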
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Hate speech classification | Code-mixed (CM) (test) | Macro F1 | 69.72 | 42 |
| Hate speech classification | English (EN) (test) | Macro F1 | 70.03 | 42 |
| Hate speech classification | Hindi (HI) (test) | Macro F1 | 71.96 | 42 |
| Hate speech classification | CM, EN, and HI (test) | Consistency | 0.6928 | 42 |
| Cross-lingual alignment | EN-HI-CM (test) | EN→CM alignment score | 86.4 | 24 |
| Sentiment classification | Code-mixed (CM) (test) | Macro F1 | 68.92 | 18 |
| Sentiment classification | Hindi (HI) (test) | Macro F1 | 70.44 | 18 |
| Sentiment classification | CM, EN, and HI (test) | Consistency | 66.18 | 18 |
| Sentiment classification | English (EN) (test) | Macro F1 | 73.16 | 18 |