Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders

About

Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet little is known about how they represent code-mixed inputs internally, or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via centered kernel alignment (CKA), token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain only loosely connected to either language, and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection. These results show that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
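The abstract names CKA as one of the alignment probes but does not say which variant is used. As a rough illustration only, the sketch below computes linear CKA between two matrices of sentence representations; the choice of the linear variant, the function name `linear_cka`, and the random arrays standing in for encoder outputs are all assumptions, not details taken from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (n_sentences, dim) representation matrices.

    Assumption: row i of x and row i of y encode the same parallel sentence
    (e.g. its English and its code-mixed version).
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("both matrices need one row per parallel sentence")
    # Center every feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-ins for pooled encoder outputs (64 sentences, 16 dims).
rng = np.random.default_rng(0)
en = rng.normal(size=(64, 16))
# A rotated and rescaled copy of `en`: same geometry, different basis.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated = 3.0 * en @ q
# Independent noise: no shared structure beyond chance.
unrelated = rng.normal(size=(64, 16))

score_rotated = linear_cka(en, rotated)
score_unrelated = linear_cka(en, unrelated)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `score_rotated` comes out at 1 up to floating-point error, while `score_unrelated` is lower. In a setup like the one described, the inputs would instead be layer-wise representations of the parallel English, Hindi, and code-mixed sentences.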

Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Jasabanta Patro • 2026

Related benchmarks

Task	Dataset	Result
Hate speech classification	Code-mixed (CM) (test)	Macro F1 69.72	42
Hate speech classification	English (EN) (test)	Macro F1 Score 70.03	42
Hate speech classification	Hindi (HI) (test)	Macro F1 71.96	42
Hate speech classification	CM, EN, and HI (test)	Consistency 0.6928	42
Cross-lingual Alignment	EN-HI-CM (test)	EN→CM Alignment Score 86.4	24
Sentiment Classification	Code-mixed (CM) (test)	Macro-F1 68.92	18
Sentiment Classification	Hindi (HI) (test)	Macro-F1 70.44	18
Sentiment Classification	CM, EN, and HI (test)	Consistency 66.18	18
Sentiment Classification	English (EN) (test)	Macro F1 73.16	18

