
Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders

About

Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally, or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language, and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection, showing that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
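
For readers unfamiliar with CKA (centered kernel alignment), the sketch below shows one common way to compare two sets of representations for the same parallel sentences. It assumes the linear variant of CKA and sentence-level embeddings stacked as rows; the abstract does not say which CKA variant or pooling the authors use, so this is an illustration and not their implementation. The function name `linear_cka` and the random stand-in matrices are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two representation matrices.

    x: array of shape (n_sentences, dim_x), e.g. English embeddings.
    y: array of shape (n_sentences, dim_y), e.g. code-mixed embeddings.
    Row i of x and row i of y must describe the same parallel sentence.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.ndim != 2 or y.ndim != 2 or x.shape[0] != y.shape[0]:
        raise ValueError("x and y must be 2-D with the same number of rows")

    # Center every feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    # Random stand-ins for pooled encoder outputs (assumption: 64-d vectors).
    rng = np.random.default_rng(0)
    english = rng.normal(size=(200, 64))
    code_mixed = english + 0.5 * rng.normal(size=(200, 64))
    unrelated = rng.normal(size=(200, 64))
    print("EN vs noisy copy:", linear_cka(english, code_mixed))
    print("EN vs unrelated: ", linear_cka(english, unrelated))
```

Linear CKA lies between 0 and 1 and is unchanged by rotating or uniformly rescaling either representation, which is why it is a convenient choice for comparing embeddings from different layers or models.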

Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Jasabanta Patro • 2026

Related benchmarks

Task | Dataset | Result | Rank
Hate speech classification | Code-mixed (CM) (test) | Macro F1: 69.72 | 42
Hate speech classification | English (en) (test) | Macro F1 Score: 70.03 | 42
Hate speech classification | Hindi (HI) (test) | Macro F1: 71.96 | 42
Hate speech classification | CM, EN, and HI (test) | Consistency: 0.6928 | 42
Cross-lingual Alignment | EN-HI-CM (test) | EN→CM Alignment Score: 86.4 | 24
Sentiment Classification | Code-mixed (CM) (test) | Macro-F1: 68.92 | 18
Sentiment Classification | Hindi (HI) (test) | Macro-F1: 70.44 | 18
Sentiment Classification | CM, EN, and HI (test) | Consistency: 66.18 | 18
Sentiment Classification | English (en) (test) | Macro F1: 73.16 | 18
