
MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

About

We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves state-of-the-art results on Catalan- and Spanish-specific tasks, while establishing robust performance across specialized biomedical and legal domains. To bridge the gap between research and production, we incorporate Matryoshka Representation Learning (MRL), enabling flexible vector sizing that significantly reduces inference and storage costs. Ultimately, the MrBERT family demonstrates that modern encoder architectures can be optimized for both localized linguistic excellence and efficient, high-stakes domain specialization. We open-source the complete model family on Hugging Face.
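The practical payoff of MRL is that a single trained encoder yields embeddings whose leading prefixes are themselves usable embeddings, so downstream systems can trade accuracy for storage and latency by truncating vectors. A minimal sketch of that consumption pattern is below; the function name and dimensions are illustrative assumptions, not APIs from the paper.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates of a Matryoshka-style embedding
    and re-normalize to unit length, the usual way MRL-trained encoders
    are served at reduced dimensionality. (Illustrative helper, not
    from the MrBERT release.)"""
    sub = vec[:dim]
    norm = np.linalg.norm(sub)
    return sub / norm if norm > 0 else sub

# Toy full-size embedding (e.g. 768-d). With MRL training, its leading
# prefixes remain meaningful as standalone representations.
rng = np.random.default_rng(0)
full = rng.normal(size=768)

small = truncate_embedding(full, 256)  # 3x smaller vectors to store
assert small.shape == (256,)
assert abs(np.linalg.norm(small) - 1.0) < 1e-9
```

Because cosine similarity only depends on direction, re-normalizing after truncation keeps the shortened vectors directly comparable in a vector index built at the smaller dimension.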

Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco, Aitor Gonzalez-Agirre, Marta Villegas • 2026

Related benchmarks

Task                                      | Dataset              | Metric           | Result | Rank
Information Retrieval                     | TREC-COVID           | nDCG@10          | 49.53  | 44
Cross-lingual Language Understanding     | XTREME               | XNLI Accuracy    | 81.26  | 43
Retrieval                                 | SCIDOCS              | nDCG@10          | 10.05  | 18
Natural Language Understanding (Catalan) | CLUB                 | AnCora-ca NER F1 | 88.04  | 7
Named Entity Recognition                  | bsc-bio-distemist ES | F1 Score         | 78.07  | 6
Named Entity Recognition                  | pharmaconer ES       | F1 Score         | 89.92  | 6
Retrieval                                 | AbSanitas ES         | nDCG@10          | 53.49  | 6
Named Entity Recognition                  | cantemist ES         | F1 Score         | 73.4   | 6
Retrieval                                 | R2Med EN             | nDCG@10          | 10.15  | 6
Spanish Language Understanding            | EvalES (test)        | UD-POS-es (F1)   | 99.08  | 6

(Showing 10 of 18 rows.)
