
Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain

About

This paper presents the Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions. (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that the optimal checkpoints reach their best retrieval scores before the pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) performing comparably to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency relative to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall while requiring fewer computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training a cost-effective alternative. (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. A four-phase CPT schedule with optimal sample ratios enables a gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves a 36.2% perplexity reduction on Turkish legal text, demonstrating the gains from domain adaptation.
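The checkpoint-selection finding above — that the checkpoint with the best downstream retrieval score can precede the pre-training loss minimum — can be sketched as follows. The checkpoint steps, losses, and retrieval scores below are illustrative stand-ins, not values from the paper:

```python
# Hypothetical sketch: select the pre-training checkpoint by downstream
# retrieval score rather than by minimum pre-training loss.
# All numbers are invented for illustration.
checkpoints = [
    {"step": 100_000, "pretrain_loss": 2.10, "retrieval_score": 44.1},
    {"step": 200_000, "pretrain_loss": 1.95, "retrieval_score": 47.5},
    {"step": 300_000, "pretrain_loss": 1.88, "retrieval_score": 46.9},  # lower loss, worse retrieval
]

# Naive selection: take the checkpoint with the lowest pre-training loss.
best_by_loss = min(checkpoints, key=lambda c: c["pretrain_loss"])

# Retrieval-aware selection: evaluate each checkpoint on a downstream
# retrieval benchmark and keep the best-scoring one.
best_by_retrieval = max(checkpoints, key=lambda c: c["retrieval_score"])

# The retrieval-optimal checkpoint comes earlier than the loss minimum.
print(best_by_retrieval["step"])  # 200000
print(best_by_loss["step"])       # 300000
```

The point of the sketch is only the selection criterion: periodically evaluating checkpoints on the downstream task and keeping the best one, instead of trusting the training loss alone.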

\"Ozg\"ur U\u{g}ur, Mahmut G\"oksu, Mahmut \c{C}imen, Musa Y{\i}lmaz, Esra \c{S}avirdi, Alp Talha Demir, Rumeysa G\"ull\"uce, \.Iclal \c{C}etin, \"Omer Can Sa\u{g}ba\c{s}• 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text Embedding | MTEB | MTEB Score: 56.43 | 45 |
| Text Embedding | MTEB Turkish (test) | Overall MTEB Score: 56.84 | 23 |
| Retrieval | Legal | Legal Score: 47.52 | 10 |
| Legal Retrieval | Turkish Legal | Legal Score: 47.52 | 9 |
| Masked Language Modeling | Turkish Datasets (blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal) (test) | MLM Avg (%): 67.25 | 7 |
| Information Retrieval | Turkish Legal Domain (test) | Legal Score: 0.4752 | 6 |
| Classification | Turkish Classification Datasets (6 datasets average) | CLF Score: 74.8 | 3 |
| General Language Understanding | Turkish Non-Retrieval Tasks (consolidated) | Overall Accuracy: 81.8 | 3 |
| Named Entity Recognition | Turkish NER (2 datasets average) | NER Score: 84.8 | 3 |
| Natural Language Inference | MedNLI Turkish | NLI Accuracy: 84.3 | 3 |

Showing 10 of 14 rows.

Other info

GitHub
