Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
About
This paper presents the Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions:

(1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that the optimal checkpoints reach their best retrieval scores before the pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) performing comparably to larger reference models (307M-567M parameters). Our approach reaches 92.36% production efficiency relative to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall while requiring fewer computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines; our single-stage pre-training followed by efficient post-training is a cost-effective alternative.

(2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. A four-phase CPT schedule with tuned sample ratios enables a gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves a 36.2% perplexity reduction on Turkish legal text, demonstrating clear domain adaptation gains.
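The retrieval-aware checkpoint selection described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the checkpoint tuples, the score values, and the helper names (`select_by_loss`, `select_by_retrieval`) are hypothetical placeholders.

```python
# Sketch of retrieval-aware checkpoint selection: instead of keeping the
# checkpoint with the lowest pre-training loss, evaluate each saved
# checkpoint on a downstream retrieval benchmark and keep the one with
# the best retrieval score. All numbers below are illustrative.

checkpoints = [
    # (training step, pre-training loss, retrieval score on a dev benchmark)
    (10_000, 2.41, 41.2),
    (20_000, 2.18, 46.9),
    (30_000, 2.05, 47.5),  # retrieval peaks here ...
    (40_000, 1.97, 46.3),  # ... while the loss keeps falling
]

def select_by_loss(ckpts):
    """Conventional choice: checkpoint with the minimum pre-training loss."""
    return min(ckpts, key=lambda c: c[1])

def select_by_retrieval(ckpts):
    """Selection used here: checkpoint with the best downstream retrieval score."""
    return max(ckpts, key=lambda c: c[2])

print(select_by_loss(checkpoints)[0])       # 40000: loss minimum comes last
print(select_by_retrieval(checkpoints)[0])  # 30000: retrieval optimum comes earlier
```

With these toy numbers the two criteria disagree, which is exactly the situation the paper reports: the best retrieval checkpoint precedes the loss minimum.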
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text Embedding | MTEB | MTEB Score: 56.43 | 45 |
| Text Embedding | MTEB Turkish (test) | Overall MTEB Score: 56.84 | 23 |
| Retrieval | Legal | Legal Score: 47.52 | 10 |
| Legal Retrieval | Turkish Legal | Legal Score: 47.52 | 9 |
| Masked Language Modeling | Turkish Datasets (blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal) (test) | MLM Avg (%): 67.25 | 7 |
| Information Retrieval | Turkish Legal Domain (test) | Legal Score: 0.4752 | 6 |
| Classification | Turkish Classification Datasets (6-dataset average) | CLF Score: 74.8 | 3 |
| General Language Understanding | Turkish Non-Retrieval Tasks (consolidated) | Overall Accuracy: 81.8 | 3 |
| Named Entity Recognition | Turkish NER (2-dataset average) | NER Score: 84.8 | 3 |
| Natural Language Inference | MedNLI Turkish | NLI Accuracy: 84.3 | 3 |
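For reference, the 36.2% perplexity reduction cited in the abstract is a relative change between base-model and adapted-model perplexity. The sketch below shows the arithmetic only; the base perplexity of 20.0 is a made-up illustrative value, not a number from this work.

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

def relative_reduction(ppl_base: float, ppl_adapted: float) -> float:
    """Relative perplexity reduction in percent, e.g. the reported 36.2%."""
    return 100.0 * (ppl_base - ppl_adapted) / ppl_base

# Illustrative values only: a base perplexity of 20.0 reduced by 36.2%
# corresponds to an adapted perplexity of 20.0 * (1 - 0.362) = 12.76.
ppl_base = 20.0
ppl_adapted = ppl_base * (1 - 0.362)
print(round(relative_reduction(ppl_base, ppl_adapted), 1))  # 36.2
```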