Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain
About
This paper presents the Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions:

(1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that the optimal checkpoints reach their best retrieval scores before the pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) performing comparably to larger reference models (307M-567M parameters). Our approach reaches 92.36% production efficiency relative to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall while requiring fewer computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines; our single-stage pre-training followed by efficient post-training is a cost-effective alternative.

(2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. A four-phase CPT schedule with tuned sample ratios enables a gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves a 36.2% perplexity reduction on Turkish legal text, demonstrating clear domain adaptation gains.
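The retrieval-aware checkpoint selection described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the checkpoint tuples, the score values, and the helper names (`select_by_loss`, `select_by_retrieval`) are hypothetical placeholders.

```python
# Sketch of retrieval-aware checkpoint selection: instead of keeping the
# checkpoint with the lowest pre-training loss, evaluate each saved
# checkpoint on a downstream retrieval benchmark and keep the one with
# the best retrieval score. All numbers below are illustrative.

checkpoints = [
    # (training step, pre-training loss, retrieval score on a dev benchmark)
    (10_000, 2.41, 41.2),
    (20_000, 2.18, 46.9),
    (30_000, 2.05, 47.5),  # retrieval peaks here ...
    (40_000, 1.97, 46.3),  # ... while the loss keeps falling
]

def select_by_loss(ckpts):
    """Conventional choice: checkpoint with the minimum pre-training loss."""
    return min(ckpts, key=lambda c: c[1])

def select_by_retrieval(ckpts):
    """Selection used here: checkpoint with the best downstream retrieval score."""
    return max(ckpts, key=lambda c: c[2])

print(select_by_loss(checkpoints)[0])       # 40000: loss minimum comes last
print(select_by_retrieval(checkpoints)[0])  # 30000: retrieval optimum comes earlier
```

With these toy numbers the two criteria disagree, which is exactly the situation the paper reports: the best retrieval checkpoint precedes the loss minimum.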
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text Embedding | MTEB | MTEB Score: 56.43 | 45 |
| Text Embedding | MTEB Turkish (test) | Overall MTEB Score: 56.84 | 23 |
| Retrieval | Legal | Legal Score: 47.52 | 10 |
| Legal Retrieval | Turkish Legal | Legal Score: 47.52 | 9 |
| Masked Language Modeling | Turkish Datasets (blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal) (test) | MLM Avg (%): 67.25 | 7 |
| Information Retrieval | Turkish Legal Domain (test) | Legal Score: 0.4752 | 6 |
| Classification | Turkish Classification Datasets (6-dataset average) | CLF Score: 74.8 | 3 |
| General Language Understanding | Turkish Non-Retrieval Tasks (consolidated) | Overall Accuracy: 81.8 | 3 |
| Named Entity Recognition | Turkish NER (2-dataset average) | NER Score: 84.8 | 3 |
| Natural Language Inference | MedNLI Turkish | NLI Accuracy: 84.3 | 3 |
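For reference, the 36.2% perplexity reduction cited in the abstract is a relative change between base-model and adapted-model perplexity. The sketch below shows the arithmetic only; the base perplexity of 20.0 is a made-up illustrative value, not a number from this work.

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

def relative_reduction(ppl_base: float, ppl_adapted: float) -> float:
    """Relative perplexity reduction in percent, e.g. the reported 36.2%."""
    return 100.0 * (ppl_base - ppl_adapted) / ppl_base

# Illustrative values only: a base perplexity of 20.0 reduced by 36.2%
# corresponds to an adapted perplexity of 20.0 * (1 - 0.362) = 12.76.
ppl_base = 20.0
ppl_adapted = ppl_base * (1 - 0.362)
print(round(relative_reduction(ppl_base, ppl_adapted), 1))  # 36.2
```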