TabiBERT: A Large-Scale ModernBERT Foundation Model and A Unified Benchmark for Turkish
About
Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP has lacked a monolingual encoder trained from scratch with such modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on the ModernBERT architecture and trained from scratch on a large, curated corpus.

TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B-token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). It supports an 8,192-token context length (16× that of the original BERT), achieves up to a 2.65× inference speedup, and reduces GPU memory consumption, enabling larger batch sizes.

We also introduce TabiBench, a unified benchmark of 28 datasets across eight task categories with standardized splits and protocols, evaluated with GLUE-style macro-averaging. TabiBERT attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art results on five of the eight categories, with particularly strong gains in question answering (+9.55 points), code retrieval (+2.41 points), and academic understanding (+0.66 points). Compared with the prior task-specific best results, including specialized models such as TurkishBERTweet, TabiBERT achieves a +1.47 average improvement, indicating robust cross-domain generalization. We release model weights, training configurations, and evaluation code to support transparent, reproducible Turkish encoder research.
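GLUE-style macro-averaging means each task category contributes equally to the final benchmark score, regardless of how many datasets it contains. A minimal sketch of that aggregation (the category names and per-dataset scores below are illustrative placeholders, not the published TabiBench numbers):

```python
# GLUE-style macro-averaging: average per-dataset scores within each
# category first, then average the category means, so every category
# carries equal weight regardless of its dataset count.
# Categories and scores are illustrative placeholders only.
from statistics import mean

results = {
    "text_classification": [83.4, 79.1],
    "question_answering": [71.0],
    "named_entity_recognition": [84.9, 83.9],
}

category_means = {cat: mean(scores) for cat, scores in results.items()}
benchmark_score = mean(category_means.values())
print(round(benchmark_score, 2))
```

Note the design choice: a category with one dataset (question answering above) weighs as much as a category with several, which prevents dataset-rich categories from dominating the overall score.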
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Text Embedding | MTEB | MTEB Score | 52.52 | 45 |
| Text Embedding | MTEB Turkish (test) | Overall MTEB Score | 37.77 | 23 |
| Legal Retrieval | Turkish Legal | Legal Score | 38.54 | 9 |
| Masked Language Modeling | Turkish Datasets (blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal) (test) | MLM Avg (%) | 69.57 | 7 |
| Turkish Natural Language Understanding and Retrieval | TabiBench 1.0 (test) | Text Clf F1 | 83.44 | 5 |
| Turkish Natural Language Understanding | TabiBench 1.0 (test) | TabiBench Score | 77.58 | 4 |
| Named Entity Recognition | Turkish NER (2 datasets average) | NER Score | 84.4 | 3 |
| Semantic Textual Similarity | Turkish STS Dataset | STS Score | 0.85 | 3 |
| Classification | Turkish Classification Datasets (6 datasets average) | CLF Score | 72.2 | 3 |
| General Language Understanding | Turkish Non-Retrieval Tasks (consolidated) | Overall Accuracy | 77.2 | 3 |