Long-Context Encoder Models for Polish Language Understanding
About
While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed using a two-stage training procedure that involves positional embedding adaptation and full-parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.
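The abstract does not spell out the distillation objective used for the compressed variants. As an illustration only, the sketch below shows the generic soft-target formulation of knowledge distillation: a temperature-scaled KL divergence between teacher and student output distributions. The function names, the temperature value, and the choice of this particular loss are assumptions for the example, not details taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Hypothetical helper: the T^2 factor is the usual rescaling that keeps
    gradient magnitudes comparable across temperatures. The paper may use
    a different objective.
    """
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return kl * temperature ** 2


# A student that matches the teacher incurs (near) zero loss;
# a student that disagrees incurs a positive loss.
matching = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
disagreeing = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

In practice such a term is often combined with the ordinary task or masked-language-modelling loss; whether and how the authors combine them is not stated here.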
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Financial Language Understanding | FinBench 7 tasks (val) | FinBench Score | 85.19 | 13 |
| General Language Understanding | All tasks (25 tasks) (val) | Overall Accuracy | 85.93 | 13 |
| Language Understanding | Other tasks (9 tasks) (val) | Other Tasks Score | 83.92 | 13 |
| Language Understanding | KLEJ 9 tasks (val) | KLEJ Score | 88.52 | 13 |
| Long-context Language Understanding | Long tasks 4 tasks (val) | Long Tasks Score | 83.16 | 13 |
| Binary Classification | IMDB | Accuracy | 96.03 | 9 |
| Financial NLP | FinBench | Banking-Short Accuracy | 81.99 | 3 |
| General Polish Language Understanding | Average 25 Tasks | Average Score | 85.93 | 3 |
| Multi-Label Classification | MIPD | Weighted F1 | 68.5 | 3 |
| Multi-Label Classification | EURLEX | Weighted F1 | 79.77 | 3 |
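The group sizes listed in the table (7 + 9 + 9 + 4) add up to 29 rather than 25, which suggests that the four long-context tasks overlap with the other groups. Under that assumption, the overall score should equal the task-weighted mean of the FinBench, KLEJ, and "Other" group scores. The short check below uses only the numbers from the table; the grouping interpretation is an inference, not something the page states.

```python
# Group scores and task counts copied from the table above.
# Assumption: the 4 long-context tasks are already counted inside these groups.
groups = {
    "FinBench": (7, 85.19),
    "KLEJ": (9, 88.52),
    "Other": (9, 83.92),
}

total_tasks = sum(count for count, _ in groups.values())
overall = sum(count * score for count, score in groups.values()) / total_tasks

print(total_tasks)        # 25
print(round(overall, 2))  # 85.93
```

The result agrees with the reported overall score of 85.93 to two decimal places, which is consistent with that score being a plain per-task average over the 25 tasks.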