Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources
About
The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms for Romanized Sinhala to Sinhala transliteration, developed between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The datasets and corresponding tools that are currently openly accessible are made available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.
Deshan Sumanathilaka, Sameera Perera, Sachithya Dharmasiri, Maneesha Athukorala, Anuja Dilrukshi Herath, Rukshan Dias, Pasindu Gamage, Ruvan Weerasinghe, Y.H.P.P. Priyadarshana • 2025
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Romanized Sinhala to Sinhala Transliteration | SinMix2Mono Golden dataset 1.0 (test) | BLEU 49.7 | 11 |
| Back-transliteration | IndoNLP (Set 1) | -- | 6 |
| Back-transliteration | IndoNLP (Set 2) | -- | 6 |
| Transliteration disambiguation | Transliteration disambiguation Dataset (Set 1) | -- | 5 |
| Transliteration disambiguation | Transliteration disambiguation Dataset (Set 2) | -- | 5 |
| Code-Mixed Transliteration Ambiguity Resolution | SinMix2Mono Code-Mixed transliteration ambiguity (test) | BLEU 49.29 | 4 |
| Transliteration Ambiguity Resolution | Sinhala transliteration ambiguity dataset 2025a (test) | BLEU 77.64 | 4 |