
PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation

About

This work introduces PrahokBART, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving pre-training corpus quality and on addressing linguistic issues of Khmer that are ignored by existing multilingual models, by incorporating linguistic components such as word segmentation and text normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insight into the impact of each linguistic module and evaluates how effectively our model handles spaces during text generation, which is crucial for the naturalness of Khmer text.
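The abstract mentions word segmentation and normalization as preprocessing steps, but does not specify the exact tools used. As a rough illustration of this kind of Khmer preprocessing, the sketch below (with hypothetical function names, not the paper's actual pipeline) applies Unicode NFC normalization and splits on the zero-width space (U+200B), which Khmer text often uses to mark word boundaries:

```python
import unicodedata

ZWSP = "\u200b"  # zero-width space, commonly used as a word-boundary hint in Khmer

def normalize_khmer(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of zero-width spaces."""
    text = unicodedata.normalize("NFC", text)
    while ZWSP + ZWSP in text:
        text = text.replace(ZWSP + ZWSP, ZWSP)
    return text.strip(ZWSP)

def segment_on_zwsp(text: str) -> list[str]:
    """Split a string into word-like units wherever a zero-width space occurs.

    Real Khmer segmenters are more sophisticated (ZWSP is inconsistently
    present in raw text), so this only shows the boundary convention itself.
    """
    return [tok for tok in normalize_khmer(text).split(ZWSP) if tok]
```

A segmenter like this only recovers boundaries that are already marked; the paper's point is precisely that such linguistic handling must be built into the pre-training pipeline rather than left to raw text.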

Hour Kaing, Raj Dabre, Haiyue Song, Van-Hien Tran, Hideki Tanaka, Masao Utiyama • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text Summarization | Text Summarization | ROUGE-L | 26.23 | 16 |
| Headline Generation | HeadGen | ROUGE-L | 22.92 | 4 |
| Machine Translation | English-Khmer Translation Dataset | COMET | 77.69 | 4 |
| Machine Translation | Khmer-English Translation Dataset | COMET | 82 | 4 |
| Machine Translation | English-Khmer (en→km) | BLEU | 24.64 | 4 |
| Machine Translation | Khmer-English (km→en) | BLEU | 27.76 | 4 |
| Text Summarization | Lr-sum (test) | ROUGE-1 | 30.6 | 4 |
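Several of the results above are ROUGE-L scores. For readers unfamiliar with the metric, here is a minimal sketch of ROUGE-L F1 computed from the longest common subsequence of token lists (official implementations add options such as stemming and sentence-level aggregation, which are omitted here):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming longest common subsequence over tokens.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: list[str], reference: list[str]) -> float:
    # ROUGE-L F1: harmonic mean of LCS-based precision and recall.
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Note that for Khmer, the tokenization feeding such a metric depends on word segmentation, which is one reason the paper's analysis of space handling matters for evaluation.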
