BanglaLlama: LLaMA for Bangla Language

About

Bangla is spoken by approximately 240 million native speakers and around 300 million people worldwide. Despite being the fifth most spoken language in the world, Bangla is still a "low-resource" language, and existing pretrained language models often struggle to perform well on Bangla Language Processing (BLP) tasks. This paper addresses this gap by: (1) introducing two high-quality translated Bangla instruction datasets totaling 224k samples, Bangla-Orca (172k) and Bangla-Alpaca (52k); and (2) leveraging these datasets to develop BanglaLlama, an open-source family of Bangla-specific LLMs consisting of five base and instruct variants. We present our methodology, the two datasets, and comprehensive benchmarking results showcasing the effectiveness of our datasets and models on multiple benchmarks. We believe our proposed datasets and models will serve as the new standard baseline for future research focused on this widely spoken yet "low-resource" language.

Abdullah Khan Zehady, Shubhashis Roy Dipta, Naymul Islam, Safi Al Mamun, Santu Karmaker • 2024

Related benchmarks

Task | Dataset | Result | Rank
Commonsense Reasoning | PIQA 1.0 (test) | Accuracy: 53 | 48
Commonsense Reasoning | CommonsenseQA (CSQA) v1.0 (test) | Accuracy: 24 | 46
Multiple-choice Question Answering | Bangla MMLU 1.0 (test) | Accuracy: 33 | 33
Open-Book Question Answering | OpenBookQA 1.0 (test) | Accuracy: 33 | 33
Yes/No Reading Comprehension | BoolQ 1.0 (test) | Normalized Accuracy: 54 | 33