BanglaLlama: LLaMA for Bangla Language

About

Bangla is spoken by approximately 240 million native speakers and around 300 million people worldwide. Despite being the fifth most widely spoken language in the world, Bangla is still a "low-resource" language, and existing pretrained language models often struggle to perform well on Bangla Language Processing (BLP) tasks. This paper addresses this gap by: (1) introducing two high-quality translated Bangla instruction datasets totaling 224k samples, Bangla-Orca (172k) and Bangla-Alpaca (52k); and (2) leveraging these datasets to develop BanglaLlama, an open-source family of Bangla-specific LLMs consisting of five base and instruct variants. We present our methodology, the two datasets, and comprehensive benchmarking results showcasing the effectiveness of our datasets and models on multiple benchmarks. We believe our proposed datasets and models will serve as the new standard baseline for future research focused on this widely spoken yet "low-resource" language.

Abdullah Khan Zehady, Shubhashis Roy Dipta, Naymul Islam, Safi Al Mamun, Santu Karmaker • 2024

Related benchmarks

Task	Dataset	Metric	Result	
Commonsense Reasoning	PIQA 1.0 (test)	Accuracy	53	64
Commonsense Reasoning	CommonsenseQA (CSQA) v1.0 (test)	Accuracy	24	46
Multiple-choice Question Answering	Bangla MMLU 1.0 (test)	Accuracy	33	33
Open-Book Question Answering	OpenBookQA 1.0 (test)	Accuracy	33	33
Yes/No Reading Comprehension	BoolQ 1.0 (test)	Normalized Accuracy	54	33
