Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation

About

This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism. We combine a generator LLM for creating new samples and an embedder LLM to ensure diversity. A new Masked Language Modelling (MLM) model-based quality assessment metric is proposed for evaluating and filtering the corpora. Utilizing the llama2-70b as the generator and a multilingual sentence transformer as embedder, we generate an Italian chat corpus and refine the Fauno corpus, which is based on translated English ChatGPT self-chat data. The refinement uses structural assertions and Natural Language Processing techniques. Both corpora undergo a comprehensive quality evaluation using the proposed MLM model-based quality metric. The Italian LLM fine-tuned with these corpora demonstrates significantly enhanced language comprehension and question-answering skills. The resultant model, cerbero-7b, establishes a new state-of-the-art for Italian LLMs. This approach marks a substantial advancement in the development of language-specific LLMs, with a special emphasis on augmenting corpora for underrepresented languages like Italian.
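The abstract says an embedder LLM is used to ensure diversity among generated samples, but it does not spell out the mechanism. One plausible reading is a greedy similarity filter over sentence embeddings, sketched below. The function names `cosine` and `filter_diverse`, the `max_similarity` threshold, and the toy vectors are all hypothetical and not taken from the paper; real embeddings would come from a multilingual sentence transformer rather than hand-written lists.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_diverse(
    embeddings: Sequence[Sequence[float]],
    max_similarity: float = 0.9,  # hypothetical threshold, not from the paper
) -> List[int]:
    """Greedily keep the indices of samples that are not too similar
    to any sample already kept."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < max_similarity for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy 2-D "embeddings": the second vector nearly duplicates the first.
    toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
    print(filter_diverse(toy, max_similarity=0.9))
```

In this sketch a candidate is rejected as soon as it is too close to any previously accepted sample, so the order of generation determines which of two near-duplicates survives.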

Federico A. Galatolo, Mario G.C.A. Cimino • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Translation | FLORES-200 it-en (devtest) | sacreBLEU 29.1301 | 35 |
| Translation | FLORES-200 en-it (devtest) | sacreBLEU 25.6956 | 35 |
| Machine Translation | NTREX (en->it) 128 (test) | sacreBLEU 29.7079 | 35 |
| Machine Translation | Wikinews-25 en->it | sacreBLEU 35.3794 | 35 |
| Machine Translation | Wikinews-25 it->en | sacreBLEU 33.4609 | 35 |
| Machine Translation | NTREX it->en 128 (test) | sacreBLEU 30.976 | 35 |
| Machine Translation | Tatoeba en->it | sacreBLEU 46.7861 | 33 |
| Machine Translation | Tatoeba it->en | sacreBLEU 49.1672 | 33 |
