LlamBERT: Large-scale low-cost data annotation in NLP

About

Large Language Models (LLMs), such as GPT-4 and Llama 2, show remarkable proficiency in a wide range of natural language processing (NLP) tasks. Despite their effectiveness, the high costs associated with their use pose a challenge. We present LlamBERT, a hybrid approach that leverages LLMs to annotate a small subset of large, unlabeled databases and uses the results for fine-tuning transformer encoders like BERT and RoBERTa. This strategy is evaluated on two diverse datasets: the IMDb review dataset and the UMLS Meta-Thesaurus. Our results indicate that the LlamBERT approach slightly compromises on accuracy while offering much greater cost-effectiveness.

Bálint Csanády, Lajos Muzsai, Péter Vedres, Zoltán Nádasdy, András Lukács • 2024
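
The workflow described in the abstract can be sketched as a three-step pipeline: sample a small subset, have the expensive model label it, then train a cheap model on those labels. The following is a minimal sketch under stated assumptions, not the authors' implementation. `llm_annotate` is a hypothetical keyword stub standing in for a prompted LLM, and `train_student` is a bag-of-words perceptron standing in for fine-tuning BERT or RoBERTa. Running the student over the full corpus afterwards is also an assumption about how the trained model would be used. All function names and word lists are illustrative.

```python
import random
from collections import Counter

# Hypothetical word lists used only by the stub annotator below.
POSITIVE_WORDS = {"great", "good", "wonderful", "loved"}
NEGATIVE_WORDS = {"bad", "awful", "boring", "hated"}


def llm_annotate(text):
    """Stub for the costly LLM labeling step: 1 = positive, 0 = negative.

    A real pipeline would send the text to an LLM and parse its answer.
    """
    words = text.lower().split()
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    return 1 if pos >= neg else 0


def train_student(texts, labels, epochs=10):
    """Perceptron over bag-of-words; stands in for encoder fine-tuning."""
    weights = Counter()
    for _ in range(epochs):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            pred = 1 if sum(weights[w] for w in words) > 0 else 0
            if pred != label:
                # Move the weights of this text's words toward the true label.
                step = 1 if label == 1 else -1
                for w in words:
                    weights[w] += step
    return weights


def student_predict(weights, text):
    """Cheap inference with the trained student model."""
    return 1 if sum(weights[w] for w in text.lower().split()) > 0 else 0


def llambert_sketch(unlabeled, subset_size, seed=0):
    """Label a random subset with the LLM stub, train, then label everything."""
    rng = random.Random(seed)
    subset = rng.sample(unlabeled, subset_size)   # only these hit the "LLM"
    labels = [llm_annotate(t) for t in subset]    # expensive step, done once
    weights = train_student(subset, labels)       # cheap model learns from them
    return [student_predict(weights, t) for t in unlabeled]
```

In this sketch the expensive annotator is called `subset_size` times, while the cheap student handles every item in the corpus. That split is the source of the cost saving the abstract refers to.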

Related benchmarks

Task	Dataset	Result
Image Classification	MNIST	Accuracy 99.87	417
Sentiment Analysis	IMDB (test)	Accuracy 96.68	306
Image Classification	Fashion MNIST	Accuracy 96.91	240
UMLS classification	UMLS (test)	Accuracy 96.92	9
