Ragas: Automated Evaluation of Retrieval Augmented Generation

About

We introduce Ragas (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval module and an LLM-based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, and the quality of the generation itself. With Ragas, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.

Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert • 2023

Related benchmarks

Task	Dataset	Result
Hallucination Detection	RAGTruth (test)	AUROC 0.7541	99
Hallucination Detection	Dolly AC (test)	AUC 36.28	33
Hallucination Detection	RAGTruth Llama2-7B (test)	Accuracy 68.22	21
Hallucination Detection	Dolly Llama2-7B (test)	Accuracy 65.6	21
Hallucination Detection	RAGTruth Llama2-13B (test)	Accuracy 70.8	21
Hallucination Detection	Dolly Llama2-13B (test)	Accuracy 64.8	21
Hallucination Detection	Dolly AC LLaMA3-8B	Recall 80	19
Hallucination Detection	RAGTruth LLaMA2-7B	Recall 0.6327	19
Hallucination Detection	RAGTruth LLaMA2-13B	Recall 67.63	19
Hallucination Detection	Dolly AC LLaMA2-7B	Recall 53.45	19

Showing 10 of 32 rows
