Ragas: Automated Evaluation of Retrieval Augmented Generation
About
We introduce Ragas (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval module and an LLM-based generation module, and provide LLMs with knowledge from a reference textual database. This enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, and the quality of the generation itself. With Ragas, we put forward a suite of metrics that can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Hallucination Detection | RAGTruth (test) | AUROC | 0.7541 | 83 |
| Hallucination Detection | Dolly AC (test) | AUC | 36.28 | 33 |
| Hallucination Detection | RAGTruth Llama2-7B (test) | Accuracy | 68.22 | 21 |
| Hallucination Detection | Dolly Llama2-7B (test) | Acc | 65.6 | 21 |
| Hallucination Detection | RAGTruth Llama2-13B (test) | Acc | 70.8 | 21 |
| Hallucination Detection | Dolly Llama2-13B (test) | Accuracy | 64.8 | 21 |
| Hallucination Detection | Dolly AC LLaMA3-8B | Recall | 80 | 19 |
| Hallucination Detection | RAGTruth LLaMA2-7B | Recall | 0.6327 | 19 |
| Hallucination Detection | RAGTruth LLaMA2-13B | Recall | 67.63 | 19 |
| Hallucination Detection | Dolly AC LLaMA2-7B | Recall | 53.45 | 19 |