Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)
About
Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) are transforming applications in domains such as healthcare, finance, and customer service. Evaluating RAG systems nevertheless remains difficult, because generated outputs are stochastic and the retrieval and generation components interact in intricate ways. This paper introduces Deepchecks, a framework tailored to evaluating RAG applications. It combines multi-faceted evaluation with root cause analysis and production monitoring. By aligning evaluation with application-specific requirements, the Deepchecks framework provides a foundation for assessing reliability, relevance, and user satisfaction in RAG systems.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Factual Grounding Evaluation | TRUE (sampled 100 entries from each of 11 datasets) | ROC-AUC | 0.86 | 3 |
| Factual Grounding Evaluation | SQuAD | ROC-AUC | 92 | 3 |
| Factual Grounding Evaluation | PubMedQA | ROC-AUC | 0.84 | 3 |
| RAG Evaluation | RAG-dataset-12000 | Accuracy | 96 | 3 |
| RAG Evaluation | SQuAD | Accuracy | 94 | 3 |
| RAG Evaluation | HAGRID | Accuracy | 94 | 3 |
| RAG Evaluation | Client 1 Tech Support | Accuracy | 70 | 3 |
| RAG Evaluation | Client 2 Job Profiles | Accuracy | 81 | 3 |
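
For readers unfamiliar with the two metrics reported in the table above, the following is a minimal sketch of how ROC-AUC and accuracy can be computed from an evaluator's scores and human labels. It is not the Deepchecks API or the authors' evaluation code. The function names, the 0.5 threshold, and the toy labels and scores are all hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label.
    The threshold is an illustrative assumption, not a value from the paper."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data (hypothetical): 1 = answer is grounded in the retrieved context.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]

print(f"ROC-AUC:  {roc_auc(labels, scores):.3f}")
print(f"Accuracy: {accuracy(labels, scores):.3f}")
```

ROC-AUC is independent of any decision threshold, which makes it a natural fit for scoring a grounding evaluator that outputs a continuous score, whereas accuracy requires committing to a single cut-off or to discrete labels.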