
Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)

About

Large Language Models (LLMs) augmented with Retrieval-Augmented Generation (RAG) techniques are revolutionizing applications across multiple domains, such as healthcare, finance, and customer service. Despite their potential, evaluating RAG systems remains a complex challenge due to the stochastic nature of generated outputs and the intricate interplay between retrieval and generation components. This paper introduces Deepchecks, a comprehensive framework tailored for evaluating RAG applications. The Deepchecks framework addresses the evaluation of RAG applications through a multi-faceted approach, root cause analysis, and production monitoring. By ensuring alignment with application-specific requirements, the Deepchecks framework provides a robust foundation for assessing reliability, relevance, and user satisfaction in RAG systems.

Assaf Gerner, Netta Madvil, Nadav Barak, Alex Zaikman, Jonatan Liberman, Liron Hamra, Rotem Brazilay, Shay Tsadok, Yaron Friedman, Neal Harow, Noam Bresler, Shir Chorev, Philip Tannor, Lior Rokach • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Factual Grounding Evaluation | TRUE (sampled 100 entries from each of 11 datasets) | ROC-AUC | 0.86 | 3 |
| Factual Grounding Evaluation | SQuAD | ROC-AUC | 92 | 3 |
| Factual Grounding Evaluation | PubMedQA | ROC-AUC | 0.84 | 3 |
| RAG Evaluation | RAG-dataset-12000 | Accuracy | 96 | 3 |
| RAG Evaluation | SQuAD | Accuracy | 94 | 3 |
| RAG Evaluation | HAGRID | Accuracy | 94 | 3 |
| RAG Evaluation | Client 1 Tech Support | Accuracy | 70 | 3 |
| RAG Evaluation | Client 2 Job Profiles | Accuracy | 81 | 3 |
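
The benchmarks above report ROC-AUC and accuracy. As a rough illustration of what those two metrics measure, here is a minimal sketch in plain Python. It is not taken from the Deepchecks framework: the function names `roc_auc` and `accuracy`, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for this example. It returns values on a 0 to 1 scale.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one,
    with ties counted as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("accuracy is undefined for an empty input")
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


# Hypothetical toy data: 1 = answer is grounded in the retrieved context.
labels = [1, 1, 0, 1, 0, 0]
# Hypothetical grounding scores produced by some evaluator.
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
# Assumed decision threshold of 0.5 to turn scores into hard predictions.
predictions = [1 if s >= 0.5 else 0 for s in scores]

print(f"ROC-AUC:  {roc_auc(labels, scores):.3f}")
print(f"Accuracy: {accuracy(labels, predictions):.3f}")
```

On this toy data the sketch gives a ROC-AUC of 8/9 and an accuracy of 5/6. ROC-AUC depends only on how the scores rank positives against negatives, while accuracy also depends on the chosen threshold.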
