ChatQA: Surpassing GPT-4 on Conversational QA and RAG
About
In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA). To enhance generation, we propose a two-stage instruction tuning method that significantly boosts RAG performance. For effective retrieval, we introduce a dense retriever optimized for conversational QA, which yields results comparable to state-of-the-art query rewriting models while substantially reducing deployment costs. We also present the ChatRAG Bench, which comprises ten datasets covering comprehensive evaluations of RAG, table-related QA, arithmetic calculations, and scenarios involving unanswerable questions. Our ChatQA-1.0-70B (score: 54.14), built on Llama2, a weaker foundation model than GPT-4, slightly outperforms GPT-4-0613 (score: 53.90) and GPT-4-Turbo-2024-04-09 (score: 54.03) on the ChatRAG Bench, without relying on any synthetic data from OpenAI GPT models. Notably, the Llama3-ChatQA-1.5-70B model surpasses the accuracy of GPT-4-Turbo-2024-04-09, achieving a 4.4% improvement. To advance research in this field, we open-source the model weights, instruction tuning data, ChatRAG Bench, and retriever for the community: https://chatqa-project.github.io/.
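As a concrete illustration of the retrieval side, the sketch below encodes a multi-turn dialogue into a single dense query and ranks candidate passages by dot-product similarity, the standard bi-encoder pattern a conversational dense retriever follows. This is a minimal sketch, not the project's documented API: the checkpoint names (`nvidia/dragon-multiturn-query-encoder`, `nvidia/dragon-multiturn-context-encoder`) and the turn-flattening format are assumptions based on the released artifacts.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint names for the open-sourced conversational retriever;
# the exact identifiers may differ from the released artifacts.
QUERY_ENCODER = "nvidia/dragon-multiturn-query-encoder"
CONTEXT_ENCODER = "nvidia/dragon-multiturn-context-encoder"

tokenizer = AutoTokenizer.from_pretrained(QUERY_ENCODER)
query_encoder = AutoModel.from_pretrained(QUERY_ENCODER)
context_encoder = AutoModel.from_pretrained(CONTEXT_ENCODER)

# Flatten the multi-turn conversation into one query string so the retriever
# can resolve references against earlier turns (assumed format).
dialogue = [
    ("user", "What is the ChatRAG Bench?"),
    ("agent", "It is a benchmark of ten conversational QA datasets."),
    ("user", "Which tasks does it cover?"),
]
query = "\n".join(f"{role}: {text}" for role, text in dialogue)

passages = [
    "ChatRAG Bench covers RAG, table-related QA, arithmetic, and unanswerable questions.",
    "Llama2 is a family of open foundation models released by Meta.",
]

with torch.no_grad():
    q_inputs = tokenizer(query, return_tensors="pt", truncation=True, max_length=512)
    q_emb = query_encoder(**q_inputs).last_hidden_state[:, 0, :]  # [CLS] embedding

    c_inputs = tokenizer(passages, padding=True, truncation=True,
                         max_length=512, return_tensors="pt")
    c_emb = context_encoder(**c_inputs).last_hidden_state[:, 0, :]

# Rank passages by dot-product similarity; the top passage feeds the generator.
scores = (q_emb @ c_emb.T).squeeze(0)
for i in scores.argsort(descending=True).tolist():
    print(f"{scores[i].item():.3f}  {passages[i]}")
```

Keeping separate query and context encoders means passage embeddings can be precomputed and indexed offline, which is where the deployment-cost savings over per-turn query rewriting come from: only the short flattened query is encoded at serving time.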
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | 2WikiMultihopQA | EM | 34.9 | 278 |
| Multi-hop Question Answering | HotpotQA | F1 | 54.4 | 221 |
| Question Answering | PopQA | Accuracy | 59.8 | 186 |
| Question Answering | TriviaQA | Accuracy | 91.4 | 85 |
| Fact Verification | FEVER | Accuracy | 92.7 | 67 |
| Question Answering | NQ (Natural Questions) | EM | 47 | 55 |
| Question Answering | MuSiQue | Accuracy | 75 | 36 |
| Question Answering | SQuAD | Accuracy | 77 | 27 |
| Question Answering | RealtimeQA | Accuracy | 56.7 | 27 |
| Question Answering | FaithEval | Accuracy | 56.2 | 27 |
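For reference, the EM and F1 figures in the table above follow the usual extractive-QA definitions: EM checks normalized string equality between prediction and gold answer, while F1 measures their token overlap. Below is a minimal sketch of the standard SQuAD-style scoring; individual leaderboards may apply their own normalization rules, so treat this as illustrative rather than the exact evaluation code.

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))      # 1.0 after normalization
print(f"{f1('tower in Paris', 'the Eiffel Tower'):.2f}")    # 0.40, partial overlap
```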