
A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents

About

RAG-based question answering (QA) in specialist domains faces a cold-start problem: a lack of evaluation benchmarks and an absence of labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a novel benchmark construction and evaluation framework that requires only a small set of specialist domain documents. DoRA systematically generates synthetic QA training and evaluation datasets with auditable evidence across five domain-specific intents. To mitigate same-pipeline circularity, DoRA generates its training and test splits with different LLM families (Claude Sonnet for training; GPT-4o for test) and draws them from disjoint seed-document corpora. Instantiated on 40 defense-related documents (written in English), DoRA yields ~6.6K curated instances. Compared against 8 LLM baselines on a benchmark of 1,259 samples, a LoRA-adapted Llama3.1-8B trained on the synthetic training set consistently improves performance across 6 coverage and faithfulness metrics, most notably reducing hallucination by more than half under the default GTE retrieval setting, with gains persisting across alternative retrievers and prompting-based baselines. Defense-domain expertise is incorporated in three stages of our evaluation: (a) determining the quality of the synthetic QA generated by DoRA, (b) ascertaining the reliability of LLM-as-judge scores, and (c) evaluating the generalization of the QA pipeline on completely human-written QA examples. We position DoRA as a practical framework for specialist-domain RAG under domain shift, with defense as a high-stakes case study.

Bao Gia Doan, Aditya Joshi, Pantelis Elinas, Aarya Bodhankar, Oscar Leslie, Tom Marchant, Flora Salim • 2026

Related benchmarks

Task | Dataset | Result | Rank
End-to-end Question Answering | DoRA full (test) | Token F1 Score: 56.53 | 9
End-to-end Question Answering | DoRA IBM Granite retriever (test) | Token F1 Score: 57.52 | 9
Question Answering | DoRA Gold Evidence Context | Token F1: 70.6 | 9
Question Answering | expert-curated (test) | Token F1: 31.65 | 4
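
The results above are reported as Token F1, but this page does not say how the metric is computed. As a rough illustration only, the sketch below assumes the common SQuAD-style definition: the harmonic mean of token-level precision and recall between a predicted and a reference answer. The function name `token_f1`, the lowercase whitespace tokenization, and the example strings are assumptions made for this sketch and may differ from the normalization the authors actually use.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two answers (assumed SQuAD-style definition)."""
    # Assumption: lowercase + whitespace split; the paper may normalize differently.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example strings, not taken from the DoRA benchmark.
score = token_f1("the radar range is 40 km", "radar range 40 km")
print(f"{score:.2f}")
```

This function returns a value between 0 and 1; the figures in the table appear to be on a 0-100 scale, which would correspond to multiplying the averaged per-example score by 100.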