L-MARS: Legal Multi-Agent System with Agentic Search and Citation-Faithfulness Audit

About

Large language models are increasingly deployed for legal question answering, where evaluations typically focus on multiple-choice accuracy. This measure overlooks a common failure mode: the source cited for an answer may not exist, or may not support the rule the system attributes to it. We present L-MARS, an open multi-agent legal QA system with agentic search and judge-driven evidence checks, and audit it claim by claim against its cited sources. Each atomic claim is labelled with a six-class taxonomy and scored with strict-ALCE under cross-provider judging, where the answerer and verifier come from different model families. On a stratified 100-question Bar Exam audit, retrieval barely moves accuracy, yet the multi-turn judge loop lifts strict citation F1 from 0.13 (naive RAG) to 0.25 and cuts the no-citation rate from 34% to 13%. We further introduce Faith-Search, a post-draft step that re-verifies and repairs unreachable citations; it drops the unreachable rate below 1% but does not improve F1 over the multi-turn loop, so we report it as a targeted reachability intervention rather than a faithfulness breakthrough. A 50-question LegalSearchQA case study confirms the picture: retrieve-then-draft pipelines saturate near 0.75 citation F1, while a single-agent web-search baseline collapses to 0.22 under external audit.
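To make the audited quantities concrete, here is a minimal sketch of how a citation F1, a no-citation rate, and an unreachable rate could be computed from per-claim verdicts. It is not the authors' strict-ALCE implementation: the `Claim` record, the three verdict labels, the `audit` function, and the per-claim and per-citation denominators are all assumptions made for illustration. In particular, the abstract does not say whether its rates are computed per claim, per citation, or per answer.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    # Hypothetical record: one verdict per citation attached to the claim.
    # Assumed labels: "supports", "irrelevant", "unreachable".
    verdicts: list[str] = field(default_factory=list)


def audit(claims: list[Claim]) -> dict[str, float]:
    """Aggregate per-claim verdicts into illustrative citation metrics."""
    all_verdicts = [v for c in claims for v in c.verdicts]
    n_claims, n_cites = len(claims), len(all_verdicts)

    # Recall (assumed definition): share of claims with at least one supporting citation.
    supported_claims = sum(1 for c in claims if "supports" in c.verdicts)
    recall = supported_claims / n_claims if n_claims else 0.0

    # Precision (assumed definition): share of citations that support their claim.
    supporting_cites = sum(1 for v in all_verdicts if v == "supports")
    precision = supporting_cites / n_cites if n_cites else 0.0

    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Rates over assumed denominators: claims for no-citation, citations for unreachable.
    uncited = sum(1 for c in claims if not c.verdicts)
    unreachable = sum(1 for v in all_verdicts if v == "unreachable")
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "no_citation_rate": uncited / n_claims if n_claims else 0.0,
        "unreachable_rate": unreachable / n_cites if n_cites else 0.0,
    }


example = [
    Claim(["supports"]),
    Claim(["irrelevant", "unreachable"]),
    Claim([]),
    Claim(["supports", "irrelevant"]),
]
metrics = audit(example)
```

Under these assumed definitions, repairing an unreachable link lowers `unreachable_rate` but raises F1 only if the repaired source also supports the claim. This is one way the abstract's result can arise: Faith-Search drops the unreachable rate below 1% without improving F1.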

Boqin Yuan, Ziqi Wang • 2025

Related benchmarks

Task	Dataset	Result	Rank
Legal Question Answering	Bar Exam QA	Accuracy: 55.9	5
Legal Question Answering	LegalSearchQA (50 questions)	Accuracy: 96	3
