Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

About

Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points. We formulate this requirement as a legal-relevance-sensitive evaluation problem: LLMs should be sensitive only to legally relevant changes. We introduce a unified evaluation suite covering should-change and should-not-change evaluation across judicial fairness, robustness, and statute-confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact-statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy but also calibrated sensitivity to legally material changes.
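
The should-change / should-not-change framing described above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `PerturbationCase`, `relevance_sensitivity`, and `toy_model`, the keyword-based classifier, and the three example cases. The sketch scores a model separately on perturbations that are legally material (the prediction should change) and on perturbations that are legally irrelevant (the prediction should stay the same).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class PerturbationCase:
    """Hypothetical record pairing a case text with a perturbed variant."""
    original: str
    perturbed: str
    should_change: bool  # True if the perturbation is legally material


def relevance_sensitivity(
    model: Callable[[str], str],
    cases: Iterable[PerturbationCase],
) -> Dict[str, Optional[float]]:
    """Fraction of cases handled as required, split by perturbation type."""
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for case in cases:
        # Did the model's prediction change under the perturbation?
        changed = model(case.original) != model(case.perturbed)
        totals[case.should_change] += 1
        if changed == case.should_change:
            hits[case.should_change] += 1

    def rate(kind: bool) -> Optional[float]:
        # None signals that no cases of this kind were supplied.
        return hits[kind] / totals[kind] if totals[kind] else None

    return {
        "should_change_rate": rate(True),
        "should_not_change_rate": rate(False),
    }


def toy_model(text: str) -> str:
    """Deliberately naive stand-in for a legal LLM (illustrative labels only)."""
    return "robbery" if "threatened" in text.lower() else "theft"


BASE = "The defendant took a wallet."
CASES = [
    # Material change: a threat is added, so the prediction should change.
    PerturbationCase(BASE, "The defendant threatened the victim and took a wallet.", True),
    # Irrelevant attribute: occupation should not affect the prediction.
    PerturbationCase(BASE, "The defendant, a farmer, took a wallet.", False),
    # Misleading framing: the keyword appears, but no threat occurred.
    PerturbationCase(BASE, "A witness said nobody was threatened; the defendant took a wallet.", False),
]

scores = relevance_sensitivity(toy_model, CASES)
print(scores)
```

On these toy cases the keyword model reacts correctly to the material change and ignores the irrelevant attribute, but it is misled by the third case, where the trigger word appears without any legally material change. Reporting the two rates separately keeps that failure from being hidden by a single aggregate accuracy figure.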

Chen Linze, Cai Yufan, Hou Zhe, Dong Jin Song • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Criminal Sentencing and Legal Validity Evaluation | LeCaRD v2 | RMSE 9.98 | 32 |
| Criminal Sentencing, Legal Validity, and Suspect-Level Performance Evaluation | LEEC | RMSE 20.95 | 28 |
| Specific Provision Prediction | LeCaRD v2 | Precision 81.03 | 24 |
| General Provision Prediction | LeCaRD v2 | Precision 34 | 24 |
| General Provision Prediction | LEEC Suspect-Level | Precision 64.05 | 8 |
| Specific Provision Prediction | LEEC Suspect-Level | Precision 82.35 | 8 |
| Suspect-level Provision Prediction | LEEC | SusF1 98.8 | 8 |
| Adversarial Robustness in Statute Prediction | Legal Statute Dataset (Adversarial) | ASR 0.4988 | 4 |
| Counterfactual Statute Prediction | Legal Statute Dataset (Counterfactual) | Overall Accuracy 81.01 | 4 |
| Legal Statute Selection | Legal Sensitivity Evaluation Confusing-Statute Clusters (RQ5 error analysis) | Positivity Rate 88.71 | 4 |