Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

About

Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may not faithfully reflect the model's actual decision process. Existing CoT unfaithfulness detectors mainly rely on external signals from generated rationales, such as textual plausibility or answer consistency, while overlooking evidence from the model's internal computation. Although recent circuit tracing methods provide a way to obtain model-internal evidence by tracing how information flows through model components during reasoning, constructing full reasoning circuits for long CoTs is costly and difficult to scale. To address these challenges, we propose Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a framework for instance-level CoT unfaithfulness detection. The key idea is that faithful reasoning traces should align with the model's computational process, whereas unfaithful traces may diverge from it. CIE-Scorer efficiently traces compact sentence-level circuits from informative reasoning tokens, constructs internal and external reasoning graphs, and measures their discrepancy using Fused Gromov-Wasserstein distance. Experiments on four datasets from FaithCoT-Bench show that CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction, demonstrating the effectiveness of combining mechanistic interpretability signals with external reasoning traces for CoT unfaithfulness detection.
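
To make the discrepancy measure concrete, the following is a minimal sketch of the Fused Gromov-Wasserstein objective in its standard form, evaluated at a fixed node coupling. It is not the CIE-Scorer implementation: the function name `fgw_cost`, the toy three-sentence graphs, the feature cost matrix, and the choice `alpha=0.5` are all illustrative assumptions. The actual distance minimizes this objective over all couplings with the given marginals, so the value at a fixed coupling is only an upper bound.

```python
from itertools import product


def fgw_cost(M, C1, C2, T, alpha=0.5):
    """Evaluate the Fused Gromov-Wasserstein objective at a fixed coupling T.

    M  : n x m feature cost between nodes of graph 1 and graph 2
    C1 : n x n structure matrix of graph 1 (here: adjacency)
    C2 : m x m structure matrix of graph 2
    T  : n x m coupling (entries sum to 1)
    alpha trades off the feature term (alpha=0) against the structure term (alpha=1).
    """
    n, m = len(C1), len(C2)
    # Feature (Wasserstein) term: cost of matching node i to node j.
    feature = sum(M[i][j] * T[i][j] for i in range(n) for j in range(m))
    # Structure (Gromov-Wasserstein) term: squared mismatch between the
    # relation (i, k) in graph 1 and the relation (j, l) in graph 2.
    structure = sum(
        (C1[i][k] - C2[j][l]) ** 2 * T[i][j] * T[k][l]
        for i, k in product(range(n), repeat=2)
        for j, l in product(range(m), repeat=2)
    )
    return (1 - alpha) * feature + alpha * structure


# Hypothetical toy data: three reasoning sentences s0, s1, s2.
# External graph: the written rationale claims the chain s0 - s1 - s2.
external = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
# Internal graph (assumed): s2 depends directly on s0 and s1 is unused.
internal = [[0, 0, 1],
            [0, 0, 0],
            [1, 0, 0]]
# Feature cost: zero for matching a sentence to itself, one otherwise.
M = [[0.0 if i == j else 1.0 for j in range(3)] for i in range(3)]
# Fixed coupling: each sentence is matched to itself with mass 1/3.
T = [[1 / 3 if i == j else 0.0 for j in range(3)] for i in range(3)]

faithful_cost = fgw_cost(M, external, external, T)
unfaithful_cost = fgw_cost(M, external, internal, T)
print(faithful_cost, unfaithful_cost)
```

In this toy setting the cost is zero when the two graphs coincide and grows as their edges disagree, which mirrors the idea that unfaithful traces diverge from the model's computation. In practice one would minimize over couplings with an optimal transport solver rather than fix the coupling by hand.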

Xu Shen, Zhen Tan, Song Wang, Pingjun Hong, Rui Miao, Xin Wang, Tianlong Chen • 2026

Related benchmarks

Task | Dataset | Result | Rank
CoT faithfulness detection | Truthful QA | Accuracy: 78 | 12
CoT faithfulness detection | AQUA | Accuracy (CoT Faithfulness): 77 | 12
CoT faithfulness detection | Logic-QA | Accuracy: 69 | 11
CoT faithfulness detection | HLE Bio | Accuracy: 78 | 11
