Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

About

Large language models (LLMs) are becoming increasingly capable, but the mechanisms behind their thinking and decision-making remain unclear. Chain-of-thought (CoT) reasoning is commonly used to externalize LLMs' thinking, but it fails to accurately reflect their actual thinking process. Techniques based on LLMs' hidden representations offer an internal perspective that improves the monitorability of their latent thinking. However, previous methods only develop external modules rather than making the LLMs themselves easier to monitor. In this paper, we propose TELLME, a novel method that improves the transparency of LLMs and helps monitors identify unsuitable and sensitive behaviors. We showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement across multimodal test sets, distinct architectures, and varying parameter scales. We further analyze how TELLME improves LLMs' generalization ability from both optimal transport theory and empirical perspectives.

Guanxu Chen, Jing Shao, Tao Luo, Lijie Hu, Qihao Lin, Dongrui Liu • 2025

Related benchmarks

Task	Dataset	Result
Misalignment Detection	Taylor	Accuracy: 99	63
Multi-risk safety monitoring	Beavertails	Accuracy (%): 80.1	63
Safe-or-harmful binary classification	Beavertails	Accuracy: 84.6	63
Safety Evaluation	XSTest (test)	XSTest Score: 18.4	36
Safety Evaluation	SafeBench	Overall Safety Score: 99	19
General Capability Evaluation	General Capability Suite	Average Score: 71	12
Over-Safety Evaluation	XSTest (XST)	XST Over-Safety: 23.2	12
Safety Evaluation	BeaverTails (BT)	BT Score: 99.1	12
Safety Evaluation	StrongReject SB	SB Score: 99.1	12
General Capability Evaluation	General Capability Dataset	General Score: 66.7	10

Showing 10 of 11 rows
