Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

About

Large language models (LLMs) are becoming increasingly capable, but the mechanisms behind their thinking and decision-making remain unclear. Chain-of-thought (CoT) reasoning is commonly used to externalize LLMs' thinking, but it fails to accurately reflect their thinking process. Techniques based on LLMs' hidden representations offer an inner perspective that improves the monitorability of their latent thinking. However, previous methods only develop external modules rather than making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, that improves the transparency of LLMs and helps monitors identify unsuitable and sensitive behaviors. We showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement across multimodal test sets, distinct architectures, and varying parameter scales. We further analyze how TELLME improves LLMs' generalization ability from both optimal transport theory and empirical perspectives.

Guanxu Chen, Jing Shao, Tao Luo, Lijie Hu, Qihao Lin, Dongrui Liu • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Misalignment Detection | Taylor | Accuracy | 99 | 63
Multi-risk safety monitoring | Beavertails | Accuracy (%) | 80.1 | 63
Safe-or-harmful binary classification | Beavertails | Accuracy | 84.6 | 63
Safety Evaluation | XSTest (test) | XSTest Score | 18.4 | 36
Safety Evaluation | SafeBench | Overall Safety Score | 99 | 19
General Capability Evaluation | General Capability Suite | Average Score | 71 | 12
Over-Safety Evaluation | XSTest (XST) | XST Over-Safety | 23.2 | 12
Safety Evaluation | BeaverTails (BT) | BT Score | 99.1 | 12
Safety Evaluation | StrongReject SB | SB Score | 99.1 | 12
General Capability Evaluation | General Capability Dataset | General Score | 66.7 | 10
Showing 10 of 11 rows
