TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

About

LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.
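The idea of training with only a trajectory-level label while still reading out a risk score at every prefix can be illustrated with a small sketch. This is not the TRACES implementation: the running-average state, the linear scoring weights, the max-over-prefixes aggregation, and every name below (`prefix_risks`, `trajectory_loss`, `first_alarm`, `decay`, `threshold`) are assumptions chosen for illustration, and the hand-written step features stand in for the hidden representations of an observer LLM.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prefix_risks(step_features, weights, decay=0.5):
    """Return one risk score per prefix of a trajectory.

    Assumed toy model (not from the paper):
      state_t = decay * state_{t-1} + (1 - decay) * features_t
      risk_t  = sigmoid(weights . state_t)
    """
    state = [0.0] * len(weights)
    risks = []
    for feats in step_features:
        state = [decay * s + (1.0 - decay) * f for s, f in zip(state, feats)]
        risks.append(sigmoid(sum(w * s for w, s in zip(weights, state))))
    return risks


def trajectory_loss(risks, label):
    """Binary cross-entropy between the highest prefix risk and the single
    trajectory-level label (1 = unsafe, 0 = safe). No step labels are used."""
    p = min(max(max(risks), 1e-7), 1.0 - 1e-7)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def first_alarm(risks, threshold=0.5):
    """Index of the earliest prefix whose risk exceeds the threshold, else None."""
    for t, r in enumerate(risks):
        if r > threshold:
            return t
    return None


# Hypothetical 2-dimensional step features: the second feature marks benign
# steps, the first marks risky ones. The trajectory turns risky at step 2.
steps = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
weights = [4.0, -4.0]  # assumed, hand-set scoring weights

risks = prefix_risks(steps, weights)
print([round(r, 3) for r in risks])
print(first_alarm(risks))  # 2
```

In this sketch the only supervision signal is `trajectory_loss`, which sees one label per trajectory, yet `prefix_risks` still yields a score after every step, so an auditor could raise an alarm at step 2 rather than waiting for the final outcome.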

Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang • 2026

Related benchmarks

Task                         Dataset        Metric    Result  Rank
Agent Safety Auditing        ATBench        Accuracy  85.5    13
Agent Safety Auditing        ASSE-Safety    Accuracy  85.4    13
Agent Safety Auditing        ASSE-Security  Accuracy  97.6    13
Agent Safety Auditing        ASSE Strict    Accuracy  82.1    13
Failure Mode Prediction      ATBench        Accuracy  41      10
Real-world Harm Prediction   ATBench        Accuracy  39      10
Risk Source Prediction       ATBench        Accuracy  50      10