A Sober Look at Agentic Misalignment in Automated Workflows

About

We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to as agentic misalignment. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the intended human goals. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows. To address this issue, we propose Agentic Evidence Attribution (AEA), a novel alignment paradigm that improves agent posteriors using context-specific evidence. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration. To better understand the role of evidence, we study two instantiations of AEA: self-reflection (internal evidence from the model) and weak-to-strong generalization (external evidence on the agentic trajectory). We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence-based alignment can effectively improve agent collaboration and lead to reliable multi-agent systems built on automated workflows.

Wenqian Ye, Bo Yuan, Zhichao Xu, Yijun Tian, Yawei Wang, Henry Kautz, Aidong Zhang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reward Modeling | RewardBench | Chat Score 91.62 | 216 |
| Mathematical Reasoning | AIME 25 | Accuracy 46.6 | 112 |
| Reward Modeling Evaluation | RM-Bench | Chat Score 70.63 | 69 |
| Code Generation | HumanEval | HumanEval Accuracy 96.3 | 49 |
| Failure Attribution | Who&When | Agent Accuracy 60.79 | 22 |
| Tabular Data Analysis | DataBench | Accuracy 35.6 | 20 |
| Scientific Reasoning | SciBench Chemistry | Accuracy 73.2 | 20 |
| Scientific Reasoning | SciBench Physics | Accuracy 76.5 | 20 |