Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

About

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability, as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points. We call this phenomenon composition collapse: the systematic failure to assemble stably known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2–11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.
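
To make the change of estimand concrete, the following is a minimal sketch of how a gate-conditioned composition metric differs from an aggregate score. It is not the authors' implementation: the abstract does not specify the two gates, so the sketch assumes a single simplified gate (every hop answered correctly on every repeated trial). The names `ChainResult`, `atomically_stable`, and `residual_composition_failure`, as well as the `threshold` parameter, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChainResult:
    """One multi-hop question (hypothetical record layout)."""
    depth: int
    # atomic_trials[i] holds repeated correctness checks for hop i asked alone.
    atomic_trials: List[List[bool]]
    # Whether the model answered the full chained question correctly.
    chain_correct: bool


def atomically_stable(chain: ChainResult, threshold: float = 1.0) -> bool:
    """Assumed gate: every hop is answered correctly in at least
    `threshold` of its repeated single-fact trials."""
    return all(
        len(trials) > 0 and sum(trials) / len(trials) >= threshold
        for trials in chain.atomic_trials
    )


def aggregate_accuracy(chains: List[ChainResult]) -> float:
    """Ungated score: mixes missing knowledge with composition failure."""
    return sum(c.chain_correct for c in chains) / len(chains)


def residual_composition_failure(
    chains: List[ChainResult], threshold: float = 1.0
) -> Optional[float]:
    """Failure rate restricted to chains whose facts all pass the gate.
    Returns None when no chain passes, since the rate is then undefined."""
    gated = [c for c in chains if atomically_stable(c, threshold)]
    if not gated:
        return None
    return sum(not c.chain_correct for c in gated) / len(gated)


# Toy data, invented purely for illustration.
chains = [
    ChainResult(2, [[True, True], [True, True]], True),
    ChainResult(2, [[True, True], [True, True]], False),
    ChainResult(3, [[True, False], [True, True], [True, True]], False),
    ChainResult(3, [[True, True], [True, True], [True, True]], False),
]

print(aggregate_accuracy(chains))
print(residual_composition_failure(chains))
```

In this toy data the third chain fails the gate because one of its hops is unstable, so its failure is excluded from the residual rate; the remaining failures occur even though every constituent fact is reliably known, which is the kind of failure the abstract attributes to composition rather than to missing knowledge.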

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Jie Chen, Hongzhi Wang, Xuyang Teng, Meng Han • 2026

Related benchmarks

Task	Dataset	Result
RANK	D4 V2	--	12
Compositional Reasoning	Harder-set	--	6
Compositional Reasoning	D4 V2 (test)	--	4
Scientific-fact composition and temporal reasoning	E3 Cross-domain pilot	--	4
Short-chain composition	E2 jointly stable facts	--	4
SUCC.	D4 V2	--	4
