Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

About

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability, as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points. We call this phenomenon composition collapse: the systematic failure to assemble stably known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2 to 11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.
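The change of estimand described above can be illustrated with a minimal sketch. It is not the authors' protocol: the record layout (`ChainResult`, `atomic_trials`, `chain_correct`), the stability criterion (a fact counts as stable only if every repeated query is answered correctly), and the toy data are all assumptions made for illustration, and only a single atomic gate is modelled because the abstract does not specify the second gate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChainResult:
    # Hypothetical layout: atomic_trials[i][j] is True if atomic fact i
    # was answered correctly on repeated query j.
    atomic_trials: List[List[bool]]
    # True if the composed multi-hop question was answered correctly.
    chain_correct: bool


def atomically_stable(result: ChainResult) -> bool:
    """Assumed gate: every atomic fact is correct on every repeated query."""
    return all(all(trials) for trials in result.atomic_trials)


def aggregate_failure(results: List[ChainResult]) -> float:
    """Ungated failure rate over all chains (requires a non-empty list)."""
    return sum(not r.chain_correct for r in results) / len(results)


def residual_composition_failure(results: List[ChainResult]) -> Optional[float]:
    """Failure rate restricted to chains whose atomic facts pass the gate.

    Returns None when no chain passes the gate.
    """
    gated = [r for r in results if atomically_stable(r)]
    if not gated:
        return None
    return sum(not r.chain_correct for r in gated) / len(gated)


# Toy data (fabricated for illustration only, not results from the paper).
results = [
    ChainResult([[True, True], [True, True]], chain_correct=True),
    ChainResult([[True, True], [True, True]], chain_correct=False),
    ChainResult([[True, False], [True, True]], chain_correct=False),
    ChainResult([[True, True], [False, False]], chain_correct=False),
]

print(aggregate_failure(results))             # 0.75
print(residual_composition_failure(results))  # 0.5
```

In this toy case, two of the three failures involve chains with at least one unstable atomic fact, so they are excluded from the gated estimate; only the failure on a chain with fully stable facts counts as residual composition failure.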

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Jie Chen, Hongzhi Wang, Xuyang Teng, Meng Han • 2026

Related benchmarks

Task | Dataset | Result | Rank
RANK | D4 V2 | -- | 12
Compositional Reasoning | Harder-set | -- | 6
Compositional Reasoning | D4 V2 (test) | -- | 4
Scientific-fact composition and temporal reasoning | E3 Cross-domain pilot | -- | 4
Short-chain composition | E2 jointly stable facts | -- | 4
SUCC. | D4 V2 | -- | 4