
Induction Signatures Are Not Enough: A Matched-Compute Study of Load-Bearing Structure in In-Context Learning

About

Mechanism-targeted synthetic data is increasingly proposed as a way to steer pretraining toward desirable capabilities, but it remains unclear how such interventions should be evaluated. We study this question for in-context learning (ICL) under matched compute (iso-FLOPs) using Bi-Induct, a lightweight data rewrite that interleaves short directional copy snippets into a natural pretraining stream: forward-copy (induction), backward-copy (anti-induction, as a directional control), or a balanced mix. Across 0.13B-1B decoder-only models, we evaluate (i) few-shot performance on standard LM benchmarks and function-style ICL probes, (ii) head-level copy telemetry, and (iii) held-out perplexity as a guardrail. Bi-Induct reliably increases induction-head activity, but this does not translate into consistent improvements in few-shot generalization: on standard LM benchmarks, Bi-Induct is largely performance-neutral relative to natural-only training, while on function-style probes the 1B natural-only model performs best. Despite explicit backward-copy cues, anti-induction scores remain near zero across scales, revealing a strong forward/backward asymmetry. Targeted ablations show a sharper distinction: removing the top 2% induction heads per layer harms ICL more than matched random ablations, with the largest relative drop occurring in the natural-only models. This indicates that natural-only training produces more centralized, load-bearing induction circuitry, whereas Bi-Induct tends to create more distributed and redundant induction activity. Our main conclusion is that eliciting a mechanism is not the same as making it load-bearing. For data-centric foundation model design, this suggests that synthetic data interventions should be evaluated not only by signature amplification, but by whether they create causally necessary computation while preserving natural-data modeling quality.
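To make the abstract's description of the rewrite concrete, here is a minimal Python sketch of interleaving directional copy snippets into a natural token stream. This is illustrative only, not the authors' implementation: the function names, the snippet_rate and snippet_len parameters, and the toy token vocabulary are all assumptions; iso-FLOPs matching would be enforced separately by truncating each variant to the same token budget.

```python
import random

def make_copy_snippet(vocab, snippet_len=8, direction="forward", rng=random):
    """Build a short directional copy snippet: a random token span followed by
    either the same span (forward-copy / induction cue) or its reverse
    (backward-copy / anti-induction cue). Parameters are illustrative."""
    span = [rng.choice(vocab) for _ in range(snippet_len)]
    echo = span if direction == "forward" else list(reversed(span))
    return span + echo

def interleave_snippets(natural_stream, vocab, snippet_rate=0.01,
                        mix=("forward",), rng=random):
    """Interleave copy snippets into a natural token stream at a small rate.
    `mix` selects forward-only, backward-only, or a balanced mixture,
    mirroring the three Bi-Induct variants described in the abstract."""
    out = []
    for tok in natural_stream:
        out.append(tok)
        if rng.random() < snippet_rate:
            direction = rng.choice(mix)
            out.extend(make_copy_snippet(vocab, direction=direction, rng=rng))
    return out

# Example: a balanced forward/backward mix on a toy integer-token stream.
vocab = list(range(1000))
natural = [random.choice(vocab) for _ in range(10_000)]
rewritten = interleave_snippets(natural, vocab, mix=("forward", "backward"))
```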

Mohammed Sabry, Anya Belz • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-task Language Understanding | MMLU (test) | -- | -- | 76 |
| Question Answering | TriviaQA Wiki (val) | Exact Match (EM) | 30 | 52 |
| Language Modeling | The Pile (eval) | Perplexity (PPL) | 14.9 | 12 |
| In-Context Learning Aggregate Evaluation | ICL composite (Standard Benchmarks) | Macro Accuracy | 24.3 | 4 |
| In-Context Learning Aggregate Evaluation for Probes | ICL composite (Probes) | Macro Accuracy | 15.2 | 4 |
| Function-style In-Context Learning Probes | Function-style Probes | Accuracy (Alphabetically First, k=3) | 15.3 | 2 |
