
Induction Signatures Are Not Enough: A Matched-Compute Study of Load-Bearing Structure in In-Context Learning

About

Mechanism-targeted synthetic data is increasingly proposed as a way to steer pretraining toward desirable capabilities, but it remains unclear how such interventions should be evaluated. We study this question for in-context learning (ICL) under matched compute (iso-FLOPs) using Bi-Induct, a lightweight data rewrite that interleaves short directional copy snippets into a natural pretraining stream: forward-copy (induction), backward-copy (anti-induction, as a directional control), or a balanced mix. Across 0.13B-1B decoder-only models, we evaluate (i) few-shot performance on standard LM benchmarks and function-style ICL probes, (ii) head-level copy telemetry, and (iii) held-out perplexity as a guardrail. Bi-Induct reliably increases induction-head activity, but this does not translate into consistent improvements in few-shot generalization: on standard LM benchmarks, Bi-Induct is largely performance-neutral relative to natural-only training, while on function-style probes the 1B natural-only model performs best. Despite explicit backward-copy cues, anti-induction scores remain near zero across scales, revealing a strong forward/backward asymmetry. Targeted ablations show a sharper distinction: removing the top 2% induction heads per layer harms ICL more than matched random ablations, with the largest relative drop occurring in the natural-only models. This indicates that natural-only training produces more centralized, load-bearing induction circuitry, whereas Bi-Induct tends to create more distributed and redundant induction activity. Our main conclusion is that eliciting a mechanism is not the same as making it load-bearing. For data-centric foundation model design, this suggests that synthetic data interventions should be evaluated not only by signature amplification, but by whether they create causally necessary computation while preserving natural-data modeling quality.
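To make the abstract's description of the rewrite concrete, here is a minimal Python sketch of interleaving directional copy snippets into a natural token stream. This is illustrative only, not the authors' implementation: the function names, the snippet_rate and snippet_len parameters, and the toy token vocabulary are all assumptions; iso-FLOPs matching would be enforced separately by truncating each variant to the same token budget.

```python
import random

def make_copy_snippet(vocab, snippet_len=8, direction="forward", rng=random):
    """Build a short directional copy snippet: a random token span followed by
    either the same span (forward-copy / induction cue) or its reverse
    (backward-copy / anti-induction cue). Parameters are illustrative."""
    span = [rng.choice(vocab) for _ in range(snippet_len)]
    echo = span if direction == "forward" else list(reversed(span))
    return span + echo

def interleave_snippets(natural_stream, vocab, snippet_rate=0.01,
                        mix=("forward",), rng=random):
    """Interleave copy snippets into a natural token stream at a small rate.
    `mix` selects forward-only, backward-only, or a balanced mixture,
    mirroring the three Bi-Induct variants described in the abstract."""
    out = []
    for tok in natural_stream:
        out.append(tok)
        if rng.random() < snippet_rate:
            direction = rng.choice(mix)
            out.extend(make_copy_snippet(vocab, direction=direction, rng=rng))
    return out

# Example: a balanced forward/backward mix on a toy integer-token stream.
vocab = list(range(1000))
natural = [random.choice(vocab) for _ in range(10_000)]
rewritten = interleave_snippets(natural, vocab, mix=("forward", "backward"))
```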

Mohammed Sabry, Anya Belz • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-task Language Understanding | MMLU (test) | -- | -- | 76 |
| Question Answering | TriviaQA Wiki (val) | Exact Match (EM) | 30 | 52 |
| Language Modeling | The Pile (eval) | Perplexity (PPL) | 14.9 | 12 |
| In-Context Learning Aggregate Evaluation | ICL composite (Standard Benchmarks) | Macro Accuracy | 24.3 | 4 |
| In-Context Learning Aggregate Evaluation for Probes | ICL composite (Probes) | Macro Accuracy | 15.2 | 4 |
| Function-style In-Context Learning Probes | Function-style Probes | Accuracy (Alphabetically First, k=3) | 15.3 | 2 |
