Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights
About
Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. We show that in small models, such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models, the density of task experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples $N$ parameter perturbations at random, selects the top $K$, and ensembles their predictions via majority vote. Despite its simplicity, this approach is competitive with standard post-training methods such as PPO, GRPO, and evolution strategies (ES) for contemporary large-scale models.
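The sample-then-vote recipe is simple enough to sketch end to end. Below is a minimal toy illustration in Python/NumPy: a linear classifier stands in for the pretrained model, and the noise scale `sigma`, the data, and all hyperparameters are invented for illustration. Only the $N$-sample, top-$K$, majority-vote loop reflects the method described above; everything else is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a "pretrained" model: a linear classifier on synthetic
# binary data. (Hypothetical setup; the paper targets large language models.)
d, n_train, n_val, n_test = 32, 512, 256, 256
w_true = rng.normal(size=d)
X = rng.normal(size=(n_train + n_val + n_test, d))
y = (X @ w_true > 0).astype(int)
X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
X_test, y_test = X[-n_test:], y[-n_test:]

# "Pretrained" weights: a noisy version of the true solution.
w_pre = w_true + 0.5 * rng.normal(size=d)

def predict(w, X):
    return (X @ w > 0).astype(int)

def accuracy(w, X, y):
    return (predict(w, X) == y).mean()

# Step 1: sample N random perturbations around the pretrained weights.
N, K, sigma = 200, 16, 0.1  # candidate count, ensemble size, noise scale (all assumed)
candidates = [w_pre + sigma * rng.normal(size=d) for _ in range(N)]

# Step 2: score every candidate on held-out validation data.
# Each evaluation is independent, so this loop is fully parallelizable.
scores = np.array([accuracy(w, X_val, y_val) for w in candidates])

# Step 3: keep the top-K candidates by validation score.
top_k = [candidates[i] for i in np.argsort(scores)[-K:]]

# Step 4: ensemble the top-K test predictions by majority vote.
votes = np.stack([predict(w, X_test) for w in top_k])  # shape (K, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)

print(f"pretrained acc: {accuracy(w_pre, X_test, y_test):.3f}")
print(f"ensemble acc:   {(majority == y_test).mean():.3f}")
```

Note that, unlike gradient descent or PPO/GRPO, no step of this loop is sequential: sampling, scoring, and voting can all run in parallel, which is what the abstract means by "fully parallel post-training".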
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy | 73.7 | 391 |
| Mathematical Reasoning | GSM8K | Accuracy | 89.5 | 303 |
| Mathematical Reasoning | Countdown | Accuracy | 85 | 126 |
| Coding | MBPP | Accuracy | 75.9 | 95 |
| Mathematical Reasoning | OlyBench | Accuracy | 39.2 | 59 |
| Writing | ROCStories | Accuracy | 64.5 | 48 |
| Chemistry | USPTO | Accuracy | 44.3 | 48 |
| Visual Reasoning | GQA (test) | Accuracy | 69 | 14 |