
Target-Oriented Pretraining Data Selection via Neuron-Activated Graph

About

Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target-oriented language model (LM) pretraining by introducing Neuron-Activated Graph Ranking (NAG-based Ranking), a training-free and interpretable framework for target pretraining data selection. Rather than using black-box representations, our approach directly characterizes each target input by a sparse set of high-impact neurons in any off-the-shelf LLM. Concretely, we quantify neuron impact, select the most influential neurons across layers into a compact Neuron-Activated Graph (NAG), and rank candidate data by NAG similarity to target examples. We conduct experiments across six benchmarks, where our NAG-based Ranking improves target-oriented pretraining by 4.9% on average over random sampling and also outperforms state-of-the-art baselines by 5.3% accuracy on HellaSwag. It also remains effective under a more applicable multi-target setting, where our best setup surpasses two baselines by 1.1% and 4.1%, respectively. Furthermore, we provide a comprehensive analysis of why and how our NAG works: for example, deactivating NAG-selected neurons (only 0.12% of all neurons) causes a 23.5% performance collapse, and restricting NAG to the final layer incurs a 4.1% average drop, indicating that NAG captures a sparse "functional backbone" for learning target features. We release the code at https://github.com/asillycat/NAG.

Zijun Wang, Haoqin Tu, Weidong Zhou, Yiyang Zhou, Xiaohuan Zhou, Bingni Zhang, Weiguo Feng, Taifeng Wang, Cihang Xie, Fengze Liu • 2026
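
The select-then-rank idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the released code for that); it assumes a per-neuron impact score has already been computed for each input, keeps the top-k neurons per layer as that input's NAG, and uses Jaccard overlap as a stand-in similarity. The function names, the top-k-per-layer rule, the Jaccard measure, and the toy scores are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: the impact scores, top-k-per-layer selection rule,
# and Jaccard similarity are assumptions, not the method defined in the paper.

def build_nag(impact, k):
    """Keep the k highest-impact neurons in each layer.

    impact: list of layers, each a list of per-neuron impact scores.
    Returns a set of (layer_index, neuron_index) pairs.
    """
    nag = set()
    for layer, scores in enumerate(impact):
        top = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)[:k]
        nag.update((layer, n) for n in top)
    return nag


def nag_similarity(a, b):
    """Jaccard overlap between two neuron sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(target_nags, candidate_nags):
    """Score each candidate by its mean similarity to the target NAGs.

    Returns (order, scores), where order lists candidate indices from best to worst.
    """
    scores = [
        sum(nag_similarity(c, t) for t in target_nags) / len(target_nags)
        for c in candidate_nags
    ]
    order = sorted(range(len(candidate_nags)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy impact scores for a 2-layer, 4-neuron model (made-up numbers).
target = build_nag([[0.9, 0.1, 0.8, 0.0], [0.0, 0.7, 0.1, 0.6]], k=2)
cand_same = build_nag([[0.9, 0.1, 0.8, 0.0], [0.0, 0.7, 0.1, 0.6]], k=2)
cand_disjoint = build_nag([[0.0, 0.9, 0.1, 0.8], [0.9, 0.0, 0.8, 0.1]], k=2)
cand_partial = build_nag([[0.9, 0.8, 0.1, 0.0], [0.0, 0.7, 0.1, 0.6]], k=2)

order, scores = rank_candidates([target], [cand_disjoint, cand_same, cand_partial])
print(order, scores)
```

In this toy run the candidate whose neuron set matches the target ranks first and the disjoint one ranks last. In practice the impact scores would come from a real LLM, and the selection and similarity rules should follow the paper and its repository.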

Related benchmarks

Task                          Dataset         Result          Rank
Reasoning                     ARC Challenge   Accuracy 35     81
Story Reasoning               XStoryCloze     Accuracy 70.8   51
Factual Question Answering    TriviaQA        Accuracy 22.6   46
Commonsense Reasoning         HellaSwag       Accuracy 60.6   19
Commonsense Reasoning         XWinograd       Accuracy 80.6   13
Multiple-Choice Reasoning     MMLU            Accuracy 32.2   10
