Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

About

Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. In cybersecurity, however, open-source datasets remain scarce, and high-quality cybersecurity pretraining corpora are especially lacking, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pretraining on our dataset yields a 15.9% improvement in the aggregate score, while reasoning distillation leads to a 15.8% gain in security certification (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243.

Yao-Ching Yu, Tsun-Han Chiang, Cheng-Wei Tsai, Chien-Ming Huang, Wen-Kwang Tsao • 2025

Related benchmarks

Task	Dataset	Metric	Result
Reasoning	HellaSwag (HS)	HellaSwag Accuracy	31.28	209
Mathematics	MATH	MATH Accuracy	35.3	172
Reasoning	WinoGrande (WG)	Accuracy	40.41	172
Knowledge	MMLU	Accuracy	45.55	171
Commonsense Reasoning	SocialIQA	Accuracy	27.02	164
Commonsense Reasoning	CommonsenseQA	Accuracy (pass@1)	40.46	108
Commonsense Reasoning	PIQA	Accuracy	46.57	100
Story completion	StoryCloze	Accuracy	38.45	80
Mathematical Reasoning	TheoremQA	Accuracy	5.37	67
Reading Comprehension	SQuAD v2	F1 Score	38.55	38

Showing 10 of 44 rows
