Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

About

Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, with a particular lack of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pre-training on our dataset yields a 15.9% improvement in the aggregate score, while reasoning distillation leads to a 15.8% gain in security certification (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243.

Yao-Ching Yu, Tsun-Han Chiang, Cheng-Wei Tsai, Chien-Ming Huang, Wen-Kwang Tsao • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Cybersecurity Knowledge Question Answering | MMLU CSec | CSec Score: 79 | 17 |
| Cybersecurity Evaluation | ScEva | MCQ Score: 61.15 | 17 |
| Cybersecurity Knowledge and Malware Extraction Analysis | SECURE | KCV: 82.65 | 17 |
| General Language Understanding and Reasoning | Open LLM Leaderboard Lighteval (test) | Mean Accuracy: 66.71 | 17 |
| Cybersecurity Knowledge Evaluation | CyMtc (500) | CyMtc (500) Score: 83.8 | 17 |
| Cybersecurity Benchmarking | ScBen En | En: 64.91 | 17 |
| Cybersecurity Multiple Choice Question Answering | RedSage-MCQ 0-shot (test) | Macro Accuracy: 77.02 | 17 |
| Overall Cybersecurity Performance | Cybersecurity Multi-Benchmark Suite | Overall Mean Score: 71.69 | 17 |
| Cybersecurity Threat Intelligence Analysis | CTI-Bench | MCQ Score: 55.92 | 17 |
| Cybersecurity Evaluation | Cybersecurity Benchmarks CTI-MCQ, CyberMetric, SecEval (test) | CTI-MCQ Score: 66.6 | 5 |
