Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
About
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. In cybersecurity, however, open-source datasets remain scarce, and high-quality cybersecurity pretraining corpora are especially lacking, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages: pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pretraining on our dataset yields a 15.9% improvement in the aggregate score, while reasoning distillation leads to a 15.8% gain in security certification (CISSP) performance. We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning | HellaSwag (HS) | HellaSwag Accuracy | 31.28 | 209 |
| Reasoning | WinoGrande (WG) | Accuracy | 40.41 | 168 |
| Knowledge | MMLU | Accuracy | 45.55 | 161 |
| Commonsense Reasoning | SocialIQA | Accuracy | 27.02 | 158 |
| Mathematics | MATH | MATH Accuracy | 35.3 | 136 |
| Commonsense Reasoning | CommonsenseQA | Accuracy (pass@1) | 40.46 | 108 |
| Commonsense Reasoning | PIQA | Accuracy | 46.57 | 100 |
| Story completion | StoryCloze | Accuracy | 38.45 | 80 |
| Mathematical Reasoning | TheoremQA | Accuracy | 5.37 | 64 |
| Chinese Knowledge | CEval | Accuracy | 36.8 | 28 |