Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report
About
We present Foundation-Sec-8B-Reasoning, the first open-source native reasoning model for cybersecurity. Built upon our previously released Foundation-Sec-8B base model (derived from Llama-3.1-8B-Base), the model is trained in a two-stage process combining supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR). Our training leverages proprietary reasoning data spanning cybersecurity analysis, instruction following, and mathematical reasoning. Evaluation across 10 cybersecurity benchmarks and 10 general-purpose benchmarks shows performance competitive with significantly larger models on cybersecurity tasks while maintaining strong general capabilities. The model also generalizes effectively to multi-hop reasoning tasks and exhibits strong safety behavior when deployed with appropriate system prompts and guardrails. These results demonstrate that an 8B-parameter domain-specialized reasoning model can excel at specialized tasks without sacrificing broad competence. We release the model publicly at https://huggingface.co/fdtn-ai/Foundation-Sec-8B-Reasoning.
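As a quick-start sketch, the released checkpoint can be loaded with Hugging Face `transformers` like any Llama-3.1-based chat model. The system prompt and generation parameters below are illustrative placeholders, not the official recommended configuration, and the heavyweight model loading is deferred into the `generate` helper:

```python
MODEL_ID = "fdtn-ai/Foundation-Sec-8B-Reasoning"


def build_messages(question: str) -> list[dict]:
    """Assemble a chat-style message list; the system prompt here is a placeholder."""
    return [
        {"role": "system", "content": "You are a cybersecurity analysis assistant."},
        {"role": "user", "content": question},
    ]


def generate(question: str, max_new_tokens: int = 512) -> str:
    """Load the model and run one round of chat-style generation."""
    # Imports are kept local so that helpers above stay importable without
    # transformers/torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer.apply_chat_template(
        build_messages(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("Summarize the impact of CVE-2021-44228 (Log4Shell)."))
```

For reasoning models of this kind, leaving headroom in `max_new_tokens` matters, since the model emits its chain of thought before the final answer.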
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning | BBH | Accuracy | 69.9 | 507 |
| Mathematical Reasoning | GSM8K | Accuracy | 82.3 | 358 |
| Instruction Following | IFEval | -- | -- | 292 |
| Instruction Following | AlpacaEval 2.0 | LC Win Rate | 62.6 | 281 |
| Multi-hop Question Answering | 2WikiMultihopQA | -- | -- | 278 |
| Knowledge | MMLU | Accuracy | 68.3 | 71 |
| Mathematical Reasoning | MATH | Score | 0.433 | 50 |
| Knowledge | GPQA | Accuracy | 31.7 | 34 |
| Coding | HumanEval | Mean Score | 0.799 | 28 |
| Long-context Question Answering | HotpotQA | Mean Score | 54.8 | 21 |