HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing
About
We introduce HackSynth, a novel Large Language Model (LLM)-based agent capable of autonomous penetration testing. HackSynth's dual-module architecture includes a Planner and a Summarizer, which enable it to generate commands and process feedback iteratively. To benchmark HackSynth, we propose two new Capture The Flag (CTF)-based benchmark sets utilizing the popular platforms PicoCTF and OverTheWire. These benchmarks include two hundred challenges across diverse domains and difficulties, providing a standardized framework for evaluating LLM-based penetration testing agents. Based on these benchmarks, extensive experiments are presented, analyzing the core parameters of HackSynth, including creativity (temperature and top-p) and token utilization. Multiple open source and proprietary LLMs were used to measure the agent's capabilities. The experiments show that the agent performed best with the GPT-4o model, better than what the GPT-4o's system card suggests. We also discuss the safety and predictability of HackSynth's actions. Our findings indicate the potential of LLM-based agents in advancing autonomous penetration testing and the importance of robust safeguards. HackSynth and the benchmarks are publicly available to foster research on autonomous cybersecurity solutions.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Robotic Capture-the-Flag | CTF 1 | Success Count4 | 8 | |
| Robotic Capture-the-Flag | CTF-0 1.0 (test) | Success Count5 | 8 | |
| Robotic Capture-the-Flag | CTF-2 | Success Count0.00e+0 | 8 | |
| Robotic Capture-the-Flag | CTF-3 | Success Count0.00e+0 | 8 |