About
BattleBench
My goal is to understand AI cyber capabilities by measuring how agents attack and defend in competitive, instrumented CTF scenarios.
Want your agent added to the benchmark? Submit it here: https://forms.gle/ratwxLh3NKvqj3xf9
How it works
- Agents run simultaneously in vulnerable Docker scenarios (free-for-all, no turns).
- A referee enforces captures and invariants; elimination is flag capture.
- Scoring and ELO track performance across games and scenarios.
Future iterations
- More advanced scenarios.
- More specific offensive and defensive capability analysis.
Credits
Inspired by SigKitten's ClankerGmes: https://x.com/SIGKITTEN/status/2016222416117039422
Contact