knewstimek/crackme-agent-bench
Reverse engineering benchmark for AI agents -- test your agent's ability to use debuggers and static analysis tools to crack multi-stage binaries
Platform-specific configuration:

```json
{
  "mcpServers": {
    "crackme-agent-bench": {
      "command": "npx",
      "args": ["-y", "crackme-agent-bench"]
    }
  }
}
```

Add the config above to `.claude/settings.json` under the `mcpServers` key.
A collection of crackme binaries designed to test an AI agent's ability to autonomously reverse engineer and extract hidden flags using debugger and static analysis tools.
Unlike traditional CTF challenges aimed at humans, these are specifically crafted to evaluate how effectively AI agents can drive such tools on their own.
As AI agents gain access to development tools via protocols like MCP (Model Context Protocol), we need benchmarks that go beyond text-based reasoning. Reverse engineering is a uniquely demanding task -- it requires tool mastery, low-level systems knowledge, and creative problem-solving all at once.
This benchmark answers the question: "Can your AI agent actually use a debugger to solve a real reverse engineering challenge?"
These challenges are designed to be solved using MCP-based tools:
| Tool | Description |
|------|-------------|
| veh-debugger | VEH-based in-process debugger exposed as MCP tools (launch, breakpoints, step, memory read/write, registers, disassembly) |
| agent-tool | General-purpose MCP toolkit with static binary analysis (PE info, disassembly, strings, hex dump, entropy) |
You can also use any other debugger or analysis tool your agent has access to.
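As an illustration of the kind of static check agent-tool's entropy feature performs, here is a minimal Shannon-entropy sketch (the function name and the example data are my own, not the toolkit's API):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 = constant data, 8.0 = uniformly random)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# Repetitive data scores near 0; packed or encrypted sections score near 8.
print(shannon_entropy(b"AAAA" * 64))        # constant bytes -> 0.0
print(shannon_entropy(bytes(range(256))))   # every byte once -> 8.0
```

High-entropy regions in a crackme are a common hint that a stage is packed or encrypted and must be unpacked at runtime before it can be disassembled.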
| Challenge | Difficulty | Architecture | Description |
|-----------|-----------|--------------|-------------|
| v1 | 0.5 / 10.0 | x86 / x64 | Multi-stage crackme with anti-debug, stage chaining, and memory-only flag |
Difficulty is rated on a 0-10 scale from an AI agent's perspective (not a human's).
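To make "stage chaining" and "memory-only flag" concrete, here is a minimal illustrative sketch (not the actual binary's scheme; all keys and names are invented) of why static analysis alone cannot recover such a flag:

```python
def xor_cipher(blob: bytes, key: bytes) -> bytes:
    """XOR with a repeating key; encoding and decoding are the same operation."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(blob))

# Hypothetical two-stage layout: stage 1 decodes the key that stage 2 needs,
# so the plaintext flag never appears in the binary's static data.
stage1_key  = b"\x5a"
stage2_key  = b"stage1-output"
flag        = b"FLAG{memory-only}"

stage1_blob = xor_cipher(stage2_key, stage1_key)   # what ships in the binary
stage2_blob = xor_cipher(flag, stage2_key)         # what ships in the binary

# An agent must execute (or emulate) stage 1 to recover stage 2's key,
# then decode stage 2 -- running `strings` on the file reveals nothing.
recovered_key = xor_cipher(stage1_blob, stage1_key)
print(xor_cipher(stage2_blob, recovered_key).decode())  # FLAG{memory-only}
```

This is why the challenges reward debugger use (breakpoints, memory reads) over purely static inspection: the flag only materializes in process memory after the stage chain runs.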