knewstimek/crackme-agent-bench
Reverse engineering benchmark for AI agents -- test your agent's ability to use debuggers and static analysis tools to crack multi-stage binaries
Platform-specific configuration:

```json
{
  "mcpServers": {
    "crackme-agent-bench": {
      "command": "npx",
      "args": ["-y", "crackme-agent-bench"]
    }
  }
}
```

Add the config above to `.claude/settings.json` under the `mcpServers` key.
A collection of crackme binaries designed to test an AI agent's ability to autonomously reverse engineer and extract hidden flags using debugger and static analysis tools.
Unlike traditional CTF challenges aimed at humans, these are specifically crafted to evaluate how effectively AI agents can drive such tools on their own.
As AI agents gain access to development tools via protocols like MCP (Model Context Protocol), we need benchmarks that go beyond text-based reasoning. Reverse engineering is a uniquely demanding task -- it requires tool mastery, low-level systems knowledge, and creative problem-solving all at once.
This benchmark answers the question: "Can your AI agent actually use a debugger to solve a real reverse engineering challenge?"
These challenges are designed to be solved using MCP-based tools:
| Tool | Description |
|------|-------------|
| veh-debugger | VEH-based in-process debugger exposed as MCP tools (launch, breakpoints, step, memory read/write, registers, disassembly) |
| agent-tool | General-purpose MCP toolkit with static binary analysis (PE info, disassembly, strings, hex dump, entropy) |
You can also use any other debugger or analysis tool your agent has access to.
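As an illustration of the kind of static check agent-tool's entropy feature performs, here is a minimal Shannon-entropy sketch (the function name and the example data are my own, not the toolkit's API):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 = constant data, 8.0 = uniformly random)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# Repetitive data scores near 0; packed or encrypted sections score near 8.
print(shannon_entropy(b"AAAA" * 64))        # constant bytes -> 0.0
print(shannon_entropy(bytes(range(256))))   # every byte once -> 8.0
```

High-entropy regions in a crackme are a common hint that a stage is packed or encrypted and must be unpacked at runtime before it can be disassembled.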
| Challenge | Difficulty | Architecture | Description |
|-----------|-----------|--------------|-------------|
| v1 | 0.5 / 10.0 | x86 / x64 | Multi-stage crackme with anti-debug, stage chaining, and memory-only flag |
Difficulty is rated on a 0-10 scale from an AI agent's perspective (not a human's).
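To make "stage chaining" and "memory-only flag" concrete, here is a minimal illustrative sketch (not the actual binary's scheme; all keys and names are invented) of why static analysis alone cannot recover such a flag:

```python
def xor_cipher(blob: bytes, key: bytes) -> bytes:
    """XOR with a repeating key; encoding and decoding are the same operation."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(blob))

# Hypothetical two-stage layout: stage 1 decodes the key that stage 2 needs,
# so the plaintext flag never appears in the binary's static data.
stage1_key  = b"\x5a"
stage2_key  = b"stage1-output"
flag        = b"FLAG{memory-only}"

stage1_blob = xor_cipher(stage2_key, stage1_key)   # what ships in the binary
stage2_blob = xor_cipher(flag, stage2_key)         # what ships in the binary

# An agent must execute (or emulate) stage 1 to recover stage 2's key,
# then decode stage 2 -- running `strings` on the file reveals nothing.
recovered_key = xor_cipher(stage1_blob, stage1_key)
print(xor_cipher(stage2_blob, recovered_key).decode())  # FLAG{memory-only}
```

This is why the challenges reward debugger use (breakpoints, memory reads) over purely static inspection: the flag only materializes in process memory after the stage chain runs.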