Vexp-ai/vexp-swe-bench
Open benchmark for AI coding agents on SWE-bench Verified. Compare resolution rates, cost, and unique wins.
Platform-specific configuration:
{
  "mcpServers": {
    "vexp-swe-bench": {
      "command": "npx",
      "args": [
        "-y",
        "vexp-swe-bench"
      ]
    }
  }
}

Add the config above to `.claude/settings.json` under the `mcpServers` key.
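If the settings file already lists other MCP servers, the entry has to be merged in rather than pasted over the file. A minimal Python sketch, assuming the settings file is plain JSON; the helper names `add_mcp_server` and `update_settings_file` are hypothetical and not part of vexp-swe-bench:

```python
import json
from pathlib import Path

# Server entry copied from the configuration above.
VEXP_ENTRY = {"command": "npx", "args": ["-y", "vexp-swe-bench"]}


def add_mcp_server(settings: dict, name: str, entry: dict) -> dict:
    """Return a copy of settings with entry stored under mcpServers[name].

    Hypothetical helper for illustration; servers already present are kept.
    """
    merged = dict(settings)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


def update_settings_file(path: Path) -> None:
    """Read a JSON settings file if it exists, merge the entry, write it back."""
    settings = json.loads(path.read_text()) if path.exists() else {}
    merged = add_mcp_server(settings, "vexp-swe-bench", VEXP_ENTRY)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged, indent=2) + "\n")
```

Calling `update_settings_file(Path(".claude/settings.json"))` would then add the server while leaving any other keys in the file untouched.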
The open benchmark for AI coding agents — compare resolution rates, cost, and speed on real-world GitHub issues from SWE-bench Verified.
Benchmark any coding agent (Claude Code, Codex, Cursor, Augment, Windsurf, OpenHands, and more) on a curated 100-task subset of SWE-bench Verified. Captures pass@1 resolution rates, cost per task, duration, and token usage.
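To make these metrics concrete, here is a sketch of how pass@1 and the per-task averages could be computed from per-task records. With one attempt per task, pass@1 is the fraction of tasks resolved. The record fields (`resolved`, `cost_usd`, `duration_s`) are assumptions for illustration; the benchmark's actual output schema is not shown here and may differ.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Field names are hypothetical, not the benchmark's actual schema.
    instance_id: str
    resolved: bool
    cost_usd: float
    duration_s: float


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate one attempt per task into pass@1 and per-task averages."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "pass_at_1": sum(r.resolved for r in results) / n,
        "cost_per_task": sum(r.cost_usd for r in results) / n,
        "mean_duration_s": sum(r.duration_s for r in results) / n,
    }
```

On a 100-task subset, 73 resolved tasks would give a `pass_at_1` of 0.73 under this definition.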
Default configuration: Claude Code + [vexp](https://vexp.dev) — context-aware code intelligence that delivers the highest resolution rate at the lowest cost per task.
Evaluated on a 100-task subset of SWE-bench Verified. All agents use Claude Opus 4.5 for a fair, apples-to-apples comparison.
| Agent | Pass@1 | $/task | Unique Wins |
|-------|--------|--------|-------------|
| vexp + Claude Code | 73.0% | $0.67 | 7–10 |
| Live-SWE-Agent | 72.0% | $0.86 | - |
| OpenHands | 70.0% | $1.77 | - |
| Sonar Foundation | 70.0% | $1.98 | - |
> vexp resolves more issues at the lowest cost per task — 22% cheaper than the next best agent.
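The 22% figure follows from the $/task column: the next best agent by Pass@1, Live-SWE-Agent, costs $0.86 per task against $0.67 for vexp + Claude Code. A short check of that arithmetic (the function name is only illustrative):

```python
def relative_saving(cost: float, baseline: float) -> float:
    """Fraction by which cost is cheaper than baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - cost) / baseline


# Values taken from the $/task column of the results table.
vexp_cost = 0.67
live_swe_agent_cost = 0.86

saving = relative_saving(vexp_cost, live_swe_agent_cost)
print(f"{saving:.1%}")  # prints 22.1%
```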
Generate comparison charts: `node dist/cli.js compare results/swebench-2026-03-22.jsonl`
External resolution data sourced from swe-bench/experiments. Cost data sourced from each agent's published benchmarks (see data sources below).
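The Unique Wins column can be read as a set difference over resolved tasks. A minimal sketch, assuming a unique win means a task that one agent resolves and no other compared agent does (this README does not spell out the definition); the agent names and instance IDs below are made up for illustration:

```python
def unique_wins(resolved_by_agent: dict[str, set[str]], agent: str) -> set[str]:
    """Tasks resolved by `agent` and by no other agent in the mapping."""
    others: set[str] = set()
    for name, resolved in resolved_by_agent.items():
        if name != agent:
            others |= resolved
    return resolved_by_agent[agent] - others


# Made-up instance IDs, not real benchmark results.
example = {
    "agent_a": {"task-1", "task-2", "task-3"},
    "agent_b": {"task-2", "task-4"},
    "agent_c": {"task-3"},
}
print(sorted(unique_wins(example, "agent_a")))  # prints ['task-1']
```

Under this reading, the count depends on which agents are included in the comparison set.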
git clone https://github.com/Vexp-ai/vexp-swe-bench.git
cd vexp-swe-bench
# One command setup (Python >= 3.10, Node >= 18, Git required)
./setup.sh
# Run the benchmark
source .venv/bin/activate
node dist/cli.js run

The setup script handles Node dependencies, the Python venv, pip packages, the SWE-bench Verified dataset download, 100-task subset generation, and the TypeScript build.
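The version requirements noted in the setup comment (Python >= 3.10, Node >= 18, Git) can be checked before running `./setup.sh`. A small sketch, assuming `node --version` prints a string such as `v18.17.0`; the helper names are hypothetical, and this is not a description of what `setup.sh` itself does:

```python
import re
import shutil
import subprocess
import sys


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the leading numeric version, e.g. 'v18.17.0' -> (18, 17, 0)."""
    match = re.search(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(group) for group in match.groups() if group is not None)


def meets(version: str, minimum: tuple[int, ...]) -> bool:
    """True if version is at least minimum (tuple comparison)."""
    return parse_version(version) >= minimum


def check_prerequisites() -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    if shutil.which("git") is None:
        problems.append("git was not found on PATH")
    node = shutil.which("node")
    if node is None:
        problems.append("node was not found on PATH")
    else:
        out = subprocess.run([node, "--version"], capture_output=True, text=True)
        if not meets(out.stdout, (18,)):
            problems.append("Node >= 18 is required")
    return problems
```

Printing the result of `check_prerequisites()` before setup gives a quick list of anything missing.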
> Note: vexp Pro or Team plan is required to run with vexp enabled. The CLI will prompt you to activate a license at first run. Use code BENCHMARK at vexp.dev/#pricing f