loaditout.ai

agent-config-arena

MCP Tool

SingggggYee/agent-config-arena

Neutral benchmark arena that tests CLAUDE.md, AGENTS.md, and agent configs against each other on real coding tasks.

Install

$ npx loaditout add SingggggYee/agent-config-arena

Platform-specific configuration:

.claude/settings.json
{
  "mcpServers": {
    "agent-config-arena": {
      "command": "npx",
      "args": [
        "-y",
        "agent-config-arena"
      ]
    }
  }
}

Add the config above to .claude/settings.json under the mcpServers key.

About

Agent Config Arena

> Everyone shares their CLAUDE.md. Nobody benchmarks them. Until now.


Results: 3 Configs × 8 Tasks × Claude Code

| Config | Pass Rate | Avg Tokens | Avg Time | Avg Cost | Score |
|----------------------|-----------|------------|----------|----------|-------|
| token-efficient | 88% | 208k | 73.7s | $0.28 | 44 |
| workflow-heavy | 86% | 205k | 90.0s | $0.31 | 40 |
| baseline (no config) | 88% | 201k | 112.9s | $0.33 | 36 |

> Tested on 8 real coding tasks (REST API, refactoring, bug fix, CLI tool, data pipeline, test coverage, TS migration, performance optimization). Full results in LEADERBOARD.md.

Surprising findings
  • "Token-efficient" config is fastest and cheapest, but doesn't actually use fewer tokens. It wins on speed (35% faster than baseline) and cost, not token count.
  • Zero config (baseline) uses the fewest tokens. Adding instructions makes the model *more* verbose, not less.
  • "Workflow-heavy" (plan-first, TDD) has the lowest pass rate. More structure doesn't mean more correct.
  • All 3 configs failed the data pipeline task. The hardest tasks expose config limitations equally.
  • The biggest gap is speed, not accuracy. Pass rates are within 2%, but time ranges from 73s to 113s.
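The speed claim in the first bullet can be checked directly against the results table. A small sketch, with the numbers copied from the table above (field names are my own, not the repo's API):

```typescript
// Per-config averages copied from the results table (Claude Code, 8 tasks).
const results = [
  { config: "token-efficient", passRate: 0.88, avgSeconds: 73.7, avgCostUsd: 0.28 },
  { config: "workflow-heavy",  passRate: 0.86, avgSeconds: 90.0, avgCostUsd: 0.31 },
  { config: "baseline",        passRate: 0.88, avgSeconds: 112.9, avgCostUsd: 0.33 },
];

const baseline = results.find((r) => r.config === "baseline")!;

// Speedup relative to baseline: 1 - (config time / baseline time).
for (const r of results) {
  const speedup = 1 - r.avgSeconds / baseline.avgSeconds;
  console.log(`${r.config}: ${(speedup * 100).toFixed(0)}% faster than baseline`);
}
// token-efficient works out to ~35% faster, matching the bullet above.
```

The same pattern applies to the cost column: token-efficient saves about $0.05 per task versus baseline, while pass rates stay within 2 points.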

---

What is this?

A neutral, open-source benchmark that tests different coding agent configurations on the same real coding tasks -- then publishes the results.

Not a model benchmark: SWE-bench tests models; we test *configs*. Not a tool collection: awesome-claude-code collects tools; we *evaluate* them. Not a single config: claude-token-efficient ships one config; we pit configs *against each other*.
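The arena idea reduces to a cross-product loop: every config runs every task, and the same metrics are recorded for each run. A hypothetical sketch of that shape (`TaskRunner` and the field names are stand-ins, not the repo's actual API):

```typescript
// Metrics recorded for one config × task run.
interface RunResult {
  passed: boolean;
  tokens: number;
  seconds: number;
  costUsd: number;
}

// Stand-in for the real task runner; the repo's actual interface may differ.
type TaskRunner = (config: string, task: string) => RunResult;

// Run every config against every task so results are directly comparable.
function runArena(configs: string[], tasks: string[], run: TaskRunner) {
  const byConfig = new Map<string, RunResult[]>();
  for (const config of configs) {
    byConfig.set(config, tasks.map((task) => run(config, task)));
  }
  return byConfig;
}

// Pass rate for one config = fraction of its task runs that passed.
function passRate(runs: RunResult[]): number {
  return runs.filter((r) => r.passed).length / runs.length;
}
```

The key design point is neutrality: because every config sees the identical task list and runner, any difference in pass rate, tokens, time, or cost is attributable to the config alone.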

Tags

ai-coding, arena, benchmark, claude-code, claude-md, codex, coding-agent, config, developer-tools, mcp


Quality Signals

Installs: 0
Last updated: 16 days ago
Security: A
README: present

Safety

Risk Level: medium
Data Access: read
Network Access: none

Details

Source: github-crawl
Last commit: 4/3/2026
View on GitHub→

Embed Badge

[![Loaditout](https://loaditout.ai/api/badge/SingggggYee/agent-config-arena)](https://loaditout.ai/skills/SingggggYee/agent-config-arena)