doanbactam/eval-bench
Which AI model is actually best for YOUR codebase? Record real tasks. Compare models. Stop guessing.
Platform-specific configuration:
{
  "mcpServers": {
    "eval-bench": {
      "command": "npx",
      "args": [
        "-y",
        "eval-bench"
      ]
    }
  }
}

Add the config above to .claude/settings.json under the mcpServers key.
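If you would rather script that step than edit the file by hand, the merge can be sketched as below. This is a minimal illustration, not part of eval-bench: the `addEvalBench` helper and the `Settings` type are hypothetical names, and the sketch assumes the settings file has already been parsed into a plain object.

```typescript
// Hypothetical helper, not part of eval-bench: merges the server entry from
// the config snippet into an already-parsed settings object.
type McpServer = { command: string; args: string[] };
type Settings = {
  mcpServers?: Record<string, McpServer>;
  [key: string]: unknown;
};

// Same values as the JSON config shown in this README.
const evalBenchEntry: McpServer = { command: "npx", args: ["-y", "eval-bench"] };

function addEvalBench(settings: Settings): Settings {
  // Spread into new objects so the caller's settings are left unmodified
  // and any servers that are already registered are preserved.
  return {
    ...settings,
    mcpServers: { ...(settings.mcpServers ?? {}), "eval-bench": evalBenchEntry },
  };
}

// Example input: a settings object that already registers another server.
// The "other" server and the "theme" key are made up for illustration.
const existing: Settings = {
  theme: "dark",
  mcpServers: { other: { command: "node", args: ["server.js"] } },
};

const merged = addEvalBench(existing);
console.log(JSON.stringify(merged, null, 2));
```

Returning a new object rather than mutating the input makes it easy to diff the result against the original before writing anything back to disk.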
An MCP server that sits alongside your coding agent. It records task metadata — NOT code — and shows you which model performs best on YOUR specific task types.
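To make the idea concrete, here is a rough sketch of how a "best model per task type" ranking could be derived from metadata-only records. The `TaskRecord` shape, the field names, and the tie-breaking rule are assumptions made for illustration. This README does not show eval-bench's actual schema or ranking logic.

```typescript
// Hypothetical record shape: metadata only, no code or prompts.
type TaskRecord = {
  taskType: string;
  model: string;
  success: boolean;
  minutes: number;
};

type ModelStats = {
  model: string;
  tasks: number;
  successRate: number;
  avgMinutes: number;
};

function bestModelPerTaskType(records: TaskRecord[]): Map<string, ModelStats> {
  // Accumulate counts for every (taskType, model) pair.
  const totals = new Map<
    string,
    { taskType: string; model: string; n: number; ok: number; minutes: number }
  >();
  for (const r of records) {
    const key = JSON.stringify([r.taskType, r.model]);
    const t =
      totals.get(key) ?? { taskType: r.taskType, model: r.model, n: 0, ok: 0, minutes: 0 };
    t.n += 1;
    if (r.success) t.ok += 1;
    t.minutes += r.minutes;
    totals.set(key, t);
  }

  // For each task type keep the model with the highest success rate,
  // breaking ties in favour of the lower average time (an assumed rule).
  const best = new Map<string, ModelStats>();
  totals.forEach((t) => {
    const stats: ModelStats = {
      model: t.model,
      tasks: t.n,
      successRate: t.ok / t.n,
      avgMinutes: t.minutes / t.n,
    };
    const current = best.get(t.taskType);
    if (
      current === undefined ||
      stats.successRate > current.successRate ||
      (stats.successRate === current.successRate && stats.avgMinutes < current.avgMinutes)
    ) {
      best.set(t.taskType, stats);
    }
  });
  return best;
}

// Made-up sample data with placeholder model names.
const sample: TaskRecord[] = [
  { taskType: "Tests", model: "model-a", success: true, minutes: 3 },
  { taskType: "Tests", model: "model-a", success: true, minutes: 2 },
  { taskType: "Tests", model: "model-b", success: false, minutes: 4 },
  { taskType: "Debugging", model: "model-b", success: true, minutes: 8 },
  { taskType: "Debugging", model: "model-a", success: false, minutes: 6 },
];

const result = bestModelPerTaskType(sample);
result.forEach((s, taskType) => {
  console.log(taskType, s.model, `${Math.round(s.successRate * 100)}%`, `${s.avgMinutes} min`);
});
```

With only a handful of tasks per model a ranking like this is noisy, so a real implementation would probably also want a minimum sample size before declaring a winner.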
NEVER recorded:
Recorded (metadata only):
All data stored locally at ~/.eval-bench/data.db. Nothing is uploaded without explicit opt-in.
bun install
bun src/index.tsx install

This adds the MCP server to:
~/.config/opencode/config.json
~/.claude.json

# View your personal dashboard (static)
bun src/index.tsx report
# Live dashboard with real-time updates (NEW!)
bun src/index.tsx watch
# Show all recorded data (privacy transparency)
bun src/index.tsx show-data
# Compare two models
bun src/index.tsx compare claude-sonnet-4-6 claude-opus-4-6

Your AI model performance — last 30 days
────────────────────────────────────────
Tasks recorded: 127 | Repo: TypeScript, large
┌────────────────────────────────────────────────────────────────┐
│ Task Type       Best Model      Success Rate    Avg Time       │
├────────────────────────────────────────────────────────────────┤
│ Refactoring     Sonnet 4.6 ✓    84%             4.2 min        │
│ Debugging       Opus 4.6 ✓      79%             7.8 min        │
│ New Feature     Opus 4.6 ✓      68%             12.1 min       │
│ Tests           Sonnet 4.6 ✓    91%             2.9 min        │
│ Documentation   Sonnet 4.6 ✓    96%             1.8 min        │
└────────────────────────────────────────────────────────────────┘
Recommendation:
Use Sonnet for refac