agent-eval — an AI agent skill from affaan-m's everything-claude-code collection

Verified
v1.0.0
GitHub

About this Skill

Perfect for coding agents that need comprehensive performance comparisons and data-driven decision-making on custom tasks. Provides head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, with pass rate, cost, time, and consistency metrics.

affaan-m
Updated: 3/26/2026

Quality Score: 74 (Excellent, Top 5%) — based on code quality & docs
Installation
Universal Install (Auto-Detect)
> npx killer-skills add affaan-m/everything-claude-code/agent-eval
Supports 19+ Platforms
Cursor
Windsurf
VS Code
Trae
Claude
OpenClaw
+12 more

Agent Capability Analysis

The agent-eval skill by affaan-m is an open-source official AI agent skill for Claude Code and other IDE workflows, helping agents execute tasks with better context, repeatability, and domain-specific guidance.

Ideal Agent Persona

Perfect for coding agents that need comprehensive performance comparisons and data-driven decision-making on custom tasks.

Core Value

Empowers agents to conduct head-to-head comparisons of coding agents such as Claude Code, Aider, and Codex, using metrics like pass rate, cost, time, and consistency. A lightweight CLI tool keeps tasks reproducible, enabling data-backed agent selection decisions.

Capabilities Granted for agent-eval

Comparing coding agents on custom codebases
Measuring agent performance before adopting new tools or models
Running regression checks when an agent updates its model or tooling

Prerequisites & Limits

  • Requires custom tasks for comparison
  • Limited to coding agents with CLI compatibility
  • Dependent on agent-specific metrics for evaluation
Project files: SKILL.md (4.2 KB), .cursorrules (1.2 KB), package.json (240 B)

SKILL.md

Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.

When to Activate

  • Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
  • Measuring agent performance before adopting a new tool or model
  • Running regression checks when an agent updates its model or tooling
  • Producing data-backed agent selection decisions for a team

Installation

Note: Install agent-eval from its repository after reviewing the source.

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```
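As a sketch of how such a parsed task might be represented and validated in code — the `Task` dataclass, its field names (mirroring the YAML keys above), and the validation rules are all hypothetical, not agent-eval's actual data model:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    # Field names mirror the YAML keys above; the class itself is illustrative.
    name: str
    description: str
    repo: str
    prompt: str
    files: list = field(default_factory=list)
    judge: list = field(default_factory=list)
    commit: Optional[str] = None

    def validate(self) -> list:
        """Return a list of problems; empty means the task looks runnable."""
        problems = []
        if not self.judge:
            problems.append("task needs at least one judge")
        if self.commit is None:
            problems.append("pin a commit for reproducibility")
        return problems

task = Task(
    name="add-retry-logic",
    description="Add exponential backoff retry to the HTTP client",
    repo="./my-project",
    prompt="Add retry logic with exponential backoff to all HTTP requests.",
    files=["src/http_client.py"],
    judge=[{"type": "pytest", "command": "pytest tests/test_http_client.py -v"}],
    commit="abc1234",
)
print(task.validate())  # → []
```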

Git Worktree Isolation

Each agent run gets its own git worktree — no Docker required. This isolation keeps runs reproducible: agents cannot interfere with each other or corrupt the base repo.

Metrics Collected

| Metric      | What It Measures                                   |
|-------------|----------------------------------------------------|
| Pass rate   | Did the agent produce code that passes the judge?  |
| Cost        | API spend per task (when available)                |
| Time        | Wall-clock seconds to completion                   |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%)  |
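To make the table concrete, here is a hypothetical sketch of how the four metrics might be aggregated from per-run records; the record field names (`passed`, `cost`, `seconds`) are assumptions for illustration, not agent-eval's real output format:

```python
from statistics import mean

# One record per run; field names are illustrative.
runs = [
    {"passed": True, "cost": 0.11, "seconds": 44},
    {"passed": True, "cost": 0.13, "seconds": 47},
    {"passed": True, "cost": 0.12, "seconds": 44},
]

passes = sum(r["passed"] for r in runs)
pass_rate = f"{passes}/{len(runs)}"            # e.g. "3/3"
consistency = 100 * passes / len(runs)         # pass rate across repeated runs
avg_cost = mean(r["cost"] for r in runs)       # API spend per task
avg_time = mean(r["seconds"] for r in runs)    # wall-clock seconds

print(pass_rate, f"${avg_cost:.2f}", f"{avg_time:.0f}s", f"{consistency:.0f}%")
# → 3/3 $0.12 45s 100%
```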

Workflow

1. Define Tasks

Create a tasks/ directory with YAML files, one per task:

```bash
mkdir tasks
# Write task definitions (see template above)
```

2. Run Agents

Execute agents against your tasks:

```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```

Each run:

  1. Creates a fresh git worktree from the specified commit
  2. Hands the prompt to the agent
  3. Runs the judge criteria
  4. Records pass/fail, cost, and time
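The four steps above suggest a per-run loop along these lines — a hedged sketch only, where `create_worktree`, `run_agent`, and `run_judges` are stand-in stubs, not agent-eval's actual API:

```python
import time

def create_worktree(repo: str, commit: str) -> str:
    # Stand-in for `git worktree add <dir> <commit>` into a fresh directory.
    return f"{repo}-worktree-{commit}"

def run_agent(agent: str, prompt: str, worktree: str) -> dict:
    # Stand-in: hand the prompt to the agent CLI inside the worktree.
    return {"cost": 0.12}

def run_judges(judges: list, worktree: str) -> bool:
    # Stand-in: run every judge; the task passes only if all of them pass.
    return all(j.get("expect_pass", True) for j in judges)

def run_once(agent: str, task: dict) -> dict:
    worktree = create_worktree(task["repo"], task["commit"])  # 1. fresh worktree
    start = time.monotonic()
    result = run_agent(agent, task["prompt"], worktree)       # 2. agent works
    passed = run_judges(task["judge"], worktree)              # 3. judge criteria
    return {                                                  # 4. record results
        "passed": passed,
        "cost": result["cost"],
        "seconds": time.monotonic() - start,
    }

record = run_once("claude-code", {
    "repo": "./my-project", "commit": "abc1234",
    "prompt": "Add retry logic", "judge": [{}],
})
print(record["passed"])  # → True
```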

3. Compare Results

Generate a comparison report:

```bash
agent-eval report --format table
```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-Based (deterministic)

```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```

Pattern-Based

```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```

Model-Based (LLM-as-judge)

```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```
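To show how the three judge types might dispatch, here is a hypothetical sketch: the deterministic `command` judge shells out and checks the exit code, the `grep` judge is approximated with Python's `re` module, and the LLM judge is stubbed out. None of this is agent-eval's real implementation.

```python
import re
import subprocess

def judge_command(command: str) -> bool:
    # Deterministic: pass iff the command exits 0 (e.g. pytest, npm run build).
    return subprocess.run(command, shell=True).returncode == 0

def judge_grep(pattern: str, text: str) -> bool:
    # Pattern-based: pass iff the regex matches the file contents.
    return re.search(pattern, text) is not None

def run_judge(judge: dict, file_text: str = "") -> bool:
    if judge["type"] == "command":
        return judge_command(judge["command"])
    if judge["type"] == "grep":
        return judge_grep(judge["pattern"], file_text)
    if judge["type"] == "llm":
        raise NotImplementedError("LLM-as-judge needs a model call")
    raise ValueError(f"unknown judge type: {judge['type']}")

src = "def request(url):\n    return exponential_backoff(retry)"
print(run_judge({"type": "grep", "pattern": "exponential_backoff|retry"}, src))  # → True
```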

Best Practices

  • Start with 3-5 tasks that represent your real workload, not toy examples
  • Run at least 3 trials per agent to capture variance — agents are non-deterministic
  • Pin the commit in your task YAML so results are reproducible across days/weeks
  • Include at least one deterministic judge (tests, build) per task — LLM judges add noise
  • Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice
  • Version your task definitions — they are test fixtures, treat them as code


FAQ & Installation Steps


Frequently Asked Questions

What is agent-eval?

agent-eval is a lightweight CLI skill for head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, reporting pass rate, cost, time, and consistency metrics. It is ideal for coding agents that need comprehensive performance comparisons and data-driven decision-making.

How do I install agent-eval?

Run the command: npx killer-skills add affaan-m/everything-claude-code/agent-eval. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for agent-eval?

Key use cases include: Comparing coding agents on custom codebases, Measuring agent performance before adopting new tools or models, Running regression checks when an agent updates its model or tooling.

Which IDEs are compatible with agent-eval?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for agent-eval?

Requires custom tasks for comparison. Limited to coding agents with CLI compatibility. Dependent on agent-specific metrics for evaluation.

How To Install

  1. Open your terminal

     Open the terminal or command line in your project directory.

  2. Run the install command

     Run: npx killer-skills add affaan-m/everything-claude-code/agent-eval. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

     The skill is now active. Your AI agent can use agent-eval immediately in the current project.

Related Skills

Looking for an alternative to agent-eval or another official skill for your workflow? Explore these related open-source skills.


flags (facebook)

Use when you need to check feature flag states, compare channels, or debug why a feature behaves differently across release channels.

extract-errors (facebook)

extract-errors is a React error-handling skill that automates extracting and assigning error codes, ensuring accurate and up-to-date error messages in React applications.

fix (facebook)

fix is a code optimization skill that automates formatting and linting using yarn prettier and linc.

flow (facebook)

Use when you need to run Flow type checking, or when seeing Flow type errors in React code.