eval-harness — community eval-harness, everything-github-copilot, community, ide skills, Claude Code, Cursor, Windsurf

v1.0.0

About this skill

rewrite everything-claude-code for github-copilot

j7-dev
Updated: 4/2/2026

Killer-Skills Review

Decision support comes first. Repository text comes second.

Reference-Only Page Review Score: 1/11

This page remains useful for operators, but Killer-Skills treats it as reference material instead of a primary organic landing page.

Review Score
1/11
Quality Score
34
Canonical Locale
en
Detected Body Locale
en


Why use this skill

rewrite everything-claude-code for github-copilot

Best suited for

Suitable for operator workflows that need explicit guardrails before installation and execution.

Use cases for eval-harness

! Safety and limitations

Why this page is reference-only

  • Current locale does not satisfy the locale-governance contract.
  • The page lacks a strong recommendation layer.
  • The page lacks concrete use-case guidance.
  • The page lacks explicit limitations or caution signals.
  • The underlying skill quality score is below the review floor.

Source Boundary

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.

Labs Demo

Browser Sandbox Environment

⚡️ Ready to unleash?

Experience this Agent in a zero-setup browser environment powered by WebContainers. No installation required.

Boot Container Sandbox

FAQ & Installation Steps

These questions and steps mirror the structured data on this page for better search understanding.

? Frequently Asked Questions

What is eval-harness?

rewrite everything-claude-code for github-copilot

How do I install eval-harness?

Run the command: npx killer-skills add j7-dev/everything-github-copilot/eval-harness. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

Which IDEs are compatible with eval-harness?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add j7-dev/everything-github-copilot/eval-harness. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use eval-harness immediately in the current project.

! Reference-Only Mode

This page remains useful for installation and reference, but Killer-Skills no longer treats it as a primary indexable landing page. Read the review above before relying on the upstream repository instructions.

Imported Repository Instructions

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.

Supporting Evidence

eval-harness

Install eval-harness, an AI agent skill for AI agent workflows and automation. Works with Claude Code, Cursor, and Windsurf with one-command setup.

SKILL.md
Readonly

Eval Harness Skill

A formal evaluation framework for Copilot CLI sessions, implementing eval-driven development (EDD) principles.

When to Activate

  • Setting up eval-driven development (EDD) for AI-assisted workflows
  • Defining pass/fail criteria for Copilot CLI task completion
  • Measuring agent reliability with pass@k metrics
  • Creating regression test suites for prompt or agent changes
  • Benchmarking agent performance across model versions

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

  • Define expected behavior BEFORE implementation
  • Run evals continuously during development
  • Track regressions with each change
  • Use pass@k metrics for reliability measurement

Eval Types

Capability Evals

Test if Claude can do something it couldn't before:

```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result
```

Regression Evals

Ensure changes don't break existing functionality:

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

Grader Types

1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
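A small wrapper can run several code-based graders and print a uniform PASS/FAIL line for each. This is an illustrative sketch, not part of the skill: `run_grader` is a hypothetical helper, and the grader commands mirror the examples above (adjust paths for your project).

```shell
#!/bin/sh
# Sketch only: run_grader is a hypothetical helper, not something
# the skill ships. A zero exit status from the grader counts as PASS.
run_grader() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: PASS"
  else
    echo "$name: FAIL"
  fi
}

# Graders mirror the examples above; paths are project-specific.
run_grader "auth-export" grep -q "export function handleAuth" src/auth.ts
run_grader "unit-tests"  npm test -- --testPathPattern="auth"
run_grader "build"       npm run build
```

Because each grader reduces to an exit status, the same wrapper works for grep checks, test runs, and builds alike.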

2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```

3. Human Grader

Flag for manual review:

```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```

Metrics

pass@k

"At least one success in k attempts"

  • pass@1: First attempt success rate
  • pass@3: Success within 3 attempts
  • Typical target: pass@3 > 90%

pass^k

"All k trials succeed"

  • Higher bar for reliability
  • pass^3: 3 consecutive successes
  • Use for critical paths
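Both metrics can be checked mechanically once trial results are recorded. A minimal sketch in shell, assuming results are logged as `P`/`F` tokens; the function names are illustrative, not part of the skill:

```shell
#!/bin/sh
# Sketch: pass@k succeeds if at least one trial passed ("P" present);
# pass^k succeeds only if no trial failed ("F" absent).
pass_at_k() {
  case " $* " in
    *" P "*) return 0 ;;
    *)       return 1 ;;
  esac
}
pass_hat_k() {
  case " $* " in
    *" F "*) return 1 ;;
    *)       return 0 ;;
  esac
}

pass_at_k  P F F && echo "pass@3: met"      # one success in three attempts
pass_hat_k P F F || echo "pass^3: not met"  # would need three consecutive passes
```

The same trial log (`P F F`) satisfies pass@3 but not pass^3, which is why pass^k is the stricter bar for critical paths.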

Eval Workflow

1. Define (Before Coding)

```markdown
## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

2. Implement

Write code to pass the defined evals.

3. Evaluate

```bash
# Run capability evals
[Run each capability eval, record PASS/FAIL]

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
```

4. Report

```markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user: PASS (pass@1)
  validate-email: PASS (pass@2)
  hash-password: PASS (pass@1)
  Overall: 3/3 passed

Regression Evals:
  login-flow: PASS
  session-mgmt: PASS
  logout-flow: PASS
  Overall: 3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
```

Integration Patterns

Pre-Implementation

/eval define feature-name

Creates eval definition file at .copilot/evals/feature-name.md

During Implementation

/eval check feature-name

Runs current evals and reports status

Post-Implementation

/eval report feature-name

Generates full eval report
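While prototyping, the three commands above could be approximated with a plain shell helper. Everything below is a hypothetical sketch: `eval_cmd` and the file it writes are assumptions, not the skill's actual implementation (the real commands run inside the Copilot CLI):

```shell
#!/bin/sh
# Hypothetical stand-in for the /eval slash commands; illustrative only.
EVAL_DIR=".copilot/evals"

eval_cmd() {
  action=$1 feature=$2
  case "$action" in
    define)
      mkdir -p "$EVAL_DIR"
      printf '## EVAL DEFINITION: %s\n' "$feature" > "$EVAL_DIR/$feature.md"
      echo "created $EVAL_DIR/$feature.md"
      ;;
    check)
      if [ -f "$EVAL_DIR/$feature.md" ]; then
        echo "evals defined for $feature"
      else
        echo "no evals defined for $feature"
      fi
      ;;
    *)
      echo "usage: eval_cmd {define|check} feature-name" >&2
      return 2
      ;;
  esac
}

eval_cmd define demo-feature
eval_cmd check  demo-feature
```

Writing definitions under `.copilot/evals/` matches the storage layout the skill describes below.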

Eval Storage

Store evals in project:

.copilot/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
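One way to use `baseline.json` is to record the pass count at a known-good commit and compare later runs against it. A sketch under assumed file contents; the JSON shape here is an invention, since the skill does not specify the baseline format:

```shell
#!/bin/sh
# Sketch: record and read back a regression baseline.
# The {"sha": ..., "passed": ...} shape is an assumption.
BASE=".copilot/evals/baseline.json"

record_baseline() {  # record_baseline <sha> <passed-count>
  mkdir -p "$(dirname "$BASE")"
  printf '{"sha":"%s","passed":%d}\n' "$1" "$2" > "$BASE"
}

baseline_passed() {  # print the recorded pass count
  sed -n 's/.*"passed":\([0-9]*\).*/\1/p' "$BASE"
}

record_baseline abc123 3
[ "$(baseline_passed)" -ge 3 ] && echo "no regression against abc123"
```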

Best Practices

  1. Define evals BEFORE coding - Forces clear thinking about success criteria
  2. Run evals frequently - Catch regressions early
  3. Track pass@k over time - Monitor reliability trends
  4. Use code graders when possible - Deterministic > probabilistic
  5. Human review for security - Never fully automate security checks
  6. Keep evals fast - Slow evals don't get run
  7. Version evals with code - Evals are first-class artifacts

Example: Adding Authentication

```markdown
## EVAL: add-authentication

### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session

Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible

### Phase 2: Implement (varies)
[Write code]

### Phase 3: Evaluate
Run: /eval check add-authentication

### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```

Related skills

Looking for an alternative to eval-harness or another community skill for your workflow? Explore these related open-source skills.


openclaw-release-maintainer

openclaw

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

widget-generator

f

Create customizable widget plugins for the prompts.chat news-feed system

flags

vercel

The React Framework


pr-review

pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
