eval-harness — everything-github-copilot · community · IDE skills

v1.0.0

About This Skill

rewrite everything-claude-code for github-copilot

j7-dev
Updated: 4/2/2026

Killer-Skills Review

Decision support comes first. Repository text comes second.

Reference-Only Page Review Score: 1/11

This page remains useful for operators, but Killer-Skills treats it as reference material instead of a primary organic landing page.

Review Score: 1/11
Quality Score: 34
Canonical Locale: en
Detected Body Locale: en


Core Value

rewrite everything-claude-code for github-copilot

Supported Agent Types

Suitable for operator workflows that need explicit guardrails before installation and execution.

Key Capabilities Granted · eval-harness

! Usage Restrictions & Prerequisites

Why this page is reference-only

  • Current locale does not satisfy the locale-governance contract.
  • The page lacks a strong recommendation layer.
  • The page lacks concrete use-case guidance.
  • The page lacks explicit limitations or caution signals.
  • The underlying skill quality score is below the review floor.

Source Boundary

The section below is imported from the upstream repository and should be treated as secondary evidence. Use the Killer-Skills review above as the primary layer for fit, risk, and installation decisions.

Next Steps After Review

Decide on an action first, then continue with the upstream repository material

Killer-Skills' primary value should not stop at opening the repository README for you. It should first help you judge whether this skill is worth installing, whether it should go back to the trusted set for re-review, and whether it is already at the stage of being deployed into a workflow.

Lab Demo

Browser Sandbox Environment

⚡️ Ready to unleash?

Experience this Agent in a zero-setup browser environment powered by WebContainers. No installation required.


FAQ & Installation Steps

The questions and steps below are consistent with the page's structured data, so search engines can understand the page content.

? FAQ

What is eval-harness?

rewrite everything-claude-code for github-copilot

How do I install eval-harness?

Run the command: npx killer-skills add j7-dev/everything-github-copilot/eval-harness. 19+ IDEs/agents are supported, including Cursor, Windsurf, VS Code, and Claude Code.

Which IDEs or agents does eval-harness support?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. A single Killer-Skills CLI command installs it universally.

Installation Steps

  1. Open a terminal

    Open a terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add j7-dev/everything-github-copilot/eval-harness. The CLI auto-detects your IDE or AI agent and completes the configuration.

  3. Start using the skill

    eval-harness is now enabled and can be invoked immediately in the current project.

! Reference-Page Mode

This page can still be used as an installation and lookup reference, but Killer-Skills no longer treats it as a primary indexable landing page. Read the review conclusions above first, then decide whether to continue with the upstream repository material.

Upstream Repository Material


Upstream Source

eval-harness

Install eval-harness, an AI Agent Skill for AI agent workflows and automation. See the review conclusions, use cases, and installation paths.

SKILL.md
Readonly
Supporting Evidence

Eval Harness Skill

A formal evaluation framework for Copilot CLI sessions, implementing eval-driven development (EDD) principles.

When to Activate

  • Setting up eval-driven development (EDD) for AI-assisted workflows
  • Defining pass/fail criteria for Copilot CLI task completion
  • Measuring agent reliability with pass@k metrics
  • Creating regression test suites for prompt or agent changes
  • Benchmarking agent performance across model versions

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

  • Define expected behavior BEFORE implementation
  • Run evals continuously during development
  • Track regressions with each change
  • Use pass@k metrics for reliability measurement

Eval Types

Capability Evals

Test if Claude can do something it couldn't before:

```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
```

Regression Evals

Ensure changes don't break existing functionality:

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

Grader Types

1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
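A small runner can execute a set of such shell graders and tally PASS/FAIL by exit code. A minimal sketch (the function name and the grader commands passed to it are illustrative, not part of the skill):

```python
import subprocess

def run_code_graders(graders: dict[str, str]) -> dict[str, str]:
    """Run each shell-command grader; a grader PASSes iff its command exits 0.

    `graders` maps an eval name to a shell command, e.g. the grep,
    npm test, and npm build checks shown above.
    """
    results = {}
    for name, command in graders.items():
        proc = subprocess.run(command, shell=True, capture_output=True)
        results[name] = "PASS" if proc.returncode == 0 else "FAIL"
    return results
```

For example, `run_code_graders({"auth-export": 'grep -q "export function handleAuth" src/auth.ts'})` would report PASS only when the pattern is present, mirroring the one-liners above.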

2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```

3. Human Grader

Flag for manual review:

```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```

Metrics

pass@k

"At least one success in k attempts"

  • pass@1: First attempt success rate
  • pass@3: Success within 3 attempts
  • Typical target: pass@3 > 90%

pass^k

"All k trials succeed"

  • Higher bar for reliability
  • pass^3: 3 consecutive successes
  • Use for critical paths
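Both metrics can be estimated from n recorded trials with c successes. A minimal sketch using the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), and a naive independence assumption for pass^k (the function names are mine, not part of the skill):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one success in k attempts),
    given c successes observed over n trials (n >= k)."""
    if n - c < k:
        return 1.0  # too few observed failures to fill all k attempts
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Naive estimate of P(all k attempts succeed), treating attempts
    as independent with per-trial success rate c/n."""
    return (c / n) ** k
```

With 2 successes in 4 trials, pass_at_k(4, 2, 2) gives 5/6 while pass_hat_k(4, 2, 3) gives 0.125, illustrating why pass^k is the higher bar.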

Eval Workflow

1. Define (Before Coding)

```markdown
## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

2. Implement

Write code to pass the defined evals.

3. Evaluate

```bash
# Run capability evals
[Run each capability eval, record PASS/FAIL]

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
```

4. Report

```markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user: PASS (pass@1)
  validate-email: PASS (pass@2)
  hash-password: PASS (pass@1)
  Overall: 3/3 passed

Regression Evals:
  login-flow: PASS
  session-mgmt: PASS
  logout-flow: PASS
  Overall: 3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
```

Integration Patterns

Pre-Implementation

/eval define feature-name

Creates eval definition file at .copilot/evals/feature-name.md

During Implementation

/eval check feature-name

Runs current evals and reports status

Post-Implementation

/eval report feature-name

Generates full eval report

Eval Storage

Store evals in project:

.copilot/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines

Best Practices

  1. Define evals BEFORE coding - Forces clear thinking about success criteria
  2. Run evals frequently - Catch regressions early
  3. Track pass@k over time - Monitor reliability trends
  4. Use code graders when possible - Deterministic > probabilistic
  5. Human review for security - Never fully automate security checks
  6. Keep evals fast - Slow evals don't get run
  7. Version evals with code - Evals are first-class artifacts

Example: Adding Authentication

```markdown
## EVAL: add-authentication

### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session

Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible

### Phase 2: Implement (varies)
[Write code]

### Phase 3: Evaluate
Run: /eval check add-authentication

### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```

Related Skills

Looking for an alternative to eval-harness, or similar community skills to pair with it? Explore the related open-source skills below.

View All

openclaw-release-maintainer (openclaw)

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

333.8k · AI

widget-generator (f)

Generates customizable plugin widgets for the prompts.chat feedback system

149.6k · AI

flags (vercel)

React framework

138.4k · Browser

pr-review (pytorch)

Tensors and dynamic neural networks in Python with strong GPU acceleration

98.6k · Developer Tools