eval-harness: a formal evaluation framework skill for Copilot CLI (from everything-github-copilot, community)

v1.0.0

About This Skill

Recommended scenario: ideal for AI agents that need a formal evaluation harness. Summary: a rewrite of everything-claude-code for GitHub Copilot. The Eval Harness Skill is a formal evaluation framework for Copilot CLI sessions, implementing eval-driven development (EDD) principles. This AI agent skill supports Claude Code, Cursor, and Windsurf workflows.

Features

Eval Harness Skill
A formal evaluation framework for Copilot CLI sessions, implementing eval-driven development (EDD) principles.

  • Setting up eval-driven development (EDD) for AI-assisted workflows
  • Defining pass/fail criteria for Copilot CLI task completion
  • Measuring agent reliability with pass@k metrics

Main Topics

j7-dev
Updated: 4/2/2026

Skill Overview

Start with fit, limitations, and setup before diving into the repository.

eval-harness is a formal evaluation framework for Copilot CLI sessions, implementing eval-driven development (EDD) principles. It is part of everything-github-copilot, a rewrite of everything-claude-code for GitHub Copilot.

Why Use This Skill

Recommendation: eval-harness equips agents with a formal evaluation framework for Copilot CLI sessions, built on eval-driven development (EDD) principles.

Best For

Recommended scenario: ideal for AI agents that need an eval harness skill.

Practical Use Cases for eval-harness

Use case: applying the Eval Harness Skill to Copilot CLI sessions
Use case: applying a formal evaluation framework with eval-driven development (EDD)
Use case: setting up eval-driven development (EDD) for AI-assisted workflows

Security and Limitations

  • Limitation: Ensure changes don't break existing functionality.
  • Limitation: Requires repository-specific context from the skill documentation.
  • Limitation: Works best when the underlying tools and dependencies are already configured.


FAQ and Installation Steps

Frequently Asked Questions

What is eval-harness?

eval-harness is a formal evaluation framework for Copilot CLI sessions that implements eval-driven development (EDD) principles. It supports Claude Code, Cursor, Windsurf, and other agent workflows.

How do I install eval-harness?

Run the command: npx killer-skills add j7-dev/everything-github-copilot. It works with Cursor, Windsurf, VS Code, Claude Code, and more than 19 other IDEs.

What are the use cases for eval-harness?

The main use cases include applying the Eval Harness Skill to Copilot CLI sessions, applying a formal evaluation framework with eval-driven development (EDD), and setting up EDD for AI-assisted workflows.

Which IDEs are compatible with eval-harness?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for a unified installation.

Does eval-harness have limitations?

Limitation: Ensure changes don't break existing functionality. Limitation: Requires repository-specific context from the skill documentation. Limitation: Works best when the underlying tools and dependencies are already configured.

How to Install This Skill

  1. Open the terminal

    Open the terminal or command line in your project directory.

  2. Run the installation command

    Run: npx killer-skills add j7-dev/everything-github-copilot. The CLI will automatically detect your IDE or agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use eval-harness immediately in the current project.

Source Notes

This page is still useful for installation and source reference. Before using it, compare the fit, limitations, and upstream repository notes above.

Upstream Repository Material

The section below comes from the upstream repository. Use it as supporting material alongside the fit, use-case, and installation summary on this page.

Upstream Source

SKILL.md

Eval Harness Skill

A formal evaluation framework for Copilot CLI sessions, implementing eval-driven development (EDD) principles.

When to Activate

  • Setting up eval-driven development (EDD) for AI-assisted workflows
  • Defining pass/fail criteria for Copilot CLI task completion
  • Measuring agent reliability with pass@k metrics
  • Creating regression test suites for prompt or agent changes
  • Benchmarking agent performance across model versions

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

  • Define expected behavior BEFORE implementation
  • Run evals continuously during development
  • Track regressions with each change
  • Use pass@k metrics for reliability measurement

Eval Types

Capability Evals

Test if Claude can do something it couldn't before:

```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
```

Regression Evals

Ensure changes don't break existing functionality:

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

Grader Types

1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
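
These one-off checks can be wrapped in a small helper that appends each result to the feature's run history under .copilot/evals/ (see Eval Storage below). A minimal sketch, assuming a hypothetical run_eval helper and a log format the skill does not prescribe:

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a named grader command and append
# PASS/FAIL to the feature's eval log (log format is an assumption).
run_eval() {
  local feature="$1" name="$2"
  shift 2
  local result="FAIL"
  "$@" && result="PASS"
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $name: $result" >> ".copilot/evals/$feature.log"
  echo "$name: $result"
}

# Usage with the graders above:
run_eval feature-xyz pattern-check grep -q "export function handleAuth" src/auth.ts
run_eval feature-xyz unit-tests npm test -- --testPathPattern="auth"
run_eval feature-xyz build-check npm run build
```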

2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```

3. Human Grader

Flag for manual review:

```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```

Metrics

pass@k

"At least one success in k attempts"

  • pass@1: First attempt success rate
  • pass@3: Success within 3 attempts
  • Typical target: pass@3 > 90%

pass^k

"All k trials succeed"

  • Higher bar for reliability
  • pass^3: 3 consecutive successes
  • Use for critical paths
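
To see why pass^k is the stricter bar, assume independent attempts with per-attempt success rate p: pass@k = 1 - (1 - p)^k, while pass^k = p^k. A quick illustrative computation (the 67% rate is only an example figure, not a measurement from this skill):

```bash
# Illustrative only: the same per-attempt rate yields very different bars.
awk 'BEGIN {
  p = 0.67                                            # example pass@1 rate
  printf "pass@3 = %.1f%%\n", (1 - (1 - p)^3) * 100   # ~96.4%
  printf "pass^3 = %.1f%%\n", (p^3) * 100             # ~30.1%
}'
```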

Eval Workflow

1. Define (Before Coding)

```markdown
## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

2. Implement

Write code to pass the defined evals.

3. Evaluate

```bash
# Run capability evals
# [Run each capability eval, record PASS/FAIL]

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
```

4. Report

```markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user: PASS (pass@1)
  validate-email: PASS (pass@2)
  hash-password: PASS (pass@1)
  Overall: 3/3 passed

Regression Evals:
  login-flow: PASS
  session-mgmt: PASS
  logout-flow: PASS
  Overall: 3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
```

Integration Patterns

Pre-Implementation

/eval define feature-name

Creates eval definition file at .copilot/evals/feature-name.md

During Implementation

/eval check feature-name

Runs current evals and reports status

Post-Implementation

/eval report feature-name

Generates full eval report
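
The skill does not prescribe how these commands are implemented. As one possible shape, a check step could summarize the latest results recorded in the feature's run history (file layout per Eval Storage below; all names here are illustrative):

```bash
# Hypothetical sketch of an "/eval check" step: summarize the
# recorded PASS/FAIL lines for one feature's eval log.
feature="feature-xyz"
total=0; passed=0
while IFS= read -r line; do
  total=$((total + 1))
  case "$line" in
    *": PASS") passed=$((passed + 1)) ;;
  esac
done < ".copilot/evals/$feature.log"
echo "eval check $feature: $passed/$total passed"
```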

Eval Storage

Store evals in project:

.copilot/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
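
The skill does not fix a schema for baseline.json. One plausible shape, recording the baseline commit and per-test results for regression comparison, might look like the following (all field names and values are illustrative assumptions):

```json
{
  "baseline": "3f2a9c1",
  "recordedAt": "2026-02-04T00:00:00Z",
  "tests": {
    "login-flow": "PASS",
    "session-mgmt": "PASS",
    "logout-flow": "PASS"
  }
}
```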

Best Practices

  1. Define evals BEFORE coding - Forces clear thinking about success criteria
  2. Run evals frequently - Catch regressions early
  3. Track pass@k over time - Monitor reliability trends
  4. Use code graders when possible - Deterministic > probabilistic
  5. Human review for security - Never fully automate security checks
  6. Keep evals fast - Slow evals don't get run
  7. Version evals with code - Evals are first-class artifacts

Example: Adding Authentication

```markdown
## EVAL: add-authentication

### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session

Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible

### Phase 2: Implement (varies)
[Write code]

### Phase 3: Evaluate
Run: /eval check add-authentication

### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```

Related Skills

Looking for an alternative to eval-harness or another community skill for your workflow? Explore these related open-source skills.

View all

openclaw-release-maintainer

openclaw

Summary: 🦞 OpenClaw Release Maintainer. Use this skill for release and publish-time workflows. It covers ai, assistant, and crustacean workflows. This AI agent skill supports Claude Code, Cursor, and Windsurf workflows.

widget-generator

f

Summary: Generate customizable widget plugins for the prompts.chat feed system. The Widget Generator Skill guides creation of widget plugins for prompts.chat. It covers ai, artificial-intelligence, and awesome-list workflows. This AI agent skill supports Claude Code, Cursor, and Windsurf workflows.

flags

vercel

Summary: The React Framework. Use the Feature Flags skill when adding or changing framework feature flags in Next.js internals. It covers blog, browser, and compiler workflows. This AI agent skill supports Claude Code, Cursor, and Windsurf workflows.

pr-review

pytorch

Summary: Usage modes: if the user invokes /pr-review with no arguments, do not perform a review. It covers autograd, deep-learning, and gpu workflows. This AI agent skill supports Claude Code, Cursor, and Windsurf workflows.