eval-harness — community eval-harness, everything-github-copilot, community, ide skills, Claude Code, Cursor, Windsurf

v1.0.0

About this skill

rewrite everything-claude-code for github-copilot

j7-dev
Updated: 4/2/2026

Killer-Skills Review

Decision support comes first. Repository text comes second.

Reference-Only Page Review Score: 1/11

This page remains useful for operators, but Killer-Skills treats it as reference material instead of a primary organic landing page.

Review Score
1/11
Quality Score
34
Canonical Locale
en
Detected Body Locale
en


Why use this skill

rewrite everything-claude-code for github-copilot

Best suited for

Suitable for operator workflows that need explicit guardrails before installation and execution.

Use cases for eval-harness

! Safety and limitations

Why this page is reference-only

  • Current locale does not satisfy the locale-governance contract.
  • The page lacks a strong recommendation layer.
  • The page lacks concrete use-case guidance.
  • The page lacks explicit limitations or caution signals.
  • The underlying skill quality score is below the review floor.

Source Boundary

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.

Labs Demo

Browser Sandbox Environment

⚡️ Ready to unleash?

Experience this Agent in a zero-setup browser environment powered by WebContainers. No installation required.

Boot Container Sandbox

FAQ & Installation Steps

These questions and steps mirror the structured data on this page for better search understanding.

? Frequently Asked Questions

What is eval-harness?

rewrite everything-claude-code for github-copilot

How do I install eval-harness?

Run the command: npx killer-skills add j7-dev/everything-github-copilot/eval-harness. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

Which IDEs are compatible with eval-harness?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add j7-dev/everything-github-copilot/eval-harness. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use eval-harness immediately in the current project.

! Reference-Only Mode

This page remains useful for installation and reference, but Killer-Skills no longer treats it as a primary indexable landing page. Read the review above before relying on the upstream repository instructions.

Imported Repository Instructions

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.

Supporting Evidence

eval-harness

Install eval-harness, an AI agent skill for AI agent workflows and automation. Works with Claude Code, Cursor, and Windsurf with one-command setup.

SKILL.md
Readonly

Eval Harness Skill

A formal evaluation framework for Copilot CLI sessions, implementing eval-driven development (EDD) principles.

When to Activate

  • Setting up eval-driven development (EDD) for AI-assisted workflows
  • Defining pass/fail criteria for Copilot CLI task completion
  • Measuring agent reliability with pass@k metrics
  • Creating regression test suites for prompt or agent changes
  • Benchmarking agent performance across model versions

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

  • Define expected behavior BEFORE implementation
  • Run evals continuously during development
  • Track regressions with each change
  • Use pass@k metrics for reliability measurement

Eval Types

Capability Evals

Test if Claude can do something it couldn't before:

```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result
```

Regression Evals

Ensure changes don't break existing functionality:

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
  - existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

Grader Types

1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
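A small wrapper can run several code-based graders and print a uniform PASS/FAIL line for each. This is an illustrative sketch, not part of the skill: `run_grader` is a hypothetical helper, and the grader commands mirror the examples above (adjust paths for your project).

```shell
#!/bin/sh
# Sketch only: run_grader is a hypothetical helper, not something
# the skill ships. A zero exit status from the grader counts as PASS.
run_grader() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: PASS"
  else
    echo "$name: FAIL"
  fi
}

# Graders mirror the examples above; paths are project-specific.
run_grader "auth-export" grep -q "export function handleAuth" src/auth.ts
run_grader "unit-tests"  npm test -- --testPathPattern="auth"
run_grader "build"       npm run build
```

Because each grader reduces to an exit status, the same wrapper works for grep checks, test runs, and builds alike.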

2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```

3. Human Grader

Flag for manual review:

```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```

Metrics

pass@k

"At least one success in k attempts"

  • pass@1: First attempt success rate
  • pass@3: Success within 3 attempts
  • Typical target: pass@3 > 90%

pass^k

"All k trials succeed"

  • Higher bar for reliability
  • pass^3: 3 consecutive successes
  • Use for critical paths
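Both metrics can be checked mechanically once trial results are recorded. A minimal sketch in shell, assuming results are logged as `P`/`F` tokens; the function names are illustrative, not part of the skill:

```shell
#!/bin/sh
# Sketch: pass@k succeeds if at least one trial passed ("P" present);
# pass^k succeeds only if no trial failed ("F" absent).
pass_at_k() {
  case " $* " in
    *" P "*) return 0 ;;
    *)       return 1 ;;
  esac
}
pass_hat_k() {
  case " $* " in
    *" F "*) return 1 ;;
    *)       return 0 ;;
  esac
}

pass_at_k  P F F && echo "pass@3: met"      # one success in three attempts
pass_hat_k P F F || echo "pass^3: not met"  # would need three consecutive passes
```

The same trial log (`P F F`) satisfies pass@3 but not pass^3, which is why pass^k is the stricter bar for critical paths.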

Eval Workflow

1. Define (Before Coding)

```markdown
## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

2. Implement

Write code to pass the defined evals.

3. Evaluate

```bash
# Run capability evals
[Run each capability eval, record PASS/FAIL]

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
```

4. Report

```markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user: PASS (pass@1)
  validate-email: PASS (pass@2)
  hash-password: PASS (pass@1)
  Overall: 3/3 passed

Regression Evals:
  login-flow: PASS
  session-mgmt: PASS
  logout-flow: PASS
  Overall: 3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
```

Integration Patterns

Pre-Implementation

/eval define feature-name

Creates eval definition file at .copilot/evals/feature-name.md

During Implementation

/eval check feature-name

Runs current evals and reports status

Post-Implementation

/eval report feature-name

Generates full eval report
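While prototyping, the three commands above could be approximated with a plain shell helper. Everything below is a hypothetical sketch: `eval_cmd` and the file it writes are assumptions, not the skill's actual implementation (the real commands run inside the Copilot CLI):

```shell
#!/bin/sh
# Hypothetical stand-in for the /eval slash commands; illustrative only.
EVAL_DIR=".copilot/evals"

eval_cmd() {
  action=$1 feature=$2
  case "$action" in
    define)
      mkdir -p "$EVAL_DIR"
      printf '## EVAL DEFINITION: %s\n' "$feature" > "$EVAL_DIR/$feature.md"
      echo "created $EVAL_DIR/$feature.md"
      ;;
    check)
      if [ -f "$EVAL_DIR/$feature.md" ]; then
        echo "evals defined for $feature"
      else
        echo "no evals defined for $feature"
      fi
      ;;
    *)
      echo "usage: eval_cmd {define|check} feature-name" >&2
      return 2
      ;;
  esac
}

eval_cmd define demo-feature
eval_cmd check  demo-feature
```

Writing definitions under `.copilot/evals/` matches the storage layout the skill describes below.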

Eval Storage

Store evals in project:

.copilot/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
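One way to use `baseline.json` is to record the pass count at a known-good commit and compare later runs against it. A sketch under assumed file contents; the JSON shape here is an invention, since the skill does not specify the baseline format:

```shell
#!/bin/sh
# Sketch: record and read back a regression baseline.
# The {"sha": ..., "passed": ...} shape is an assumption.
BASE=".copilot/evals/baseline.json"

record_baseline() {  # record_baseline <sha> <passed-count>
  mkdir -p "$(dirname "$BASE")"
  printf '{"sha":"%s","passed":%d}\n' "$1" "$2" > "$BASE"
}

baseline_passed() {  # print the recorded pass count
  sed -n 's/.*"passed":\([0-9]*\).*/\1/p' "$BASE"
}

record_baseline abc123 3
[ "$(baseline_passed)" -ge 3 ] && echo "no regression against abc123"
```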

Best Practices

  1. Define evals BEFORE coding - Forces clear thinking about success criteria
  2. Run evals frequently - Catch regressions early
  3. Track pass@k over time - Monitor reliability trends
  4. Use code graders when possible - Deterministic > probabilistic
  5. Human review for security - Never fully automate security checks
  6. Keep evals fast - Slow evals don't get run
  7. Version evals with code - Evals are first-class artifacts

Example: Adding Authentication

```markdown
## EVAL: add-authentication

### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session

Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible

### Phase 2: Implement (varies)
[Write code]

### Phase 3: Evaluate
Run: /eval check add-authentication

### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```

Related skills

Looking for an alternative to eval-harness or another community skill for your workflow? Explore these related open-source skills.


openclaw-release-maintainer

openclaw

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

widget-generator

f

Create customizable widget plugins for the prompts.chat news-feed system

flags

vercel

The React Framework


pr-review

pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
