eval-harness — everything-github-copilot · community · IDE skills

v1.0.0

About This Skill

rewrite everything-claude-code for github-copilot

j7-dev
Updated: 4/2/2026

Killer-Skills Review

Decision support comes first. Repository text comes second.

Reference-Only Page Review Score: 1/11

This page remains useful for operators, but Killer-Skills treats it as reference material instead of a primary organic landing page.

Review Score: 1/11
Quality Score: 34
Canonical Locale: en
Detected Body Locale: en


Core Value

rewrite everything-claude-code for github-copilot

Supported Agent Types

Suitable for operator workflows that need explicit guardrails before installation and execution.

Key Capabilities Granted · eval-harness

! Usage Restrictions & Prerequisites

Why this page is reference-only

  • Current locale does not satisfy the locale-governance contract.
  • The page lacks a strong recommendation layer.
  • The page lacks concrete use-case guidance.
  • The page lacks explicit limitations or caution signals.
  • The underlying skill quality score is below the review floor.

Source Boundary

The section below is imported from the upstream repository and should be treated as secondary evidence. Use the Killer-Skills review above as the primary layer for fit, risk, and installation decisions.

Next Steps After Review

Decide on an action first, then continue with the upstream repository material

Killer-Skills' primary value should not stop at opening the repository README for you. It should first help you judge whether this skill is worth installing, whether it should go back to the trusted set for re-review, and whether it is already at the stage of being deployed into a workflow.

Lab Demo

Browser Sandbox Environment

⚡️ Ready to unleash?

Experience this Agent in a zero-setup browser environment powered by WebContainers. No installation required.


FAQ & Installation Steps

The questions and steps below are consistent with the page's structured data, so search engines can understand the page content.

? FAQ

What is eval-harness?

rewrite everything-claude-code for github-copilot

How do I install eval-harness?

Run the command: npx killer-skills add j7-dev/everything-github-copilot/eval-harness. 19+ IDEs/agents are supported, including Cursor, Windsurf, VS Code, and Claude Code.

Which IDEs or agents does eval-harness support?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. A single Killer-Skills CLI command installs it universally.

Installation Steps

  1. Open a terminal

    Open a terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add j7-dev/everything-github-copilot/eval-harness. The CLI auto-detects your IDE or AI agent and completes the configuration.

  3. Start using the skill

    eval-harness is now enabled and can be invoked immediately in the current project.

! Reference-Page Mode

This page can still be used as an installation and lookup reference, but Killer-Skills no longer treats it as a primary indexable landing page. Read the review conclusions above first, then decide whether to continue with the upstream repository material.

Upstream Repository Material


Upstream Source

eval-harness

Install eval-harness, an AI Agent Skill for AI agent workflows and automation. See the review conclusions, use cases, and installation paths.

SKILL.md
Readonly
Supporting Evidence

Eval Harness Skill

A formal evaluation framework for Copilot CLI sessions, implementing eval-driven development (EDD) principles.

When to Activate

  • Setting up eval-driven development (EDD) for AI-assisted workflows
  • Defining pass/fail criteria for Copilot CLI task completion
  • Measuring agent reliability with pass@k metrics
  • Creating regression test suites for prompt or agent changes
  • Benchmarking agent performance across model versions

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

  • Define expected behavior BEFORE implementation
  • Run evals continuously during development
  • Track regressions with each change
  • Use pass@k metrics for reliability measurement

Eval Types

Capability Evals

Test if Claude can do something it couldn't before:

```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
```

Regression Evals

Ensure changes don't break existing functionality:

```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```

Grader Types

1. Code-Based Grader

Deterministic checks using code:

```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"

# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"

# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
```
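A small runner can execute a set of such shell graders and tally PASS/FAIL by exit code. A minimal sketch (the function name and the grader commands passed to it are illustrative, not part of the skill):

```python
import subprocess

def run_code_graders(graders: dict[str, str]) -> dict[str, str]:
    """Run each shell-command grader; a grader PASSes iff its command exits 0.

    `graders` maps an eval name to a shell command, e.g. the grep,
    npm test, and npm build checks shown above.
    """
    results = {}
    for name, command in graders.items():
        proc = subprocess.run(command, shell=True, capture_output=True)
        results[name] = "PASS" if proc.returncode == 0 else "FAIL"
    return results
```

For example, `run_code_graders({"auth-export": 'grep -q "export function handleAuth" src/auth.ts'})` would report PASS only when the pattern is present, mirroring the one-liners above.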

2. Model-Based Grader

Use Claude to evaluate open-ended outputs:

```markdown
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?

Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
```

3. Human Grader

Flag for manual review:

```markdown
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
```

Metrics

pass@k

"At least one success in k attempts"

  • pass@1: First attempt success rate
  • pass@3: Success within 3 attempts
  • Typical target: pass@3 > 90%

pass^k

"All k trials succeed"

  • Higher bar for reliability
  • pass^3: 3 consecutive successes
  • Use for critical paths
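Both metrics can be estimated from n recorded trials with c successes. A minimal sketch using the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), and a naive independence assumption for pass^k (the function names are mine, not part of the skill):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one success in k attempts),
    given c successes observed over n trials (n >= k)."""
    if n - c < k:
        return 1.0  # too few observed failures to fill all k attempts
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Naive estimate of P(all k attempts succeed), treating attempts
    as independent with per-trial success rate c/n."""
    return (c / n) ** k
```

With 2 successes in 4 trials, pass_at_k(4, 2, 2) gives 5/6 while pass_hat_k(4, 2, 3) gives 0.125, illustrating why pass^k is the higher bar.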

Eval Workflow

1. Define (Before Coding)

```markdown
## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely

### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact

### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

2. Implement

Write code to pass the defined evals.

3. Evaluate

```bash
# Run capability evals
[Run each capability eval, record PASS/FAIL]

# Run regression evals
npm test -- --testPathPattern="existing"

# Generate report
```

4. Report

```markdown
EVAL REPORT: feature-xyz
========================

Capability Evals:
  create-user: PASS (pass@1)
  validate-email: PASS (pass@2)
  hash-password: PASS (pass@1)
  Overall: 3/3 passed

Regression Evals:
  login-flow: PASS
  session-mgmt: PASS
  logout-flow: PASS
  Overall: 3/3 passed

Metrics:
  pass@1: 67% (2/3)
  pass@3: 100% (3/3)

Status: READY FOR REVIEW
```

Integration Patterns

Pre-Implementation

/eval define feature-name

Creates eval definition file at .copilot/evals/feature-name.md

During Implementation

/eval check feature-name

Runs current evals and reports status

Post-Implementation

/eval report feature-name

Generates full eval report

Eval Storage

Store evals in project:

.copilot/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines

Best Practices

  1. Define evals BEFORE coding - Forces clear thinking about success criteria
  2. Run evals frequently - Catch regressions early
  3. Track pass@k over time - Monitor reliability trends
  4. Use code graders when possible - Deterministic > probabilistic
  5. Human review for security - Never fully automate security checks
  6. Keep evals fast - Slow evals don't get run
  7. Version evals with code - Evals are first-class artifacts

Example: Adding Authentication

```markdown
## EVAL: add-authentication

### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session

Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible

### Phase 2: Implement (varies)
[Write code]

### Phase 3: Evaluate
Run: /eval check add-authentication

### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
```

Related Skills

Looking for an alternative to eval-harness, or similar community skills to pair with it? Explore the related open-source skills below.

View All

openclaw-release-maintainer (openclaw)

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

333.8k · AI

widget-generator (f)

Generates customizable plugin widgets for the prompts.chat feedback system

149.6k · AI

flags (vercel)

React framework

138.4k · Browser

pr-review (pytorch)

Tensors and dynamic neural networks in Python with strong GPU acceleration

98.6k · Developer Tools