agent-evaluation — community skill (agent-studio)

v1.2.0

About This Skill

For AI agents that need advanced content-quality verification and scoring capabilities, such as Cursor, Windsurf, and Claude Code. An LLM-as-judge evaluation framework with a 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring AI-generated content quality with weighted composite scores and evidence citations.

oimiragieo
Updated: 3/2/2026

Killer-Skills Review

Decision support comes first. Repository text comes second.

Reference-Only Page Review Score: 9/11

This page remains useful for operators, but Killer-Skills treats it as reference material instead of a primary organic landing page.

Original recommendation layer · Concrete use-case guidance · Explicit limitations and caution · Quality floor passed for review
Review Score: 9/11
Quality Score: 64
Canonical Locale: en
Detected Body Locale: en


Core Value

Enables agents to systematically evaluate AI-generated content for accuracy, groundedness, coherence, completeness, and helpfulness, using weighted composite scores from an LLM-as-judge evaluation framework and tooling such as LangChain.

Supported Agent Types

For AI agents that need advanced content-quality verification and scoring capabilities, such as Cursor, Windsurf, and Claude Code.

Key Capabilities · agent-evaluation

Assess the accuracy and coherence of text generated by AutoGPT
Evaluate the completeness and helpfulness of Claude Code output
Verify the groundedness and relevance of content generated by Windsurf

! Usage Limitations and Thresholds

  • Requires integration with the AI agent's output-generation pipeline
  • Limited to text-based content evaluation
  • Relies on high-quality training data for the LLM-as-judge model

Why this page is reference-only

  • Current locale does not satisfy the locale-governance contract.

Source Boundary

The section below is imported from the upstream repository and should be treated as secondary evidence. Use the Killer-Skills review above as the primary layer for fit, risk, and installation decisions.

Next Steps After the Review

Decide on an action first, then continue to the upstream repository material.

The primary value of Killer-Skills should not stop at "opening the repository README for you". It should first help you judge whether this skill is worth installing, whether it should go back to your trusted set for re-review, and whether it is ready to be adopted into your workflow.

Lab Demo

Browser Sandbox Environment


Experience this Agent in a zero-setup browser environment powered by WebContainers. No installation required.


FAQ and Installation Steps

The questions and steps below match the page's structured data, so search engines can understand the page content.

? FAQ

What is agent-evaluation?

For AI agents that need advanced content-quality verification and scoring capabilities, such as Cursor, Windsurf, and Claude Code. An LLM-as-judge evaluation framework with a 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring AI-generated content quality with weighted composite scores and evidence citations.

How do I install agent-evaluation?

Run the command: npx killer-skills add oimiragieo/agent-studio. It supports 19+ IDEs/Agents, including Cursor, Windsurf, VS Code, and Claude Code.

What scenarios is agent-evaluation suited for?

Typical scenarios include: assessing the accuracy and coherence of text generated by AutoGPT, evaluating the completeness and helpfulness of Claude Code output, and verifying the groundedness and relevance of content generated by Windsurf.

Which IDEs or Agents does agent-evaluation support?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. A single Killer-Skills CLI command installs it universally.

What are the limitations of agent-evaluation?

It requires integration with the AI agent's output-generation pipeline; it is limited to text-based content evaluation; and it relies on high-quality training data for the LLM-as-judge model.

Installation Steps

  1. Open a terminal

    Open a terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add oimiragieo/agent-studio. The CLI will automatically detect your IDE or AI Agent and complete the configuration.

  3. Start using the skill

    agent-evaluation is now enabled and can be invoked immediately in the current project.

! Reference-Only Mode

This page can still serve as an installation and lookup reference, but Killer-Skills no longer treats it as a primary indexable landing page. Read the review conclusions above first, then decide whether to continue to the upstream repository material.

Upstream Repository Material


Upstream Source

agent-evaluation

Install agent-evaluation, an AI Agent Skill for AI agent workflows and automation. See the review verdict, use cases, and installation path.

SKILL.md
Readonly
Supporting Evidence

Agent Evaluation

Overview

LLM-as-judge evaluation framework that scores AI-generated content on 5 dimensions using a 1-5 rubric. Agents evaluate outputs, compute a weighted composite score, and emit a structured verdict with evidence citations.

Core principle: Systematic quality verification before claiming completion. Agent-studio currently has no way to verify agent output quality — this skill fills that gap.

When to Use

Always:

  • Before marking a task complete (pair with verification-before-completion)
  • After a plan is generated (evaluate plan quality)
  • After code review outputs (evaluate review quality)
  • During reflection cycles (evaluate agent responses)
  • When comparing multiple agent outputs

Don't Use:

  • For binary pass/fail checks (use verification-before-completion instead)
  • For security audits (use security-architect skill)
  • For syntax/lint checking (use pnpm lint:fix)

The 5-Dimension Rubric

Every evaluation scores all 5 dimensions on a 1-5 scale:

| Dimension    | Weight | What It Measures                                                                 |
| ------------ | ------ | -------------------------------------------------------------------------------- |
| Accuracy     | 30%    | Factual correctness; no hallucinations; claims are verifiable                     |
| Groundedness | 25%    | Claims are supported by citations, file references, or evidence from the codebase |
| Coherence    | 15%    | Logical flow; internally consistent; no contradictions                            |
| Completeness | 20%    | All required aspects addressed; no critical gaps                                  |
| Helpfulness  | 10%    | Actionable; provides concrete next steps; reduces ambiguity                       |

Scoring Scale (1-5)

| Score | Meaning                                                        |
| ----- | -------------------------------------------------------------- |
| 5     | Excellent — fully meets the dimension's criteria with no gaps  |
| 4     | Good — meets criteria with minor gaps                          |
| 3     | Adequate — partially meets criteria; some gaps present         |
| 2     | Poor — significant gaps or errors in this dimension            |
| 1     | Failing — does not meet the dimension's criteria               |

Execution Process

Step 1: Load the Output to Evaluate

Identify what is being evaluated:

- Agent response (text)
- Plan document (file path)
- Code review output (text/file)
- Skill invocation result (text)
- Task completion claim (TaskGet metadata)

Step 2: Score Each Dimension

For each of the 5 dimensions, provide:

  1. Score (1-5): The numeric score
  2. Evidence: Direct quote or file reference from the evaluated output
  3. Rationale: Why this score was given (1-2 sentences)
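A per-dimension result can be captured as a plain record with exactly these three fields. A minimal sketch, assuming a simple object shape; the field values, file reference, and variable name are hypothetical:

```javascript
// One dimension's result: numeric score, evidence citation, short rationale.
// The evidence string and file reference below are hypothetical examples.
const accuracyResult = {
  dimension: 'accuracy',
  score: 4, // 1-5 per the scoring scale
  evidence: 'src/auth/session.ts:42 verifies the claimed function signature',
  rationale: 'Claims check out against the codebase; one count was slightly off.',
};

console.log(`${accuracyResult.dimension}: ${accuracyResult.score}/5`); // → "accuracy: 4/5"
```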

Dimension 1: Accuracy

Checklist:
- [ ] Claims are factually correct (verify against codebase if possible)
- [ ] No hallucinated file paths, function names, or API calls
- [ ] Numbers and counts are accurate
- [ ] No contradictions with existing documentation

Dimension 2: Groundedness

Checklist:
- [ ] Claims cite specific files, line numbers, or task IDs
- [ ] Recommendations reference observable evidence
- [ ] No unsupported assertions ("this is probably X")
- [ ] Code examples use actual project patterns

Dimension 3: Coherence

Checklist:
- [ ] Logical flow from problem → analysis → recommendation
- [ ] No internal contradictions
- [ ] Terminology is consistent throughout
- [ ] Steps are in a rational order

Dimension 4: Completeness

Checklist:
- [ ] All required aspects of the task are addressed
- [ ] Edge cases are mentioned (if relevant)
- [ ] No critical gaps that would block action
- [ ] Follow-up steps are included

Dimension 5: Helpfulness

Checklist:
- [ ] Provides actionable next steps (not just observations)
- [ ] Concrete enough to act on without further clarification
- [ ] Reduces ambiguity rather than adding it
- [ ] Appropriate for the intended audience

Step 3: Compute Weighted Composite Score

composite = (accuracy × 0.30) + (groundedness × 0.25) + (completeness × 0.20) + (coherence × 0.15) + (helpfulness × 0.10)
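The formula above can be expressed as a small helper. A minimal sketch, assuming a plain score object keyed by dimension; the function and constant names are illustrative, not part of the skill:

```javascript
// Rubric weights from the 5-dimension table; they sum to 1.0.
const WEIGHTS = {
  accuracy: 0.30,
  groundedness: 0.25,
  completeness: 0.20,
  coherence: 0.15,
  helpfulness: 0.10,
};

// scores: an object with a 1-5 value per dimension, e.g. { accuracy: 5, ... }
function compositeScore(scores) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [dimension, weight]) => sum + scores[dimension] * weight,
    0,
  );
}

// Example: strong on accuracy, weaker on coherence and helpfulness.
const sample = { accuracy: 5, groundedness: 4, completeness: 4, coherence: 3, helpfulness: 3 };
console.log(compositeScore(sample).toFixed(2)); // → "4.05"
```

Because accuracy alone carries 30% of the weight, a single hallucination can drag an otherwise polished output below the GOOD threshold, which is the point of weighting rather than averaging.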

Step 4: Determine Verdict

| Composite Score | Verdict   | Action                               |
| --------------- | --------- | ------------------------------------ |
| 4.5 – 5.0       | EXCELLENT | Approve; proceed                     |
| 3.5 – 4.4       | GOOD      | Approve with minor notes             |
| 2.5 – 3.4       | ADEQUATE  | Request targeted improvements        |
| 1.5 – 2.4       | POOR      | Reject; requires significant rework  |
| 1.0 – 1.4       | FAILING   | Reject; restart task                 |
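The tier boundaries translate directly into a threshold lookup. A sketch; the function name is illustrative:

```javascript
// Map a weighted composite score (1.0-5.0) to its verdict tier.
function verdict(composite) {
  if (composite >= 4.5) return 'EXCELLENT';
  if (composite >= 3.5) return 'GOOD';
  if (composite >= 2.5) return 'ADEQUATE';
  if (composite >= 1.5) return 'POOR';
  return 'FAILING';
}

console.log(verdict(4.05)); // → "GOOD"
```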

Step 5: Emit Structured Verdict

Output the verdict in this format:

```markdown
## Evaluation Verdict

**Output Evaluated**: [Brief description of what was evaluated]
**Evaluator**: [Agent name / task ID]
**Date**: [ISO 8601 date]

### Dimension Scores

| Dimension     | Score | Weight | Weighted Score |
| ------------- | ----- | ------ | -------------- |
| Accuracy      | X/5   | 30%    | X.X            |
| Groundedness  | X/5   | 25%    | X.X            |
| Completeness  | X/5   | 20%    | X.X            |
| Coherence     | X/5   | 15%    | X.X            |
| Helpfulness   | X/5   | 10%    | X.X            |
| **Composite** |       |        | **X.X / 5.0**  |

### Evidence Citations

**Accuracy (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Groundedness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Completeness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Coherence (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Helpfulness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

### Verdict: [EXCELLENT | GOOD | ADEQUATE | POOR | FAILING]

**Summary**: [1-2 sentence overall assessment]

**Required Actions** (if verdict is ADEQUATE or worse):

1. [Specific improvement needed]
2. [Specific improvement needed]
```

Usage Examples

Evaluate a Plan Document

```javascript
// Load plan document
Read({ file_path: '.claude/context/plans/auth-design-plan-2026-02-21.md' });

// Evaluate against 5-dimension rubric
Skill({ skill: 'agent-evaluation' });
// Provide the plan content as the output to evaluate
```

Evaluate Agent Response Before Completion

```javascript
// Agent generates implementation summary
// Before marking task complete, evaluate the summary quality
Skill({ skill: 'agent-evaluation' });
// If composite < 3.5, request improvements before TaskUpdate(completed)
```

Evaluate Code Review Output

```javascript
// After code-reviewer runs, evaluate the review quality
Skill({ skill: 'agent-evaluation' });
// Ensures review is grounded in actual code evidence, not assertions
```

Batch Evaluation (comparing two outputs)

```javascript
// Evaluate output A
// Save verdict A
// Evaluate output B
// Save verdict B
// Compare composites → choose higher scoring output
```

Integration with Verification-Before-Completion

The recommended quality gate pattern:

```javascript
// Step 1: Do the work
// Step 2: Evaluate with agent-evaluation
Skill({ skill: 'agent-evaluation' });
// If verdict is POOR or FAILING → rework before proceeding
// If verdict is ADEQUATE or better → proceed to verification
// Step 3: Final gate
Skill({ skill: 'verification-before-completion' });
// Step 4: Mark complete
TaskUpdate({ taskId: 'X', status: 'completed' });
```

Iron Laws

  1. NO COMPLETION CLAIM WITHOUT EVALUATION EVIDENCE — If composite score < 2.5 (POOR or FAILING), rework the output before marking any task complete.
  2. ALWAYS score all 5 dimensions — never skip dimensions to save time; each dimension catches different failure modes (accuracy ≠ completeness ≠ groundedness).
  3. ALWAYS cite specific evidence for every dimension score — "Evidence: [file:line or direct quote]" is mandatory, not optional. Assertions without grounding are invalid.
  4. ALWAYS use the weighted composite: accuracy×0.30 + groundedness×0.25 + completeness×0.20 + coherence×0.15 + helpfulness×0.10. Never use a simple average.
  5. NEVER evaluate before the work is complete — evaluating incomplete outputs produces falsely low scores and wastes context budget.

Anti-Patterns

| Anti-Pattern                         | Why It Fails                                                 | Correct Approach                                         |
| ------------------------------------ | ------------------------------------------------------------ | -------------------------------------------------------- |
| Skipping dimensions to save time     | Each dimension catches different failures                    | Always score all 5 dimensions                             |
| No evidence citation per dimension   | Assertions without grounding are invalid                     | Quote specific text or file:line for every score          |
| Using simple average for composite   | Accuracy (30%) matters more than helpfulness (10%)           | Use the weighted composite formula                        |
| Only checking EXCELLENT vs FAILING   | ADEQUATE outputs need targeted improvements, not full rework | Use all 5 verdict tiers with appropriate action per tier  |
| Evaluating before work is done       | Incomplete outputs score falsely low                         | Evaluate completed outputs only                           |
| Treating evaluation as binary gate   | Quality is a spectrum; binary pass/fail loses nuance         | Use composite score + per-dimension breakdown together    |

Assigned Agents

This skill is used by:

  • qa — Primary: validates test outputs and QA reports before completion
  • code-reviewer — Supporting: evaluates code review quality
  • reflection-agent — Supporting: evaluates agent responses during reflection cycles

Memory Protocol (MANDATORY)

Before starting:

```bash
cat .claude/context/memory/learnings.md
```

Check for:

  • Previous evaluation scores for similar outputs
  • Known quality patterns in this codebase
  • Common failure modes for this task type

After completing:

  • Evaluation pattern found -> .claude/context/memory/learnings.md
  • Quality issue identified -> .claude/context/memory/issues.md
  • Decision about rubric weights -> .claude/context/memory/decisions.md
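Each of these write-backs is just an append to the corresponding file. A minimal sketch, with paths taken from the protocol above; the entry text is illustrative:

```shell
# Record a newly-found evaluation pattern in the learnings memory file.
# (The entry wording is a hypothetical example; only the path is prescribed.)
mkdir -p .claude/context/memory
echo "- evaluation pattern: plans without file references score low on groundedness" \
  >> .claude/context/memory/learnings.md

# Confirm the entry landed.
tail -n 1 .claude/context/memory/learnings.md
```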

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.
