agent-evaluation — community agent-evaluation, agent-studio, community, ide skills, Claude Code, Cursor, Windsurf

v1.2.0

About This Skill

Suited to AI agents that need advanced content-quality verification and scoring, such as those running in Cursor, Windsurf, and Claude Code. An LLM-as-judge evaluation framework with a 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring AI-generated content quality with weighted composite scores and evidence citations.

oimiragieo
Updated: 3/2/2026

Killer-Skills Review

Decision support comes first. Repository text comes second.

Reference-Only Page Review Score: 9/11

This page remains useful for operators, but Killer-Skills treats it as reference material instead of a primary organic landing page.

  • Original recommendation layer
  • Concrete use-case guidance
  • Explicit limitations and caution
  • Quality floor passed for review
Review Score: 9/11
Quality Score: 64
Canonical Locale: en
Detected Body Locale: en


Why Use This Skill

Enables agents to systematically evaluate the accuracy, groundedness, coherence, completeness, and helpfulness of AI-generated content using LLM-as-judge evaluation frameworks and protocols such as LangChain, producing weighted composite scores.

Recommendation

Suited to AI agents that need advanced content-quality verification and scoring, such as those running in Cursor, Windsurf, and Claude Code.

Practical Use Cases for agent-evaluation

Evaluate the accuracy and coherence of AutoGPT-generated text
Evaluate the completeness and helpfulness of Claude Code output
Verify the groundedness and relevance of Windsurf-generated content

! Security and Limitations

  • Requires integration with the AI agent's output-generation pipeline
  • Limited to evaluating text-based content
  • The LLM-as-judge model requires high-quality training data

Why this page is reference-only

  • Current locale does not satisfy the locale-governance contract.

Source Boundary

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.

Labs Demo

Browser Sandbox Environment

Experience this agent in a zero-setup browser environment powered by WebContainers. No installation required.

FAQ & Installation Steps

These questions and steps mirror the structured data on this page for better search understanding.

? Frequently Asked Questions

What is agent-evaluation?

Suited to AI agents that need advanced content-quality verification and scoring, such as those running in Cursor, Windsurf, and Claude Code. An LLM-as-judge evaluation framework with a 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring AI-generated content quality with weighted composite scores and evidence citations.

How do I install agent-evaluation?

Run the command: npx killer-skills add oimiragieo/agent-studio. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for agent-evaluation?

Key use cases include: evaluating the accuracy and coherence of AutoGPT-generated text, evaluating the completeness and helpfulness of Claude Code output, and verifying the groundedness and relevance of Windsurf-generated content.

Which IDEs are compatible with agent-evaluation?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for agent-evaluation?

Requires integration with the AI agent's output-generation pipeline. Limited to evaluating text-based content. The LLM-as-judge model requires high-quality training data.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add oimiragieo/agent-studio. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use agent-evaluation immediately in the current project.

! Reference-Only Mode

This page remains useful for installation and reference, but Killer-Skills no longer treats it as a primary indexable landing page. Read the review above before relying on the upstream repository instructions.

Imported Repository Instructions

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.

Supporting Evidence

agent-evaluation

Install agent-evaluation, an AI agent skill for AI agent workflows and automation. Works with Claude Code, Cursor, and Windsurf with one-command setup.

SKILL.md

Agent Evaluation

Overview

LLM-as-judge evaluation framework that scores AI-generated content on 5 dimensions using a 1-5 rubric. Agents evaluate outputs, compute a weighted composite score, and emit a structured verdict with evidence citations.

Core principle: Systematic quality verification before claiming completion. Agent-studio currently has no way to verify agent output quality — this skill fills that gap.

When to Use

Always:

  • Before marking a task complete (pair with verification-before-completion)
  • After a plan is generated (evaluate plan quality)
  • After code review outputs (evaluate review quality)
  • During reflection cycles (evaluate agent responses)
  • When comparing multiple agent outputs

Don't Use:

  • For binary pass/fail checks (use verification-before-completion instead)
  • For security audits (use security-architect skill)
  • For syntax/lint checking (use pnpm lint:fix)

The 5-Dimension Rubric

Every evaluation scores all 5 dimensions on a 1-5 scale:

| Dimension     | Weight | What It Measures                                                                 |
| ------------- | ------ | -------------------------------------------------------------------------------- |
| Accuracy      | 30%    | Factual correctness; no hallucinations; claims are verifiable                     |
| Groundedness  | 25%    | Claims are supported by citations, file references, or evidence from the codebase |
| Coherence     | 15%    | Logical flow; internally consistent; no contradictions                            |
| Completeness  | 20%    | All required aspects addressed; no critical gaps                                  |
| Helpfulness   | 10%    | Actionable; provides concrete next steps; reduces ambiguity                       |

Scoring Scale (1-5)

| Score | Meaning                                                        |
| ----- | -------------------------------------------------------------- |
| 5     | Excellent — fully meets the dimension's criteria with no gaps  |
| 4     | Good — meets criteria with minor gaps                          |
| 3     | Adequate — partially meets criteria; some gaps present         |
| 2     | Poor — significant gaps or errors in this dimension            |
| 1     | Failing — does not meet the dimension's criteria               |

Execution Process

Step 1: Load the Output to Evaluate

Identify what is being evaluated:

- Agent response (text)
- Plan document (file path)
- Code review output (text/file)
- Skill invocation result (text)
- Task completion claim (TaskGet metadata)

Step 2: Score Each Dimension

For each of the 5 dimensions, provide:

  1. Score (1-5): The numeric score
  2. Evidence: Direct quote or file reference from the evaluated output
  3. Rationale: Why this score was given (1-2 sentences)
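The three-part record above can be sketched as a small validator. This is illustrative only: the field names and the helper function are assumptions, not part of the skill; the skill itself prescribes only the three parts (score, evidence, rationale).

```javascript
// Hypothetical record shape for one dimension's result.
function validateDimensionScore(entry) {
  const { score, evidence, rationale } = entry;
  if (!Number.isInteger(score) || score < 1 || score > 5) {
    throw new Error(`score must be an integer from 1 to 5, got ${score}`);
  }
  if (!evidence || evidence.trim() === '') {
    throw new Error('evidence citation is mandatory (Iron Law 3)');
  }
  if (!rationale || rationale.trim() === '') {
    throw new Error('rationale (1-2 sentences) is required');
  }
  return true;
}

// Example record for the Accuracy dimension (illustrative values).
const accuracy = {
  score: 4,
  evidence: 'src/auth/login.ts:42 matches the cited function signature',
  rationale: 'Claims verified against the codebase; one minor count error.',
};
validateDimensionScore(accuracy); // returns true
```

Rejecting records with an empty evidence string enforces the evidence-citation rule mechanically rather than by convention.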

Dimension 1: Accuracy

Checklist:
- [ ] Claims are factually correct (verify against codebase if possible)
- [ ] No hallucinated file paths, function names, or API calls
- [ ] Numbers and counts are accurate
- [ ] No contradictions with existing documentation

Dimension 2: Groundedness

Checklist:
- [ ] Claims cite specific files, line numbers, or task IDs
- [ ] Recommendations reference observable evidence
- [ ] No unsupported assertions ("this is probably X")
- [ ] Code examples use actual project patterns

Dimension 3: Coherence

Checklist:
- [ ] Logical flow from problem → analysis → recommendation
- [ ] No internal contradictions
- [ ] Terminology is consistent throughout
- [ ] Steps are in a rational order

Dimension 4: Completeness

Checklist:
- [ ] All required aspects of the task are addressed
- [ ] Edge cases are mentioned (if relevant)
- [ ] No critical gaps that would block action
- [ ] Follow-up steps are included

Dimension 5: Helpfulness

Checklist:
- [ ] Provides actionable next steps (not just observations)
- [ ] Concrete enough to act on without further clarification
- [ ] Reduces ambiguity rather than adding it
- [ ] Appropriate for the intended audience

Step 3: Compute Weighted Composite Score

composite = (accuracy × 0.30) + (groundedness × 0.25) + (completeness × 0.20) + (coherence × 0.15) + (helpfulness × 0.10)
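The formula above can be sketched as a small helper. Only the weights come from the rubric; the function and object names are illustrative assumptions.

```javascript
// Weights from the 5-dimension rubric; they sum to 1.0.
const WEIGHTS = {
  accuracy: 0.30,
  groundedness: 0.25,
  completeness: 0.20,
  coherence: 0.15,
  helpfulness: 0.10,
};

// scores: object mapping each dimension name to a 1-5 integer.
function compositeScore(scores) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [dim, weight]) => sum + scores[dim] * weight,
    0,
  );
}

// Example: strong accuracy/groundedness, weaker coherence/helpfulness.
// 5*0.30 + 4*0.25 + 4*0.20 + 3*0.15 + 3*0.10 = 4.05
const composite = compositeScore({
  accuracy: 5,
  groundedness: 4,
  completeness: 4,
  coherence: 3,
  helpfulness: 3,
});
```

Keeping the weights in one object makes it easy to sanity-check that they sum to 1.0 before scoring.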

Step 4: Determine Verdict

| Composite Score | Verdict   | Action                              |
| --------------- | --------- | ----------------------------------- |
| 4.5 – 5.0       | EXCELLENT | Approve; proceed                    |
| 3.5 – 4.4       | GOOD      | Approve with minor notes            |
| 2.5 – 3.4       | ADEQUATE  | Request targeted improvements       |
| 1.5 – 2.4       | POOR      | Reject; requires significant rework |
| 1.0 – 1.4       | FAILING   | Reject; restart task                |
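The band table maps directly to a top-down threshold check. A minimal sketch, with the function name assumed:

```javascript
// Map a composite score (1.0-5.0) to the verdict tiers above.
// Band edges follow the table: 4.5+, 3.5+, 2.5+, 1.5+, else FAILING.
function verdictFor(composite) {
  if (composite >= 4.5) return 'EXCELLENT';
  if (composite >= 3.5) return 'GOOD';
  if (composite >= 2.5) return 'ADEQUATE';
  if (composite >= 1.5) return 'POOR';
  return 'FAILING';
}

verdictFor(4.05); // 'GOOD' → approve with minor notes
```

Checking bands from the top down means each branch needs only a lower bound, so the bands cannot overlap or leave gaps.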

Step 5: Emit Structured Verdict

Output the verdict in this format:

```markdown
## Evaluation Verdict

**Output Evaluated**: [Brief description of what was evaluated]
**Evaluator**: [Agent name / task ID]
**Date**: [ISO 8601 date]

### Dimension Scores

| Dimension     | Score | Weight | Weighted Score |
| ------------- | ----- | ------ | -------------- |
| Accuracy      | X/5   | 30%    | X.X            |
| Groundedness  | X/5   | 25%    | X.X            |
| Completeness  | X/5   | 20%    | X.X            |
| Coherence     | X/5   | 15%    | X.X            |
| Helpfulness   | X/5   | 10%    | X.X            |
| **Composite** |       |        | **X.X / 5.0**  |

### Evidence Citations

**Accuracy (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Groundedness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Completeness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Coherence (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Helpfulness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

### Verdict: [EXCELLENT | GOOD | ADEQUATE | POOR | FAILING]

**Summary**: [1-2 sentence overall assessment]

**Required Actions** (if verdict is ADEQUATE or worse):

1. [Specific improvement needed]
2. [Specific improvement needed]
```

Usage Examples

Evaluate a Plan Document

```javascript
// Load plan document
Read({ file_path: '.claude/context/plans/auth-design-plan-2026-02-21.md' });

// Evaluate against 5-dimension rubric
Skill({ skill: 'agent-evaluation' });
// Provide the plan content as the output to evaluate
```

Evaluate Agent Response Before Completion

```javascript
// Agent generates implementation summary
// Before marking task complete, evaluate the summary quality
Skill({ skill: 'agent-evaluation' });
// If composite < 3.5, request improvements before TaskUpdate(completed)
```

Evaluate Code Review Output

```javascript
// After code-reviewer runs, evaluate the review quality
Skill({ skill: 'agent-evaluation' });
// Ensures review is grounded in actual code evidence, not assertions
```

Batch Evaluation (comparing two outputs)

```javascript
// Evaluate output A
// Save verdict A
// Evaluate output B
// Save verdict B
// Compare composites → choose higher scoring output
```

Integration with Verification-Before-Completion

The recommended quality gate pattern:

```javascript
// Step 1: Do the work
// Step 2: Evaluate with agent-evaluation
Skill({ skill: 'agent-evaluation' });
// If verdict is POOR or FAILING → rework before proceeding
// If verdict is ADEQUATE or better → proceed to verification
// Step 3: Final gate
Skill({ skill: 'verification-before-completion' });
// Step 4: Mark complete
TaskUpdate({ taskId: 'X', status: 'completed' });
```

Iron Laws

  1. NO COMPLETION CLAIM WITHOUT EVALUATION EVIDENCE — If composite score < 2.5 (POOR or FAILING), rework the output before marking any task complete.
  2. ALWAYS score all 5 dimensions — never skip dimensions to save time; each dimension catches different failure modes (accuracy ≠ completeness ≠ groundedness).
  3. ALWAYS cite specific evidence for every dimension score — "Evidence: [file:line or direct quote]" is mandatory, not optional. Assertions without grounding are invalid.
  4. ALWAYS use the weighted composite: accuracy×0.30 + groundedness×0.25 + completeness×0.20 + coherence×0.15 + helpfulness×0.10. Never use a simple average.
  5. NEVER evaluate before the work is complete — evaluating incomplete outputs produces falsely low scores and wastes context budget.

Anti-Patterns

| Anti-Pattern                        | Why It Fails                                                 | Correct Approach                                       |
| ----------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------ |
| Skipping dimensions to save time    | Each dimension catches different failures                    | Always score all 5 dimensions                          |
| No evidence citation per dimension  | Assertions without grounding are invalid                     | Quote specific text or file:line for every score       |
| Using simple average for composite  | Accuracy (30%) matters more than helpfulness (10%)           | Use the weighted composite formula                     |
| Only checking EXCELLENT vs FAILING  | ADEQUATE outputs need targeted improvements, not full rework | Use all 5 verdict tiers with appropriate action per tier |
| Evaluating before work is done      | Incomplete outputs score falsely low                         | Evaluate completed outputs only                        |
| Treating evaluation as binary gate  | Quality is a spectrum; binary pass/fail loses nuance         | Use composite score + per-dimension breakdown together |

Assigned Agents

This skill is used by:

  • qa — Primary: validates test outputs and QA reports before completion
  • code-reviewer — Supporting: evaluates code review quality
  • reflection-agent — Supporting: evaluates agent responses during reflection cycles

Memory Protocol (MANDATORY)

Before starting:

```bash
cat .claude/context/memory/learnings.md
```

Check for:

  • Previous evaluation scores for similar outputs
  • Known quality patterns in this codebase
  • Common failure modes for this task type

After completing:

  • Evaluation pattern found -> .claude/context/memory/learnings.md
  • Quality issue identified -> .claude/context/memory/issues.md
  • Decision about rubric weights -> .claude/context/memory/decisions.md

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.

Related Skills

Looking for an alternative to agent-evaluation or another community skill for your workflow? Explore these related open-source skills.


openclaw-release-maintainer (openclaw)

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

333.8k · AI

widget-generator (f)

Generates customizable widget plugins for the prompts.chat feed system

149.6k · AI

flags (vercel)

The React framework

138.4k · Browser

pr-review (pytorch)

Tensors and dynamic neural networks in Python with strong GPU acceleration

98.6k · Developer