agent-evaluation — community skill from agent-studio for Claude Code, Cursor, Windsurf, and other IDEs

v1.2.0

About this skill

Ideal for AI agents that need advanced content-verification and quality-evaluation capabilities, such as Cursor, Windsurf, and Claude Code. An LLM-as-judge evaluation framework with a 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring AI-generated content quality with weighted composite scores and evidence citations.

oimiragieo
Updated: 3/2/2026

Killer-Skills Review

Decision support comes first. Repository text comes second.

Reference-Only Page Review Score: 9/11

This page remains useful for operators, but Killer-Skills treats it as reference material instead of a primary organic landing page.

  • Original recommendation layer
  • Concrete use-case guidance
  • Explicit limitations and caution
  • Quality floor passed for review
Review Score
9/11
Quality Score
64
Canonical Locale
en
Detected Body Locale
en


Why use this skill

Lets agents systematically evaluate AI-generated content for accuracy, groundedness, coherence, completeness, and helpfulness using a weighted composite score, building on LLM-as-judge evaluation frameworks and protocols such as LangChain's.

Best suited for

Ideal for AI agents that need advanced content-verification and quality-evaluation capabilities, such as Cursor, Windsurf, and Claude Code.

Practical use cases for agent-evaluation

Scoring text generated by AutoGPT for accuracy and coherence
Evaluating Claude Code outputs for completeness and helpfulness
Checking Windsurf-generated content for groundedness and relevance

! Safety and limitations

  • Requires integration with AI-agent output-generation pipelines
  • Limited to evaluating text content
  • Depends on high-quality data for training LLM-as-judge models

Why this page is reference-only

  • Current locale does not satisfy the locale-governance contract.

Source Boundary

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.

Labs Demo

Browser Sandbox Environment

⚡️ Ready to unleash?

Experience this Agent in a zero-setup browser environment powered by WebContainers. No installation required.

Boot Container Sandbox

FAQ & Installation Steps

These questions and steps mirror the structured data on this page for better search understanding.

? Frequently Asked Questions

What is agent-evaluation?

Ideal for AI agents that need advanced content-verification and quality-evaluation capabilities, such as Cursor, Windsurf, and Claude Code. An LLM-as-judge evaluation framework with a 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring AI-generated content quality with weighted composite scores and evidence citations.

How do I install agent-evaluation?

Run the command: npx killer-skills add oimiragieo/agent-studio. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for agent-evaluation?

Key use cases include: scoring text generated by AutoGPT for accuracy and coherence, evaluating Claude Code outputs for completeness and helpfulness, and checking Windsurf-generated content for groundedness and relevance.

Which IDEs are compatible with agent-evaluation?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for agent-evaluation?

Requires integration with AI-agent output-generation pipelines. Limited to evaluating text content. Depends on high-quality data for training LLM-as-judge models.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add oimiragieo/agent-studio. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use agent-evaluation immediately in the current project.

! Reference-Only Mode

This page remains useful for installation and reference, but Killer-Skills no longer treats it as a primary indexable landing page. Read the review above before relying on the upstream repository instructions.

Imported Repository Instructions

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.

Supporting Evidence

agent-evaluation

Install agent-evaluation, an AI agent skill for AI agent workflows and automation. Works with Claude Code, Cursor, and Windsurf with one-command setup.

SKILL.md
Readonly

Agent Evaluation

Overview

LLM-as-judge evaluation framework that scores AI-generated content on 5 dimensions using a 1-5 rubric. Agents evaluate outputs, compute a weighted composite score, and emit a structured verdict with evidence citations.

Core principle: Systematic quality verification before claiming completion. Agent-studio currently has no way to verify agent output quality — this skill fills that gap.

When to Use

Always:

  • Before marking a task complete (pair with verification-before-completion)
  • After a plan is generated (evaluate plan quality)
  • After code review outputs (evaluate review quality)
  • During reflection cycles (evaluate agent responses)
  • When comparing multiple agent outputs

Don't Use:

  • For binary pass/fail checks (use verification-before-completion instead)
  • For security audits (use security-architect skill)
  • For syntax/lint checking (use pnpm lint:fix)

The 5-Dimension Rubric

Every evaluation scores all 5 dimensions on a 1-5 scale:

| Dimension | Weight | What It Measures |
| --- | --- | --- |
| Accuracy | 30% | Factual correctness; no hallucinations; claims are verifiable |
| Groundedness | 25% | Claims are supported by citations, file references, or evidence from the codebase |
| Coherence | 15% | Logical flow; internally consistent; no contradictions |
| Completeness | 20% | All required aspects addressed; no critical gaps |
| Helpfulness | 10% | Actionable; provides concrete next steps; reduces ambiguity |

Scoring Scale (1-5)

| Score | Meaning |
| --- | --- |
| 5 | Excellent — fully meets the dimension's criteria with no gaps |
| 4 | Good — meets criteria with minor gaps |
| 3 | Adequate — partially meets criteria; some gaps present |
| 2 | Poor — significant gaps or errors in this dimension |
| 1 | Failing — does not meet the dimension's criteria |

Execution Process

Step 1: Load the Output to Evaluate

Identify what is being evaluated:

- Agent response (text)
- Plan document (file path)
- Code review output (text/file)
- Skill invocation result (text)
- Task completion claim (TaskGet metadata)

Step 2: Score Each Dimension

For each of the 5 dimensions, provide:

  1. Score (1-5): The numeric score
  2. Evidence: Direct quote or file reference from the evaluated output
  3. Rationale: Why this score was given (1-2 sentences)
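These three fields can be captured as a small record with a validity check before scoring proceeds. This is only a sketch; the field and function names below are illustrative assumptions, not part of the skill's defined interface:

```javascript
// One dimension's score record: integer score (1-5), evidence, rationale.
// Field names here are illustrative assumptions, not a defined schema.
const DIMENSIONS = ['accuracy', 'groundedness', 'coherence', 'completeness', 'helpfulness'];

function isValidDimensionScore(record) {
  return (
    Number.isInteger(record.score) &&
    record.score >= 1 &&
    record.score <= 5 &&
    typeof record.evidence === 'string' &&
    record.evidence.length > 0 &&
    typeof record.rationale === 'string' &&
    record.rationale.length > 0
  );
}

// An evaluation is complete only when all 5 dimensions carry a valid record.
function isCompleteEvaluation(scores) {
  return DIMENSIONS.every((name) => name in scores && isValidDimensionScore(scores[name]));
}
```

A check like this enforces Iron Law 3 mechanically: a score without non-empty evidence and rationale is rejected before any composite is computed.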

Dimension 1: Accuracy

Checklist:
- [ ] Claims are factually correct (verify against codebase if possible)
- [ ] No hallucinated file paths, function names, or API calls
- [ ] Numbers and counts are accurate
- [ ] No contradictions with existing documentation

Dimension 2: Groundedness

Checklist:
- [ ] Claims cite specific files, line numbers, or task IDs
- [ ] Recommendations reference observable evidence
- [ ] No unsupported assertions ("this is probably X")
- [ ] Code examples use actual project patterns

Dimension 3: Coherence

Checklist:
- [ ] Logical flow from problem → analysis → recommendation
- [ ] No internal contradictions
- [ ] Terminology is consistent throughout
- [ ] Steps are in a rational order

Dimension 4: Completeness

Checklist:
- [ ] All required aspects of the task are addressed
- [ ] Edge cases are mentioned (if relevant)
- [ ] No critical gaps that would block action
- [ ] Follow-up steps are included

Dimension 5: Helpfulness

Checklist:
- [ ] Provides actionable next steps (not just observations)
- [ ] Concrete enough to act on without further clarification
- [ ] Reduces ambiguity rather than adding it
- [ ] Appropriate for the intended audience

Step 3: Compute Weighted Composite Score

composite = (accuracy × 0.30) + (groundedness × 0.25) + (completeness × 0.20) + (coherence × 0.15) + (helpfulness × 0.10)
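A minimal sketch of this formula as a helper (the function name and score-object shape are assumptions for illustration):

```javascript
// Rubric weights from the table above; they sum to 1.0.
const WEIGHTS = {
  accuracy: 0.30,
  groundedness: 0.25,
  completeness: 0.20,
  coherence: 0.15,
  helpfulness: 0.10,
};

// scores: { accuracy: 1..5, groundedness: 1..5, ... } -> weighted composite on the 1..5 scale
function compositeScore(scores) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [dimension, weight]) => sum + scores[dimension] * weight,
    0
  );
}
```

For example, scores of accuracy 5, groundedness 4, completeness 3, coherence 2, and helpfulness 1 give 5×0.30 + 4×0.25 + 3×0.20 + 2×0.15 + 1×0.10 = 3.5, while a simple average of the same scores would be 3.0 — the weighting rewards strong accuracy and groundedness.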

Step 4: Determine Verdict

| Composite Score | Verdict | Action |
| --- | --- | --- |
| 4.5 – 5.0 | EXCELLENT | Approve; proceed |
| 3.5 – 4.4 | GOOD | Approve with minor notes |
| 2.5 – 3.4 | ADEQUATE | Request targeted improvements |
| 1.5 – 2.4 | POOR | Reject; requires significant rework |
| 1.0 – 1.4 | FAILING | Reject; restart task |
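The tier boundaries can be folded into a small lookup. This is a sketch; it treats each row's lower bound as an inclusive threshold, which covers values (such as 3.45) that fall between the rounded ranges in the table:

```javascript
// Map a weighted composite (1.0-5.0) to the verdict tiers in the table above.
function verdictFor(composite) {
  if (composite >= 4.5) return 'EXCELLENT';
  if (composite >= 3.5) return 'GOOD';
  if (composite >= 2.5) return 'ADEQUATE';
  if (composite >= 1.5) return 'POOR';
  return 'FAILING';
}
```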

Step 5: Emit Structured Verdict

Output the verdict in this format:

```markdown
## Evaluation Verdict

**Output Evaluated**: [Brief description of what was evaluated]
**Evaluator**: [Agent name / task ID]
**Date**: [ISO 8601 date]

### Dimension Scores

| Dimension     | Score | Weight | Weighted Score |
| ------------- | ----- | ------ | -------------- |
| Accuracy      | X/5   | 30%    | X.X            |
| Groundedness  | X/5   | 25%    | X.X            |
| Completeness  | X/5   | 20%    | X.X            |
| Coherence     | X/5   | 15%    | X.X            |
| Helpfulness   | X/5   | 10%    | X.X            |
| **Composite** |       |        | **X.X / 5.0**  |

### Evidence Citations

**Accuracy (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Groundedness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Completeness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Coherence (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

**Helpfulness (X/5)**:

> [Direct quote or file:line reference]
> Rationale: [Why this score]

### Verdict: [EXCELLENT | GOOD | ADEQUATE | POOR | FAILING]

**Summary**: [1-2 sentence overall assessment]

**Required Actions** (if verdict is ADEQUATE or worse):

1. [Specific improvement needed]
2. [Specific improvement needed]
```

Usage Examples

Evaluate a Plan Document

```javascript
// Load plan document
Read({ file_path: '.claude/context/plans/auth-design-plan-2026-02-21.md' });

// Evaluate against 5-dimension rubric
Skill({ skill: 'agent-evaluation' });
// Provide the plan content as the output to evaluate
```

Evaluate Agent Response Before Completion

```javascript
// Agent generates implementation summary
// Before marking task complete, evaluate the summary quality
Skill({ skill: 'agent-evaluation' });
// If composite < 3.5, request improvements before TaskUpdate(completed)
```

Evaluate Code Review Output

```javascript
// After code-reviewer runs, evaluate the review quality
Skill({ skill: 'agent-evaluation' });
// Ensures review is grounded in actual code evidence, not assertions
```

Batch Evaluation (comparing two outputs)

```javascript
// Evaluate output A
// Save verdict A
// Evaluate output B
// Save verdict B
// Compare composites → choose higher scoring output
```

Integration with Verification-Before-Completion

The recommended quality gate pattern:

```javascript
// Step 1: Do the work
// Step 2: Evaluate with agent-evaluation
Skill({ skill: 'agent-evaluation' });
// If verdict is POOR or FAILING → rework before proceeding
// If verdict is ADEQUATE or better → proceed to verification
// Step 3: Final gate
Skill({ skill: 'verification-before-completion' });
// Step 4: Mark complete
TaskUpdate({ taskId: 'X', status: 'completed' });
```

Iron Laws

  1. NO COMPLETION CLAIM WITHOUT EVALUATION EVIDENCE — If composite score < 2.5 (POOR or FAILING), rework the output before marking any task complete.
  2. ALWAYS score all 5 dimensions — never skip dimensions to save time; each dimension catches different failure modes (accuracy ≠ completeness ≠ groundedness).
  3. ALWAYS cite specific evidence for every dimension score — "Evidence: [file:line or direct quote]" is mandatory, not optional. Assertions without grounding are invalid.
  4. ALWAYS use the weighted composite: accuracy×0.30 + groundedness×0.25 + completeness×0.20 + coherence×0.15 + helpfulness×0.10. Never use a simple average.
  5. NEVER evaluate before the work is complete — evaluating incomplete outputs produces falsely low scores and wastes context budget.

Anti-Patterns

| Anti-Pattern | Why It Fails | Correct Approach |
| --- | --- | --- |
| Skipping dimensions to save time | Each dimension catches different failures | Always score all 5 dimensions |
| No evidence citation per dimension | Assertions without grounding are invalid | Quote specific text or file:line for every score |
| Using simple average for composite | Accuracy (30%) matters more than helpfulness (10%) | Use the weighted composite formula |
| Only checking EXCELLENT vs FAILING | ADEQUATE outputs need targeted improvements, not full rework | Use all 5 verdict tiers with appropriate action per tier |
| Evaluating before work is done | Incomplete outputs score falsely low | Evaluate completed outputs only |
| Treating evaluation as binary gate | Quality is a spectrum; binary pass/fail loses nuance | Use composite score + per-dimension breakdown together |

Assigned Agents

This skill is used by:

  • qa — Primary: validates test outputs and QA reports before completion
  • code-reviewer — Supporting: evaluates code review quality
  • reflection-agent — Supporting: evaluates agent responses during reflection cycles

Memory Protocol (MANDATORY)

Before starting:

```bash
cat .claude/context/memory/learnings.md
```

Check for:

  • Previous evaluation scores for similar outputs
  • Known quality patterns in this codebase
  • Common failure modes for this task type

After completing:

  • Evaluation pattern found -> .claude/context/memory/learnings.md
  • Quality issue identified -> .claude/context/memory/issues.md
  • Decision about rubric weights -> .claude/context/memory/decisions.md

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.

Related skills

Looking for an alternative to agent-evaluation or another community skill for your workflow? Explore these related open-source skills.


openclaw-release-maintainer

openclaw

Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

widget-generator

f

Building customizable widget plugins for the prompts.chat feed system

flags

vercel

The React framework


pr-review

pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
