eval — evaluation harness and AI decision metrics for Claude Code (Thoughtbox)

v1.0.0

About this skill

The eval skill provides a comprehensive evaluation harness for AI agents, enabling developers to track and analyze AI decision-making metrics. By leveraging this skill, developers can optimize their AI coding workflows and improve overall efficiency.

Capabilities

Parse arguments from `$ARGUMENTS`
Show current session metrics using `metrics` command
Set or update baselines using `baseline` command
Compare sessions using `compare` command
Generate weekly evaluation reports using `report` command
Capture current session metrics using `capture` command

# Core Topics

Kastalien-Research
Updated: 3/29/2026

Killer-Skills Review

Decision support comes first. Repository text comes second.

Reference-Only Page Review Score: 2/11

This page remains useful for operators, but Killer-Skills treats it as reference material instead of a primary organic landing page.

Review Score: 2/11
Quality Score: 47
Canonical Locale: en
Detected Body Locale: en


Why use this skill

Use eval when you want quantitative feedback on Claude Code sessions: it tracks commits, test results, beads, files changed, and pattern usage, then compares each session against stored baselines to flag regressions and improvements.

Best suited for

Suitable for operator workflows that need explicit guardrails before installation and execution.

Use cases for eval

! Safety and limitations

Why this page is reference-only

  • Current locale does not satisfy the locale-governance contract.
  • The page lacks a strong recommendation layer.
  • The page lacks concrete use-case guidance.
  • The page lacks explicit limitations or caution signals.
  • The underlying skill quality score is below the review floor.

Source Boundary

The section below is supporting source material from the upstream repository. Use the Killer-Skills review above as the primary decision layer.


FAQ & Installation Steps


? Frequently Asked Questions

What is eval?

eval is a community skill for Claude Code that captures per-session metrics (commits, tests, beads closed, files changed, pattern usage), compares them against stored baselines, and generates weekly evaluation reports.

How do I install eval?

Run `npx killer-skills add Kastalien-Research/thoughtbox`. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

Which IDEs are compatible with eval?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run `npx killer-skills add Kastalien-Research/thoughtbox`. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use eval immediately in the current project.

! Reference-Only Mode

This page remains useful for installation and reference, but Killer-Skills no longer treats it as a primary indexable landing page. Read the review above before relying on the upstream repository instructions.

Imported Repository Instructions


Supporting Evidence

eval

Unlock AI decision-making insights with eval, an AI agent skill for Claude Code. Evaluate metrics, track progress, and optimize AI coding workflows.

SKILL.md

Evaluation harness: $ARGUMENTS

Commands

Parse the first word of $ARGUMENTS to determine the command:
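The first-word dispatch can be sketched as follows. This is a minimal sketch: the command set comes from this document, while the function name and return shape are illustrative.

```python
# Sketch of first-word command dispatch for $ARGUMENTS.
COMMANDS = {"metrics", "baseline", "compare", "report", "capture"}

def parse_command(arguments: str):
    """Return (command, remaining args), or (None, parts) if unrecognized."""
    parts = arguments.split()
    if not parts or parts[0] not in COMMANDS:
        return None, parts
    return parts[0], parts[1:]
```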

metrics — Show current session metrics

Collect and display metrics for the current session:

  1. Count commits: `git log --oneline --since="today" | wc -l`
  2. Count test results: check for recent vitest output or `.eval/metrics/` entries
  3. Count beads changes: `bd list --status=closed` recently
  4. Token usage: check the LangSmith state file if available
  5. Pattern usage: check `.dgm/fitness.json` for patterns used this session
  6. Session duration: check the session start time from logs
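Step 1 can be made best-effort in code. This sketch shells out to git and treats failure as missing data rather than an error, matching the skill's best-effort collection policy; the function name is illustrative.

```python
import subprocess

def count_commits_today():
    """Best-effort count of today's commits; None means 'missing data', not an error."""
    try:
        out = subprocess.run(
            ["git", "log", "--oneline", "--since=today"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    # One line per commit in --oneline output.
    return len(out.splitlines())
```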

Display as:

## Current Session Metrics

| Metric | Value | Baseline | Delta |
|--------|-------|----------|-------|
| Commits | 5 | 3.2 avg | +56% |
| Tests passing | 42/42 | 40/42 | +2 |
| Beads closed | 3 | 2.1 avg | +43% |
| Files changed | 12 | 8.5 avg | +41% |
| Patterns used | 7 | 5.3 avg | +32% |

baseline — Set or update baselines

  1. Read the last N session metric snapshots from .eval/metrics/
  2. Calculate averages for each metric
  3. Write to .eval/baselines.json
  4. Report what changed
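Steps 1–3 can be sketched as below, assuming each snapshot stores numeric values under a `"metrics"` key as in the capture format later in this document. The function name and `last_n` default are illustrative.

```python
import json
from pathlib import Path

def update_baselines(metrics_dir, out_path, last_n=10):
    """Average the last N session snapshots and write the baselines file."""
    snapshots = sorted(Path(metrics_dir).glob("session-*.json"))[-last_n:]
    totals, counts = {}, {}
    for snap in snapshots:
        for name, value in json.loads(snap.read_text()).get("metrics", {}).items():
            totals[name] = totals.get(name, 0) + value
            counts[name] = counts.get(name, 0) + 1
    baselines = {name: totals[name] / counts[name] for name in totals}
    Path(out_path).write_text(json.dumps(baselines, indent=2))
    return baselines
```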

compare — Compare sessions

Usage: `compare --last N` or `compare --session <id>`

  1. Load metric snapshots from .eval/metrics/
  2. Compare against baselines
  3. Highlight regressions (metric dropped >10% below baseline)
  4. Highlight improvements (metric improved >10% above baseline)
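The ±10% thresholds in steps 3–4 reduce to a small classifier; this is a sketch, and the real output formatting is up to the skill. It also reproduces the Delta column in the metrics table: commits at 5 against a 3.2 baseline is (5 − 3.2) / 3.2 ≈ +56%, an improvement.

```python
def classify(value, baseline, threshold=0.10):
    """Flag a metric: >10% below baseline is a regression, >10% above an improvement."""
    if baseline == 0:
        return "no-baseline"
    delta = (value - baseline) / baseline
    if delta < -threshold:
        return "regression"
    if delta > threshold:
        return "improvement"
    return "stable"
```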

report — Generate weekly evaluation report

  1. Load all metrics from the past 7 days
  2. Calculate trends (improving, stable, declining)
  3. Identify top improvements and top regressions
  4. Generate recommendations based on trends
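The trend labels in step 2 can be sketched by comparing the mean of the first and second halves of the week's values. The 5% tolerance is an assumption; the document does not fix one.

```python
def trend(values, tolerance=0.05):
    """Label a metric series as improving, stable, or declining."""
    if len(values) < 2:
        return "stable"
    half = len(values) // 2
    first = sum(values[:half]) / half
    second = sum(values[half:]) / (len(values) - half)
    if first == 0:
        return "improving" if second > 0 else "stable"
    change = (second - first) / abs(first)
    if change > tolerance:
        return "improving"
    if change < -tolerance:
        return "declining"
    return "stable"
```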

capture — Capture current session metrics

Write a metric snapshot to .eval/metrics/session-{timestamp}.json:

```json
{
  "session_id": "<session id>",
  "timestamp": "<ISO 8601>",
  "branch": "<git branch>",
  "metrics": {
    "commits": 0,
    "tests_total": 0,
    "tests_passing": 0,
    "beads_closed": 0,
    "beads_created": 0,
    "files_changed": 0,
    "patterns_referenced": 0,
    "assumptions_verified": 0,
    "escalations": 0,
    "spiral_detections": 0
  },
  "qualitative": {
    "session_focus": "<what the session was about>",
    "memory_usefulness": 0,
    "knowledge_gaps_found": []
  }
}
```
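Writing that snapshot can be sketched as below, with a best-effort branch lookup and a filename that follows the session-{timestamp}.json convention; the function names and timestamp format are illustrative.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def current_branch():
    """Best-effort git branch; 'unknown' when git is unavailable."""
    try:
        return subprocess.run(
            ["git", "branch", "--show-current"],
            capture_output=True, text=True, check=True,
        ).stdout.strip() or "unknown"
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

def capture_snapshot(session_id, metrics, qualitative, out_dir=".eval/metrics"):
    """Write a session-{timestamp}.json snapshot and return its path."""
    now = datetime.now(timezone.utc)
    snapshot = {
        "session_id": session_id,
        "timestamp": now.isoformat(),
        "branch": current_branch(),
        "metrics": metrics,
        "qualitative": qualitative,
    }
    path = Path(out_dir) / f"session-{now.strftime('%Y%m%dT%H%M%SZ')}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(snapshot, indent=2))
    return path
```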

Notes

  • If `.eval/baselines.json` doesn't exist, skip baseline comparisons and suggest running `baseline`
  • Metric collection should be best-effort — missing data is noted, not an error
  • Regressions trigger a structured escalation suggestion (not automatic action)

Related skills

Looking for an alternative to eval or another community skill for your workflow? Explore these related open-source skills.


openclaw-release-maintainer (openclaw): Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞

widget-generator (f): Create customizable widget plugins for the prompts.chat news-feed system

flags (vercel): The React Framework

pr-review (pytorch): Tensors and Dynamic neural networks in Python with strong GPU acceleration