eval — a Claude Code skill from the thoughtbox repository (community). Tags: ai-agents, claude-code, ide-skills, model-context-protocol, observability, reasoning

v1.0.0

About this skill

Best suited for AI agents that need an evaluation harness. Thoughtbox is an intention ledger for agents; it covers ai-agents, claude, and claude-code workflows. This skill supports Claude Code, Cursor, and Windsurf.


Repository: Kastalien-Research · Updated: 3/29/2026

Killer-Skills Review

Decision support comes first. Repository text comes second.

Reference-Only Page Review Score: 10/11

This page remains useful for teams, but Killer-Skills treats it as reference material instead of a primary organic landing page.

Review criteria passed:

  • Original recommendation layer
  • Concrete use-case guidance
  • Explicit limitations and caution
  • Quality floor passed for review

Review Score: 10/11
Quality Score: 57
Canonical Locale: en
Detected Body Locale: en


Why use this skill

eval helps agents run an evaluation harness. Thoughtbox is an intention ledger for agents. This skill supports Claude Code, Cursor, and Windsurf workflows.

Best for

Ideal for AI agents that need an evaluation harness.

Actionable use cases for eval

  • Use case: running the evaluation harness on $ARGUMENTS
  • Use case: parsing the first word of $ARGUMENTS to dispatch a command
  • Use case: showing current session metrics with the metrics command

! Security and Limitations

  • Limitation: Requires repository-specific context from the skill documentation
  • Limitation: Works best when the underlying tools and dependencies are already configured

Why this page is reference-only

  • Current locale does not satisfy the locale-governance contract.

Source Boundary

The section below is imported from the upstream repository and should be treated as secondary evidence. Use the Killer-Skills review above as the primary layer for fit, risk, and installation decisions.

After The Review

Decide The Next Action Before You Keep Reading Repository Material

Killer-Skills should not stop at opening repository instructions. It should help you decide whether to install this skill, when to cross-check against trusted collections, and when to move into workflow rollout.


FAQ & Installation Steps

These questions and steps mirror the structured data on this page for better search understanding.

? Frequently Asked Questions

What is eval?

eval is ideal for AI agents that need an evaluation harness. Thoughtbox is an intention ledger for agents; it covers ai-agents, claude, and claude-code workflows. This skill supports Claude Code, Cursor, and Windsurf.

How do I install eval?

Run the command: npx killer-skills add Kastalien-Research/thoughtbox/eval. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for eval?

Key use cases include: running the evaluation harness on $ARGUMENTS, parsing the first word of $ARGUMENTS to determine the command, and showing current session metrics with the metrics command.

Which IDEs are compatible with eval?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for eval?

Limitations: the skill requires repository-specific context from the skill documentation, and it works best when the underlying tools and dependencies are already configured.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add Kastalien-Research/thoughtbox/eval. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use eval immediately in the current project.

! Reference-Only Mode

This page remains useful for installation and reference, but Killer-Skills no longer treats it as a primary indexable landing page. Read the review above before relying on the upstream repository instructions.

Upstream Repository Material


Upstream Source

eval

Thoughtbox is an intention ledger for agents. It covers ai-agents, claude, and claude-code workflows. This AI agent skill supports Claude Code, Cursor, and Windsurf workflows.

SKILL.md
Supporting Evidence

Evaluation harness: $ARGUMENTS

Commands

Parse the first word of $ARGUMENTS to determine the command:

metrics — Show current session metrics

Collect and display metrics for the current session:

  1. Count commits: git log --oneline --since="today" | wc -l
  2. Count test results: check for recent vitest output or .eval/metrics/ entries
  3. Token usage: check LangSmith state file if available
  4. Pattern usage: check .dgm/fitness.json for patterns used this session
  5. Session duration: check session start time from logs
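Step 1 above shells out to git and counts lines with `wc -l`. A minimal Python sketch of the same idea follows; the helper names are mine, not part of the skill, and the line counting is split out purely so it can be checked without a git repository:

```python
import subprocess

def count_lines(output: str) -> int:
    # Mirrors `wc -l` on git's --oneline output: one line per commit.
    return sum(1 for line in output.splitlines() if line.strip())

def commits_today() -> int:
    # Step 1: count today's commits. Assumes git is on PATH and the
    # working directory is inside a repository.
    result = subprocess.run(
        ["git", "log", "--oneline", "--since=today"],
        capture_output=True, text=True, check=True,
    )
    return count_lines(result.stdout)
```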

Display as:

## Current Session Metrics

| Metric | Value | Baseline | Delta |
|--------|-------|----------|-------|
| Commits | 5 | 3.2 avg | +56% |
| Tests passing | 42/42 | 40/42 | +2 |
| Files changed | 12 | 8.5 avg | +41% |
| Patterns used | 7 | 5.3 avg | +32% |
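Assuming the Delta column is percentage change against the numeric baseline (the `fmt_row` helper name is mine, not the skill's), a row of that table could be rendered as:

```python
def fmt_row(name: str, value: float, baseline: float) -> str:
    # Delta is the percentage change relative to the baseline average.
    delta = (value - baseline) / baseline * 100
    return f"| {name} | {value} | {baseline} avg | {delta:+.0f}% |"
```

For example, `fmt_row("Commits", 5, 3.2)` reproduces the "+56%" row above.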

baseline — Set or update baselines

  1. Read the last N session metric snapshots from .eval/metrics/
  2. Calculate averages for each metric
  3. Write to .eval/baselines.json
  4. Report what changed

compare — Compare sessions

Usage: compare --last N or compare --session <id>

  1. Load metric snapshots from .eval/metrics/
  2. Compare against baselines
  3. Highlight regressions (metric dropped >10% below baseline)
  4. Highlight improvements (metric improved >10% above baseline)
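The ±10% rule in steps 3–4 could be expressed like this (the function name and return labels are mine):

```python
def classify(value: float, baseline: float, threshold: float = 0.10) -> str:
    # >10% below baseline is a regression; >10% above is an improvement.
    if baseline == 0:
        return "no-baseline"  # avoid dividing by zero; flag for review
    change = (value - baseline) / baseline
    if change < -threshold:
        return "regression"
    if change > threshold:
        return "improvement"
    return "stable"
```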

report — Generate weekly evaluation report

  1. Load all metrics from the past 7 days
  2. Calculate trends (improving, stable, declining)
  3. Identify top improvements and top regressions
  4. Generate recommendations based on trends
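The spec does not pin down how a trend is computed; one simple reading, sketched below as an assumption, compares the mean of the later half of the week's series against the earlier half:

```python
def trend(values: list, threshold: float = 0.10) -> str:
    # Step 2 (one possible interpretation): label the direction of change
    # between the earlier and later halves of a metric series.
    if len(values) < 2:
        return "stable"
    mid = len(values) // 2
    earlier = sum(values[:mid]) / mid
    later = sum(values[mid:]) / (len(values) - mid)
    if earlier == 0:
        return "stable"
    change = (later - earlier) / earlier
    if change > threshold:
        return "improving"
    if change < -threshold:
        return "declining"
    return "stable"
```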

capture — Capture current session metrics

Write a metric snapshot to .eval/metrics/session-{timestamp}.json:

```json
{
  "session_id": "<session id>",
  "timestamp": "<ISO 8601>",
  "branch": "<git branch>",
  "metrics": {
    "commits": 0,
    "tests_total": 0,
    "tests_passing": 0,
    "files_changed": 0,
    "patterns_referenced": 0,
    "assumptions_verified": 0,
    "escalations": 0,
    "spiral_detections": 0
  },
  "qualitative": {
    "session_focus": "<what the session was about>",
    "memory_usefulness": 0,
    "knowledge_gaps_found": []
  }
}
```
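Writing the snapshot could look like the sketch below. The exact `{timestamp}` format is not specified upstream, so the compact filename-safe stamp here is an assumption, as is the function name:

```python
import json
import os
import time

def capture_snapshot(session_id: str, branch: str, metrics: dict,
                     qualitative: dict, ts: str = "") -> str:
    # Build the snapshot in the documented shape and write it to
    # .eval/metrics/session-{timestamp}.json, returning the path.
    ts = ts or time.strftime("%Y-%m-%dT%H%M%S")
    snapshot = {
        "session_id": session_id,
        "timestamp": ts,
        "branch": branch,
        "metrics": metrics,
        "qualitative": qualitative,
    }
    os.makedirs(".eval/metrics", exist_ok=True)
    path = f".eval/metrics/session-{ts}.json"
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)
    return path
```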

Notes

  • If .eval/baselines.json doesn't exist, skip baseline comparisons and suggest running baseline
  • Metric collection should be best-effort — missing data is noted, not an error
  • Regressions trigger a structured escalation suggestion (not automatic action)
