summarize-experiment — community skill from cruijff_kit

v1.0.0

About this Skill

Suited to research agents that need automated experiment-result analysis and summary generation. Part of cruijff_kit, a toolkit for conducting social research with LLMs.

niznik-dev
Updated: 3/11/2026

Killer-Skills Review

Decision support comes first; repository text comes second.

Killer-Skills keeps this page indexable because it adds recommendation, limitations, and review signals beyond the upstream repository text.

Review checks passed:

  • Original recommendation layer
  • Concrete use-case guidance
  • Explicit limitations and caution
  • Quality floor passed for review
  • Locale and body language aligned

Review Score: 9/11 | Quality Score: 55 | Canonical Locale: en | Detected Body Locale: en

Core Value

Enables agents to parse experiment run status from YAML files, extract final training loss from SLURM stdout and evaluation accuracy from inspect-ai .eval files, and generate a comprehensive summary.md. Part of cruijff_kit, a toolkit for conducting social research with LLMs.

Ideal Agent Persona

Perfect for Research Agents needing automated experiment result analysis and summary generation.

Capabilities Granted for summarize-experiment

Automating experiment result summarization for researchers
Generating summary.md files for experiment directories
Extracting key metrics such as training loss and evaluation accuracy from experiment outputs

Prerequisites & Limits

  • Requires experiment_summary.yaml file to exist
  • Needs Conda environment activated with inspect-ai installed
  • Limited to experiments with completed runs and available SLURM outputs

Source Boundary

The section below is imported from the upstream repository and should be treated as secondary evidence. Use the Killer-Skills review above as the primary layer for fit, risk, and installation decisions.


FAQ & Installation Steps


Frequently Asked Questions

What is summarize-experiment?

summarize-experiment is a community skill from cruijff_kit that gives research agents automated experiment-result analysis and summary generation for LLM-based social research.

How do I install summarize-experiment?

Run the command: npx killer-skills add niznik-dev/cruijff_kit/summarize-experiment. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for summarize-experiment?

Key use cases include automating experiment result summarization for researchers, generating summary.md files for experiment directories, and extracting key metrics such as training loss and evaluation accuracy from experiment outputs.

Which IDEs are compatible with summarize-experiment?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for summarize-experiment?

Requires experiment_summary.yaml file to exist. Needs Conda environment activated with inspect-ai installed. Limited to experiments with completed runs and available SLURM outputs.

How To Install

  1. Open your terminal

    Open the terminal or command line in your project directory.

  2. Run the install command

    Run: npx killer-skills add niznik-dev/cruijff_kit/summarize-experiment. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

    The skill is now active. Your AI agent can use summarize-experiment immediately in the current project.

Upstream Repository Material

SKILL.md

Summarize Experiment

Generate a summary.md file capturing key metrics from a completed experiment. Think R's summary() for experiment results.

Your Task

Create a lightweight summary of experiment results:

  1. Parse run status from experiment_summary.yaml
  2. Extract final training loss from SLURM stdout
  3. Extract accuracy from inspect-ai .eval files
  4. Generate summary.md in experiment directory
  5. Log the process in logs/summarize-experiment.log

Prerequisites

  • experiment_summary.yaml exists
  • At least some runs have completed (partial results acceptable)
  • run-experiment has been executed (or manual SLURM jobs run)
  • Conda environment activated: the parse_eval_log.py script requires inspect-ai, so activate the conda environment from claude.local.md before running extraction commands.

Workflow

1. Locate Experiment

Find the experiment directory:

  • If in an experiment directory (contains experiment_summary.yaml): use current directory
  • Otherwise: ask user for path
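That resolution logic amounts to a few lines; the following is a minimal sketch (the function name is illustrative, not defined by the skill):

```python
from pathlib import Path

def locate_experiment(start):
    """Return `start` if it contains experiment_summary.yaml, else None
    so the caller can prompt the user for a path instead."""
    start = Path(start)
    if (start / "experiment_summary.yaml").is_file():
        return start
    return None
```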

2. Parse Run Status

Read experiment_summary.yaml to identify runs:

From runs: section:

  • name: Run identifier
  • type: "fine-tuned" or "control"
  • model: Model name
  • parameters: Dict of hyperparameters (empty for control runs)

From evaluation.matrix: section:

  • run: Run name
  • tasks: List of evaluation task names
  • epochs: List of epochs to evaluate (null for control runs)

Determine status by checking filesystem:

  • Fine-tuning: Check for {output_base}/ck-out-{run_name}/ and SLURM outputs
  • Evaluation: Check for {run_dir}/eval/logs/*.eval files
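Pulling those records out of the parsed YAML could look like this sketch (it assumes the file has already been loaded into a dict, e.g. with PyYAML's yaml.safe_load; field names follow the sections above):

```python
def parse_runs(summary):
    """Split loaded experiment_summary.yaml content into run records
    and the evaluation matrix."""
    runs = [
        {
            "name": r["name"],
            "type": r["type"],                        # "fine-tuned" or "control"
            "model": r["model"],
            "parameters": r.get("parameters") or {},  # empty for control runs
        }
        for r in summary.get("runs", [])
    ]
    matrix = [
        {
            "run": m["run"],
            "tasks": m["tasks"],
            "epochs": m.get("epochs"),                # None for control runs
        }
        for m in summary.get("evaluation", {}).get("matrix", [])
    ]
    return runs, matrix
```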

3. Extract Training Loss

For each COMPLETED fine-tuning run:

  1. Find SLURM stdout in the output directory:
    • Parse experiment_summary.yaml "Output" section for output_dir_base
    • Look in: {output_dir_base}/ck-out-{run_name}/slurm-*.out
    • If multiple files, use most recent by modification time
  2. Extract final loss using cruijff_kit.tools.torchtune.extract_loss:
    ```python
    from cruijff_kit.tools.torchtune.extract_loss import final_loss

    result = final_loss(slurm_text)  # returns (epoch, step, loss) or None
    ```
    • The canonical regex and helpers live in tools/torchtune/extract_loss.py
    • final_loss() returns the last match (epoch, step, loss) or None
    • extract_losses() returns all matches as a list
    • The step number from the last match is the total training steps
  3. Record: run_name, final_loss, total_steps, epoch, step

Note: Training SLURM outputs are in the output directory, NOT the run directory.

If SLURM stdout missing:

  • Log warning
  • Record "N/A" for loss
  • Continue with other runs
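If the cruijff_kit helper is not importable, the same last-match idea can be sketched with a stand-in pattern. The log-line shape below is an assumption for illustration; the canonical regex lives in tools/torchtune/extract_loss.py and should be preferred:

```python
import re

# Hypothetical log-line shape, e.g. "2|250|Loss: 0.234" -- torchtune's real
# format may differ; use cruijff_kit's extract_loss helpers when available.
_LOSS_RE = re.compile(r"(\d+)\|(\d+)\|Loss: ([0-9.]+)")

def final_loss(slurm_text):
    """Return (epoch, step, loss) for the last matching line, or None."""
    matches = _LOSS_RE.findall(slurm_text)
    if not matches:
        return None
    epoch, step, loss = matches[-1]
    return int(epoch), int(step), float(loss)
```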

4. Extract Evaluation Accuracy

For each COMPLETED evaluation:

  1. Find .eval files: {run_dir}/eval/logs/*.eval
  2. For each .eval file, run:
    ```bash
    python tools/inspect/parse_eval_log.py {path}
    ```
  3. Parse JSON output for accuracy
  4. Map to epoch using SLURM job names (see below)
  5. For binary tasks, also run summary_binary.py to get balanced accuracy and F1
  6. Record: run_name, task, epoch, accuracy, balanced_accuracy, f1, samples

Script output format:

```json
{
  "status": "success",
  "task": "capitalization",
  "accuracy": 0.85,
  "samples": 100,
  "scorer": "exact_match",
  "model": "..."
}
```
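Turning that JSON into a summary row might look like the following sketch (the record keys are chosen here for illustration; the subprocess invocation of parse_eval_log.py is omitted):

```python
import json

def parse_eval_record(stdout, run_name, epoch):
    """Convert parse_eval_log.py JSON output into a summary row,
    recording "ERROR" for accuracy when the script reports failure."""
    data = json.loads(stdout)
    if data.get("status") != "success":
        return {"run": run_name, "epoch": epoch, "accuracy": "ERROR"}
    return {
        "run": run_name,
        "task": data["task"],
        "epoch": epoch,
        "accuracy": data["accuracy"],
        "samples": data["samples"],
    }
```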

Mapping Epochs via SLURM Job Names

The .eval files don't currently store epoch information directly. To reliably map each evaluation to its epoch:

  1. Find SLURM output files in the eval directory: {run_dir}/eval/slurm-*.out
  2. Extract job IDs from filenames (e.g., slurm-2773062.out → job ID 2773062)
  3. Query job names via sacct:
    ```bash
    sacct -j {job_ids} --format=JobID,JobName%50
    ```
  4. Parse epoch from job name - scaffold-inspect names jobs like eval-{task}-{run}-ep{N}:
    • eval-general_eval-lowlr-ep0 → epoch 0
    • eval-general_eval-lowlr-ep9 → epoch 9
  5. Extract accuracy from SLURM output:
    ```bash
    grep -oP 'match/accuracy: \K[0-9.]+' slurm-{jobid}.out
    ```

Example workflow:

```bash
# Get job names for all eval jobs
sacct -j 2773062,2773063,2773065 --format=JobID,JobName%50

# Output shows epoch in job name:
# 2773062  eval-general_eval-lowlr-ep0
# 2773063  eval-general_eval-lowlr-ep1
# 2773065  eval-general_eval-lowlr-ep2
```

This approach is reliable because:

  • Job names are set by scaffold-inspect and include epoch info
  • Works regardless of submission order or timing
  • Survives job failures and resubmissions
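The job-name parsing in step 4 reduces to one regex; a minimal sketch:

```python
import re

def epoch_from_job_name(job_name):
    """Parse the trailing epoch from scaffold-inspect job names shaped
    like 'eval-{task}-{run}-ep{N}'. Returns None if the suffix is absent."""
    m = re.search(r"-ep(\d+)$", job_name)
    return int(m.group(1)) if m else None
```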

If extraction fails:

  • Script returns {"status": "error", "message": "..."}
  • Log the error
  • Record "ERROR" for accuracy
  • Continue with other evaluations

Computing Balanced Accuracy and F1 (Binary Classification)

For binary classification tasks (0/1 targets), use summary_binary.py to compute additional metrics:

```bash
python tools/inspect/summary_binary.py {path_to_eval_file} --json
```

JSON output format:

```json
{
  "status": "success",
  "path": "/path/to/file.eval",
  "samples": 100,
  "accuracy": 0.85,
  "balanced_accuracy": 0.83,
  "f1": 0.82,
  "precision_1": 0.80,
  "recall_1": 0.84,
  "recall_0": 0.82,
  "confusion_matrix": {"tp": 42, "tn": 43, "fp": 7, "fn": 8, "other": 0}
}
```

Why these metrics matter for imbalanced data:

  • Balanced Accuracy = (Recall_0 + Recall_1) / 2 — not inflated by majority class
  • F1 Score = harmonic mean of precision and recall — penalizes class imbalance

Note: For non-binary tasks, only accuracy is reported (Bal. Acc and F1 shown as "-").
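The two formulas above can be checked with a small helper computed directly from the confusion matrix (a sketch, not the summary_binary.py implementation):

```python
def binary_metrics(tp, tn, fp, fn):
    """Balanced accuracy and F1 from a binary confusion matrix."""
    recall_1 = tp / (tp + fn) if tp + fn else 0.0
    recall_0 = tn / (tn + fp) if tn + fp else 0.0
    precision_1 = tp / (tp + fp) if tp + fp else 0.0
    balanced = (recall_0 + recall_1) / 2
    f1 = (2 * precision_1 * recall_1 / (precision_1 + recall_1)
          if precision_1 + recall_1 else 0.0)
    return {"balanced_accuracy": balanced, "f1": f1}
```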

5. Generate summary.md

Create {experiment_dir}/summary.md with the following structure:

```markdown
# Experiment Summary

**Experiment:** `{experiment_name}` | **Generated:** {timestamp} | **Status:** {X}/{Y} complete

## Run Status

| Run | Type | Fine-tuning | Evaluation |
|-----|------|-------------|------------|
| rank4_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| rank8_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| base_model | Control | N/A | COMPLETED |

## Training Results

| Run | Final Loss | Total Steps | Epochs | Duration |
|-----|------------|-------------|--------|----------|
| rank4_lr1e-5 | 0.234 | 250 | 2 | 8m 15s |
| rank8_lr1e-5 | 0.198 | 250 | 2 | 9m 02s |

**Notes:**
- Base model runs have no training loss (control)
- Duration from SLURM elapsed time (if available)

## Evaluation Results

| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |
|-----|------|-------|----------|----------|------|---------|
| rank4_lr1e-5 | capitalization | 0 | 0.85 | 0.83 | 0.82 | 100 |
| rank4_lr1e-5 | capitalization | 1 | 0.88 | 0.86 | 0.85 | 100 |
| rank8_lr1e-5 | capitalization | 0 | 0.82 | 0.80 | 0.78 | 100 |
| rank8_lr1e-5 | capitalization | 1 | 0.91 | 0.89 | 0.88 | 100 |
| base_model | capitalization | - | 0.45 | 0.50 | 0.31 | 100 |

**Best performing:** rank8_lr1e-5 (epoch 1) with 89% balanced accuracy

## Incomplete Runs

| Run | Stage | Status | Notes |
|-----|-------|--------|-------|
| rank16_lr1e-5 | Fine-tuning | FAILED | Check slurm-12345.out |

## Next Steps

1. View detailed evaluation results: `inspect view --port=$(get_free_port)`
2. Export raw data: `inspect log export {run_dir}/eval/logs/*.eval --format csv`
3. Full analysis: `analyze-experiment` (when available)

---
*Generated by summarize-experiment skill*
```
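Filling the **Best performing** line could be sketched as below (the row dicts are hypothetical, keyed like the Evaluation Results columns):

```python
def best_run(eval_rows):
    """Pick the row with the highest balanced accuracy, skipping rows
    where the metric is missing (e.g. failed extractions)."""
    scored = [r for r in eval_rows
              if isinstance(r.get("balanced_accuracy"), float)]
    if not scored:
        return None
    return max(scored, key=lambda r: r["balanced_accuracy"])
```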

6. Create Log

Document the process in {experiment_dir}/logs/summarize-experiment.log.

See logging.md for action types and format.

Error Handling

If SLURM stdout missing

  • Log warning with action type EXTRACT_LOSS
  • Record "N/A" for loss in summary
  • Continue with other runs

If .eval file cannot be parsed

  • Log error with file path
  • Record "ERROR" for accuracy in summary
  • Continue with other evaluations

If all runs failed

  • Generate summary noting all failures
  • Include failure states in "Incomplete Runs" section
  • Suggest troubleshooting steps

If partial results

  • Generate summary with available data
  • Clearly indicate which runs are missing in "Incomplete Runs" section
  • Still identify best performing run from available data

Idempotency

Running summarize-experiment multiple times overwrites summary.md. This is intentional:

  • Allows re-running after fixing failed runs
  • Summary always reflects current state

Output Files

{experiment_dir}/
├── summary.md                    # Human-readable summary (new)
└── logs/
    └── summarize-experiment.log  # Process log (new)

Relationship to Other Skills

  • After: run-experiment (or manual execution)
  • Before: analyze-experiment (when available)
  • Optional hook: run-experiment can invoke this at completion

Future Compatibility

When analyze-experiment is built, summarize-experiment can either:

  • Remain as a quick summary option (text only, no plots)
  • Be deprecated in favor of richer output
  • Become a first stage that analyze-experiment builds upon
