analyze-experiment

v1.0.0
About this Skill

analyze-experiment is a tool for conducting social research with LLMs, providing interactive visualizations and analysis of experiment results. It is well suited to data analysis agents that need advanced experiment visualization with inspect-viz and YAML parsing.

Features

Generates interactive HTML plots using inspect-viz
Loads evaluation logs from run directories
Infers appropriate visualization types based on experimental variables
Reads experiment design from experiment_summary.yaml files
Logs the process in log files

Installation

Universal install (auto-detect), compatible with Cursor, Windsurf, and VS Code:

```bash
npx killer-skills add niznik-dev/cruijff_kit/analyze-experiment
```

Agent Capability Analysis

The analyze-experiment MCP Server by niznik-dev is an open-source community integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Perfect for Data Analysis Agents needing advanced experiment visualization capabilities with inspect-viz and YAML parsing.

Core Value

Empowers agents to generate interactive HTML and PNG plots from experiment results using inspect-viz pre-built views, infer visualization types from experimental variables, and log the analysis process, all while leveraging YAML to understand the experimental design.

Capabilities Granted for analyze-experiment MCP Server

Visualizing experiment results for researchers and developers
Generating interactive HTML plots for data analysis and presentation
Inferring appropriate visualization types based on experimental variables for automated reporting

Prerequisites & Limits

  • Requires inspect-viz library
  • Needs access to experiment_summary.yaml and evaluation logs
  • Limited to generating HTML and PNG plots
Project

  • SKILL.md (11.0 KB)
  • .cursorrules (1.2 KB)
  • package.json (240 B)

SKILL.md

Analyze Experiment

You help users visualize and analyze results from completed experiments by generating interactive HTML and PNG plots using inspect-viz pre-built views.

Your Task

Generate visualizations from evaluation results:

  1. Read experiment_summary.yaml to understand experimental design
  2. Load evaluation logs from run directories
  3. Infer appropriate visualization types based on experimental variables
  4. Generate interactive HTML plots using inspect-viz
  5. Log the process in logs/analyze-experiment.log

Prerequisites

  • experiment_summary.yaml exists (from design-experiment)
  • Evaluations complete (from run-experiment)
  • inspect-viz installed (pip install inspect-viz)
  • Conda environment activated (from claude.local.md)
  • Optional: playwright for PNG export (pip install playwright && playwright install chromium)

Workflow

1. Locate Experiment → parsing.md

Find experiment directory and parse experiment_summary.yaml for:

  • Experimental variables (model, task, factors)
  • Run structure and naming conventions
  • Evaluation matrix

See parsing.md for:

  • How to find the experiment directory
  • YAML parsing logic
  • Extracting variable types for inference
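
A minimal sketch of that lookup is below, assuming the summary lives at the top of the experiment directory; the keys read at the end are hypothetical, so consult parsing.md for the real schema.

```python
from pathlib import Path

import yaml

def load_experiment_summary(experiment_dir: str) -> dict:
    """Parse experiment_summary.yaml from the experiment directory."""
    summary_path = Path(experiment_dir) / "experiment_summary.yaml"
    if not summary_path.exists():
        # Matches the error-handling policy below: stop and point to design-experiment
        raise FileNotFoundError(
            f"{summary_path} not found - run the design-experiment skill first"
        )
    with open(summary_path) as f:
        return yaml.safe_load(f)

summary = load_experiment_summary("/path/to/experiment")
# Hypothetical keys shown for illustration; parsing.md documents the real ones
models = summary.get("models", [])
tasks = summary.get("tasks", [])
```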

2. Load Evaluation Logs → data_loading.md

Use helper functions from tools/inspect/viz_helpers.py:

  • deduplicate_eval_files() - Remove duplicate evals (keeps most recent per model+epoch)
  • evals_df_prep() - Prepare eval-level dataframes
  • parse_eval_metadata() - Extract epoch/finetuned/source_model from JSON metadata
  • detect_metrics() - Dynamically detect available score columns

See data_loading.md for:

  • Automatic deduplication of re-run evaluations
  • Generic metadata extraction using vis_label and JSON metadata
  • How to construct subdirs list from experiment_summary.yaml
  • Preparing data for each view type
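
Put together, loading might look like the sketch below; the function names come from the list above, but their signatures and the exact shape of `subdirs` are assumptions, so follow data_loading.md for the real call pattern.

```python
# Sketch only: names come from tools/inspect/viz_helpers.py, signatures are assumed.
from tools.inspect.viz_helpers import (
    deduplicate_eval_files,
    evals_df_prep,
    parse_eval_metadata,
    detect_metrics,
)

# subdirs is constructed from experiment_summary.yaml (see data_loading.md)
eval_files = deduplicate_eval_files(subdirs)   # keep most recent per model+epoch
logs_df = evals_df_prep(eval_files)            # eval-level dataframe
logs_df = parse_eval_metadata(logs_df)         # epoch / finetuned / source_model
metrics = detect_metrics(logs_df)              # available score columns
```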

3. Select Visualizations

Map experimental variables to appropriate pre-built views:

| Variable Type | View |
|---------------|------|
| Multiple models | scores_by_model |
| Binary factor | scores_by_factor |
| Multiple tasks/conditions | scores_by_task |
| Model × task matrix | scores_heatmap |
| Multiple metrics | scores_radar_by_task |
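
As an illustration only (this helper is hypothetical, not part of the skill), the mapping above can be expressed as a small inference function:

```python
def infer_views(n_models: int, n_tasks: int, n_metrics: int,
                has_binary_factor: bool) -> list[str]:
    """Map experimental variables to pre-built view names (illustrative only)."""
    views = []
    if n_models > 1 and n_tasks == 1:
        views.append("scores_by_model")        # cross-model, single task only
    if has_binary_factor:
        views.append("scores_by_factor")
    if n_tasks > 1:
        views.append("scores_by_task")
        if n_models > 1:
            views.append("scores_heatmap")     # model × task matrix
    if n_metrics > 1:
        views.append("scores_radar_by_task")
    return views
```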

4. Generate Plots → generation.md

Create visualizations using inspect-viz:

  1. Wrap dataframes with Data.from_dataframe()
  2. Call appropriate pre-built view functions
  3. Save HTML with write_html()
  4. Auto-export PNG with write_png() if playwright is available
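
A sketch of those four steps follows. It uses the names this document mentions (Data.from_dataframe, a pre-built view, write_html, write_png), but the import paths and call signatures are assumptions; generation.md and the example notebook are authoritative.

```python
# Import paths and signatures below are assumed from the names in this document;
# verify against generation.md and viz_examples/scripts/inspect_viz_examples.ipynb.
from pathlib import Path

from inspect_viz import Data, write_html, write_png
from inspect_viz.view.beta import scores_by_task

out_dir = Path(experiment_dir) / "analysis"
data = Data.from_dataframe(logs_df)                   # 1. wrap the prepared dataframe
plot = scores_by_task(data)                           # 2. call a pre-built view
write_html(plot, out_dir / "scores_by_task.html")     # 3. save standalone HTML
try:
    write_png(plot, out_dir / "scores_by_task.png")   # 4. PNG export if playwright works
except Exception as err:
    print(f"PNG export skipped: {err}")               # skip with warning otherwise
```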

See generation.md for:

  • Dynamic metric detection
  • View function signatures and parameters
  • PNG export with playwright auto-detection
  • Output file naming conventions
  • Creating multiple plots per experiment

5. Generate Analysis & Interpretation → generation.md

This step is required by default. Write a future_directions string that interprets the results and suggests next steps. This section should:

  1. Interpret the key findings (what do the metrics mean for the research question?)
  2. Note any surprises or anomalies in the results
  3. Suggest concrete next experiments or analyses

Only omit this section if the user explicitly asks to skip it.

6. Generate Report → generation.md

Create markdown report with metrics and comparisons:

  1. Extract metrics from evaluation dataframe
  2. Compute Wilson score confidence intervals
  3. Generate narrative summary and comparison tables
  4. Include future_directions from step 5
  5. Write report to analysis/report.md

Uses tools/inspect/report_generator.py:

```python
from pathlib import Path

from tools.inspect.report_generator import generate_report

report = generate_report(
    df=logs_df,
    experiment_name=experiment_name,
    output_path=Path(experiment_dir) / "analysis" / "report.md",
    config=experiment_config,             # Optional, for research question metadata
    future_directions=future_directions,  # From step 5
)
```
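
For reference, the Wilson score interval from step 2 can be computed as in this standalone sketch; it shows the formula only and is not necessarily how report_generator.py implements it.

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval (95% by default) for a binomial proportion."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)

low, high = wilson_ci(42, 50)  # e.g. 42/50 correct -> roughly (0.71, 0.92)
```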

6b. Compute Utilization Report → generation.md

If run logs exist (logs/run-torchtune.log and/or logs/run-inspect.log), generate a compute utilization section:

  1. Extract job IDs from logs (regex: Result: Job ID (\d+))
  2. Run seff for each job and parse with tools/slurm/compute_metrics.py
  3. Read gpu_metrics.csv files and summarize with summarize_gpu_metrics()
  4. Generate compute table with format_compute_table()
  5. Save raw metrics to analysis/compute_metrics.json
  6. Pass formatted table as compute_section to generate_report()

Skip silently if no run logs exist (e.g., analyzing results without having run the experiment locally).
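
The job-ID extraction and seff call from steps 1 and 2 might look like the sketch below; the regex is the one given above and seff is the standard SLURM efficiency tool, while the helpers in tools/slurm/compute_metrics.py are referenced by name only since their signatures are not documented here.

```python
import re
import subprocess
from pathlib import Path

def extract_job_ids(log_path: Path) -> list[str]:
    """Pull SLURM job IDs out of a run log using the regex from step 1."""
    return re.findall(r"Result: Job ID (\d+)", log_path.read_text())

def seff_output(job_id: str) -> str:
    """Return raw seff output for one job, to be parsed by compute_metrics.py."""
    return subprocess.run(
        ["seff", job_id], capture_output=True, text=True, check=True
    ).stdout

job_ids = []
for log_name in ("run-torchtune.log", "run-inspect.log"):
    log_path = Path(experiment_dir) / "logs" / log_name
    if log_path.exists():        # skip silently if a run log is missing
        job_ids += extract_job_ids(log_path)
```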

7. Logging → logging.md

Document process in {experiment_dir}/logs/analyze-experiment.log

See logging.md for:

  • JSONL format specification
  • Action types (LOCATE, PARSE, LOAD, INFER, GENERATE)
  • Example log entries
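
A log entry could be appended as in this sketch; the field names are assumptions, and logging.md defines the actual JSONL schema.

```python
import json
from datetime import datetime, timezone

def log_action(log_path, action: str, **detail) -> None:
    """Append one JSONL entry; field names here are illustrative (see logging.md)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,  # one of LOCATE, PARSE, LOAD, INFER, GENERATE
        **detail,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_action(log_path, "GENERATE", view="scores_by_task",
           output="analysis/scores_by_task.html")
```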

Supported Pre-built Views

| View | Use Case | Required Data |
|------|----------|---------------|
| scores_by_task | Multiple tasks/conditions | task_name column |
| scores_heatmap | Model × task matrix | model + task columns |
| scores_radar_by_task | Multiple metrics | Multiple score columns |
| scores_by_factor | Binary factors | Boolean factor column |
| scores_by_model | Cross-model comparison (single task only) | model column, single task |

Note: scores_by_model requires a single-task experiment. For multi-task experiments, use scores_by_task, scores_heatmap, or scores_radar_by_task instead.

User Questions

Existing Analysis Outputs

If analysis/ directory exists with files, ask the user:

Found existing analysis outputs in analysis/. What would you like to do?

1. Keep existing files, add new outputs (Recommended)
2. Clean analysis directory first

If user chooses option 2, delete contents of analysis/ before generating new outputs.

Visualization Selection

When vis_label creates multiple task variants (conditions), ask the user which visualization to generate:

Found {N} conditions via vis_label: {list}

Which visualization would you like?

1. scores_by_task - Compare conditions side-by-side (Recommended)
2. scores_heatmap - Model × condition matrix
3. Both

Tracking Generated Files

Important: Track which PNG files you generate during this run. Only pass those to generate_report() so the report embeds only the visualizations created in this session, not stale outputs from previous runs.

```python
generated_pngs = []

# After each successful PNG export
generated_pngs.append(png_path)

# Pass to report generator
generate_report(..., generated_pngs=generated_pngs)
```

Smart defaults for everything else:

  • Deduplication: Automatic (keep most recent per model+epoch)
  • Epochs: Use latest only (unless user explicitly asks for all)
  • Metrics: Plot all available (dynamically detected)
  • PNG: Auto-detect playwright; generate if available, skip with warning if not

Output Structure

After running, the experiment directory will contain:

```
{experiment_dir}/
├── analysis/
│   ├── report.md               # Markdown report with metrics
│   ├── scores_by_task.html
│   ├── scores_by_task.png      (if playwright available)
│   ├── scores_heatmap.html
│   ├── scores_heatmap.png      (if playwright available)
│   ├── roc_curves.png          (if risk_scorer used)
│   ├── calibration_curves.png  (if risk_scorer used)
│   ├── prediction_histogram.png (if risk_scorer used)
│   └── ...
├── logs/
│   └── analyze-experiment.log
└── experiment_summary.yaml
```

Error Handling

If experiment_summary.yaml not found:

  • Report error to user
  • Suggest running design-experiment skill first
  • Do not proceed

If no evaluation logs found:

  • Report error to user
  • Suggest running run-experiment skill first
  • Do not proceed

If visualization generation fails:

  • Log error details in analyze-experiment.log
  • Continue with remaining visualizations
  • Report which visualizations failed in summary

If inference cannot determine view type:

  • Log warning
  • Ask user which view to generate
  • Or skip and note in summary

Validation Before Completion

Before reporting success, verify:

  • ✓ experiment_summary.yaml was found and parsed
  • ✓ Evaluation logs were loaded successfully
  • ✓ At least one visualization was generated
  • ✓ HTML files exist in analysis/ directory
  • ✓ report.md was generated in analysis/ directory
  • ✓ Log file created (logs/analyze-experiment.log)

Output Summary

After completing analysis, provide a summary:

````markdown
## Analyze Experiment Complete

Experiment: `{experiment_dir}`

### Report Generated

✓ Markdown report: `analysis/report.md`
  - Executive summary with best performer
  - Model comparison table with 95% CIs
  - Calibration metrics (if available)

### Visualizations Generated

✓ {N} visualizations created in `analysis/`

**Created:**
- scores_by_task.html - Task comparison (match accuracy)
- scores_heatmap.html - Model × task matrix
- scores_by_model.html - Model comparison

### Viewing Results

Open HTML files in a browser:
```bash
open {experiment_dir}/analysis/scores_by_task.html
```

Or start a local server:
```bash
cd {experiment_dir}/analysis && python -m http.server 8080
```

### Log

Details recorded in logs/analyze-experiment.log

### Report Path

{experiment_dir}/analysis/report.md
````


**IMPORTANT:** Always end your output summary with the **full absolute path** to `report.md` on its own line, so the user can command-click it in their terminal/IDE. Example:

Full report: /scratch/gpfs/user/ck-experiments/my_experiment/analysis/report.md


## Relationship to Other Skills

- **After:** run-experiment, summarize-experiment
- **Reads:** experiment_summary.yaml, .eval files
- **Creates:** HTML visualizations, analysis log

**Workflow position:**

design-experiment → scaffold-experiment → run-experiment → analyze-experiment


## Important Notes

- **Metadata-driven** - uses `vis_label` metadata for task names, JSON metadata for epoch/finetuned/source_model
- **Automatic deduplication** - when multiple evals exist for the same model+epoch, keeps only the most recent (logs skipped files)
- **Dynamic metric detection** - plots all available score columns (no hardcoding)
- **Must run after evaluations complete** - requires .eval files to exist
- Visualizations are independent - one failing doesn't stop others
- Output directory (`analysis/`) is created if it doesn't exist
- HTML files are standalone (no external dependencies to view)
- PNG files require playwright (skipped with warning if not installed)
- Uses helper functions from `tools/inspect/viz_helpers.py`
- Reference examples in `viz_examples/scripts/inspect_viz_examples.ipynb`

## Module Organization

| Module | Purpose |
|--------|---------|
| parsing.md | Experiment location and YAML parsing |
| data_loading.md | Helper functions for loading logs |
| generation.md | Plot creation workflow |
| logging.md | JSONL audit trail specification |

## Implementation Reference

Working examples exist in `viz_examples/scripts/inspect_viz_examples.ipynb` demonstrating all supported pre-built views.

Helper functions are implemented in `tools/inspect/viz_helpers.py`.
