Killer-Skills

debug-stuck-eval: troubleshoot and debug stuck evaluations on UK AISI's Inspect in the Cloud with the hawk CLI

v1.0.0

About this Skill

debug-stuck-eval is a skill that enables developers to troubleshoot and debug stuck evaluations on UK AISI's Inspect in the Cloud, using commands such as `hawk auth` and `hawk status`. It is ideal for AI development agents that need advanced troubleshooting capabilities for stuck evaluations.

Features

Verifies authentication using the `hawk auth access-token` command
Retrieves the evaluation set ID from the user and checks status with `hawk status`
Provides JSON reports with pod state, logs, and metrics
Supports log viewing and follow mode using `hawk logs` and `hawk logs -f`
Lists samples with completion status using `hawk list samples`
Enables direct API testing to troubleshoot retries and errors

By METR · Updated: 3/3/2026

Quality Score

Excellent: Top 5% (score 36), based on code quality & docs
Installation
Universal install (auto-detect; supports Cursor, Windsurf, and VS Code):

> npx killer-skills add METR/inspect-action/debug-stuck-eval

Agent Capability Analysis

The debug-stuck-eval skill by METR is an open-source community integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Ideal for AI Development Agents requiring advanced troubleshooting capabilities for stuck evaluations, particularly those working with UK AISI's Inspect in the Cloud.

Core Value

Empowers agents to rapidly diagnose and resolve stuck evaluations by leveraging hawk auth, status checks, and log analysis, utilizing JSON reports and metrics to inform the debugging process.

Capabilities Granted

Debugging stuck evaluations in AI model inspections
Verifying authentication and authorization for AI services
Analyzing logs and metrics to identify error patterns in AI evaluations

Prerequisites & Limits

  • Requires authentication via `hawk auth access-token`
  • Depends on UK AISI's Inspect in the Cloud infrastructure
  • Limited to troubleshooting evaluations, not general AI model development
Project files: SKILL.md (3.9 KB) · .cursorrules (1.2 KB) · package.json (240 B)

Quick Checklist

  1. Verify auth: `hawk auth access-token > /dev/null || echo "Run 'hawk login' first"`
  2. Get the eval-set-id from the user
  3. Check status: `hawk status <eval-set-id>` returns a JSON report with pod state, logs, and metrics
  4. View logs: `hawk logs <eval-set-id>`, or `hawk logs -f <eval-set-id>` for follow mode
  5. List samples: `hawk list samples <eval-set-id>` to see completion status
  6. Look for error patterns (see below)
  7. Test the API directly if logs show retries without clear errors
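Assuming the `hawk` CLI is installed and you are logged in, steps 1-5 of the checklist can be sketched as one shell function; `my-eval-set` below is a hypothetical placeholder id:

```bash
# Sketch of checklist steps 1-5. Assumes the hawk CLI is on PATH;
# "my-eval-set" is a hypothetical placeholder eval-set id.
check_eval() {
  local id="$1"
  # Step 1: verify auth before anything else
  hawk auth access-token > /dev/null 2>&1 || { echo "Run 'hawk login' first"; return 1; }
  hawk status "$id"          # step 3: JSON report (pod state, logs, metrics)
  hawk logs "$id"            # step 4: one-shot log view (add -f to follow)
  hawk list samples "$id"    # step 5: per-sample completion status
}

check_eval "my-eval-set" || echo "checklist aborted"
```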

Error Patterns

| Log Pattern | Meaning | Resolution |
| --- | --- | --- |
| `[uuid task/id/epoch model] Retrying request to /responses` | OpenAI SDK retry with sample context | Test API directly with curl to see real error |
| `[uuid task/id/epoch model] -> model retry N ... [ErrorType code]` | Inspect retry with error summary | Check error type; use curl for full details |
| `500 - Internal server error` | API issue | Download buffer, find failing request, test through middleman AND directly to provider |
| `400 - invalid_request_error` | Token/context limit exceeded | Check message count and model context window |
| `Pod UID mismatch` | Sandbox pod was killed and restarted | No fix needed; sample errored out, Inspect will retry |
| Empty output, `pending: true` | API returned malformed response | Restart eval (buffer resumes) |
| `OOMKilled` in pod status | Memory exhaustion | Increase pod memory limits |
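To scan a captured log for these patterns in one pass, a grep along the following lines works; the log file and its contents here are fabricated for illustration:

```bash
# Build a tiny sample log (illustrative lines only), then scan it for
# the error patterns listed above.
LOG="$(mktemp)"
printf '%s\n' \
  "[3f2a task/7/0 gpt-4o] Retrying request to /responses" \
  "500 - Internal server error" \
  "sample 12 completed" > "$LOG"

grep -nE 'Retrying request|retry [0-9]+|500 - Internal server error|invalid_request_error|Pod UID mismatch|pending: true|OOMKilled' "$LOG"
rm -f "$LOG"
```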

Key Techniques

  1. Retry messages have sample context - All retry messages include a `[sample_uuid task/sample_id/epoch model]` prefix. Inspect's own retries also include a compact error-summary suffix like `[RateLimitError 429 rate_limit_exceeded]`. The OpenAI SDK's internal retry messages still don't show the actual error; use curl for full details.
  2. FAIL-OK patterns are fine - Alternating failures and successes mean the eval IS progressing. Only worry about consistent FAIL-FAIL-FAIL patterns.
  3. Use S3 for buffer access - Download `.buffer/` from S3 rather than accessing the runner pod directly.
  4. Read .eval files with inspect_ai - Use `from inspect_ai.log import read_eval_log` instead of manually extracting zips.
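As a small illustration of technique 1, the bracketed sample-context prefix can be pulled out of a retry line with sed; the log line below is fabricated:

```bash
# Extract the [sample_uuid task/sample_id/epoch model] prefix from a
# retry message. The line here is a fabricated example.
line='[3f2a task/12/0 gpt-4o] -> gpt-4o retry 3 ... [RateLimitError 429 rate_limit_exceeded]'
prefix="$(printf '%s\n' "$line" | sed -E 's/^\[([^]]+)\].*/\1/')"
echo "$prefix"   # 3f2a task/12/0 gpt-4o
```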

Test API Directly

Middleman is the auth proxy. If middleman fails but direct provider calls work, it's a middleman issue.

```bash
TOKEN=$(hawk auth access-token)

# Test through middleman
curl --max-time 300 -X POST https://middleman.internal.metr.org/anthropic/v1/messages \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 100, "messages": [{"role": "user", "content": "Say hello"}]}'

# Test OpenAI-compatible
curl --max-time 300 -X POST https://middleman.internal.metr.org/openai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 100}'
```

Recovery

```bash
# Delete stuck eval and restart
hawk delete <eval-set-id>
hawk eval-set <config.yaml>
```

The sample buffer in S3 allows Inspect to resume from where it left off (unless you use --no-resume).

HTTP Retry Count

Task progress logs include "HTTP retries: X". High retry counts indicate API instability even while tasks complete.

Severity: Retry count × wait time = stuck duration. E.g., 45 retries × 1800s = 22+ hours stuck.
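That back-of-the-envelope estimate, expressed as shell arithmetic (values are the illustrative ones from the text above):

```bash
# Stuck-duration estimate: retries x wait-per-retry (seconds).
# 45 retries and 1800s wait are the example values from the text.
retries=45
wait_s=1800
stuck_s=$(( retries * wait_s ))
echo "stuck for $(( stuck_s / 3600 )) hours ($stuck_s s)"   # stuck for 22 hours (81000 s)
```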

More Details

See docs/debugging-stuck-evals.md for:

  • Sample buffer SQL queries
  • Detailed API testing examples
  • Escalation checklist

