Killer-Skills

debug-stuck-eval: troubleshoot and debug stuck evaluations on UK AISI's Inspect in the Cloud with the hawk CLI

v1.0.0

About this Skill

debug-stuck-eval is a skill that enables developers to troubleshoot and debug stuck evaluations on UK AISI's Inspect in the Cloud, using commands such as `hawk auth` and `hawk status`. It is ideal for AI development agents that need advanced troubleshooting capabilities for stuck evaluations.

Features

Verifies authentication using the `hawk auth access-token` command
Retrieves the evaluation set ID from the user and checks status with `hawk status`
Provides JSON reports with pod state, logs, and metrics
Supports log viewing and follow mode using `hawk logs` and `hawk logs -f`
Lists samples with completion status using `hawk list samples`
Enables direct API testing to troubleshoot retries and errors

By METR · Updated: 3/3/2026

Quality Score

Excellent: Top 5% (score 36), based on code quality & docs
Installation
Universal install (auto-detect; supports Cursor, Windsurf, and VS Code):

> npx killer-skills add METR/inspect-action/debug-stuck-eval

Agent Capability Analysis

The debug-stuck-eval skill by METR is an open-source community integration for Claude and other AI agents, enabling seamless task automation and capability expansion.

Ideal Agent Persona

Ideal for AI Development Agents requiring advanced troubleshooting capabilities for stuck evaluations, particularly those working with UK AISI's Inspect in the Cloud.

Core Value

Empowers agents to rapidly diagnose and resolve stuck evaluations by leveraging hawk auth, status checks, and log analysis, utilizing JSON reports and metrics to inform the debugging process.

Capabilities Granted

Debugging stuck evaluations in AI model inspections
Verifying authentication and authorization for AI services
Analyzing logs and metrics to identify error patterns in AI evaluations

Prerequisites & Limits

  • Requires authentication via `hawk auth access-token`
  • Depends on UK AISI's Inspect in the Cloud infrastructure
  • Limited to troubleshooting evaluations, not general AI model development
Project files: SKILL.md (3.9 KB) · .cursorrules (1.2 KB) · package.json (240 B)

Quick Checklist

  1. Verify auth: `hawk auth access-token > /dev/null || echo "Run 'hawk login' first"`
  2. Get the eval-set-id from the user
  3. Check status: `hawk status <eval-set-id>` returns a JSON report with pod state, logs, and metrics
  4. View logs: `hawk logs <eval-set-id>`, or `hawk logs -f <eval-set-id>` for follow mode
  5. List samples: `hawk list samples <eval-set-id>` to see completion status
  6. Look for error patterns (see below)
  7. Test the API directly if logs show retries without clear errors
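Assuming the `hawk` CLI is installed and you are logged in, steps 1-5 of the checklist can be sketched as one shell function; `my-eval-set` below is a hypothetical placeholder id:

```bash
# Sketch of checklist steps 1-5. Assumes the hawk CLI is on PATH;
# "my-eval-set" is a hypothetical placeholder eval-set id.
check_eval() {
  local id="$1"
  # Step 1: verify auth before anything else
  hawk auth access-token > /dev/null 2>&1 || { echo "Run 'hawk login' first"; return 1; }
  hawk status "$id"          # step 3: JSON report (pod state, logs, metrics)
  hawk logs "$id"            # step 4: one-shot log view (add -f to follow)
  hawk list samples "$id"    # step 5: per-sample completion status
}

check_eval "my-eval-set" || echo "checklist aborted"
```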

Error Patterns

| Log Pattern | Meaning | Resolution |
| --- | --- | --- |
| `[uuid task/id/epoch model] Retrying request to /responses` | OpenAI SDK retry with sample context | Test API directly with curl to see real error |
| `[uuid task/id/epoch model] -> model retry N ... [ErrorType code]` | Inspect retry with error summary | Check error type; use curl for full details |
| `500 - Internal server error` | API issue | Download buffer, find failing request, test through middleman AND directly to provider |
| `400 - invalid_request_error` | Token/context limit exceeded | Check message count and model context window |
| `Pod UID mismatch` | Sandbox pod was killed and restarted | No fix needed; sample errored out, Inspect will retry |
| Empty output, `pending: true` | API returned malformed response | Restart eval (buffer resumes) |
| `OOMKilled` in pod status | Memory exhaustion | Increase pod memory limits |
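To scan a captured log for these patterns in one pass, a grep along the following lines works; the log file and its contents here are fabricated for illustration:

```bash
# Build a tiny sample log (illustrative lines only), then scan it for
# the error patterns listed above.
LOG="$(mktemp)"
printf '%s\n' \
  "[3f2a task/7/0 gpt-4o] Retrying request to /responses" \
  "500 - Internal server error" \
  "sample 12 completed" > "$LOG"

grep -nE 'Retrying request|retry [0-9]+|500 - Internal server error|invalid_request_error|Pod UID mismatch|pending: true|OOMKilled' "$LOG"
rm -f "$LOG"
```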

Key Techniques

  1. Retry messages have sample context - All retry messages include a `[sample_uuid task/sample_id/epoch model]` prefix. Inspect's own retries also include a compact error-summary suffix like `[RateLimitError 429 rate_limit_exceeded]`. The OpenAI SDK's internal retry messages still don't show the actual error; use curl for full details.
  2. FAIL-OK patterns are fine - Alternating failures and successes mean the eval IS progressing. Only worry about consistent FAIL-FAIL-FAIL patterns.
  3. Use S3 for buffer access - Download `.buffer/` from S3 rather than accessing the runner pod directly.
  4. Read .eval files with inspect_ai - Use `from inspect_ai.log import read_eval_log` instead of manually extracting zips.
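As a small illustration of technique 1, the bracketed sample-context prefix can be pulled out of a retry line with sed; the log line below is fabricated:

```bash
# Extract the [sample_uuid task/sample_id/epoch model] prefix from a
# retry message. The line here is a fabricated example.
line='[3f2a task/12/0 gpt-4o] -> gpt-4o retry 3 ... [RateLimitError 429 rate_limit_exceeded]'
prefix="$(printf '%s\n' "$line" | sed -E 's/^\[([^]]+)\].*/\1/')"
echo "$prefix"   # 3f2a task/12/0 gpt-4o
```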

Test API Directly

Middleman is the auth proxy. If middleman fails but direct provider calls work, it's a middleman issue.

```bash
TOKEN=$(hawk auth access-token)

# Test through middleman
curl --max-time 300 -X POST https://middleman.internal.metr.org/anthropic/v1/messages \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514", "max_tokens": 100, "messages": [{"role": "user", "content": "Say hello"}]}'

# Test OpenAI-compatible
curl --max-time 300 -X POST https://middleman.internal.metr.org/openai/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 100}'
```

Recovery

```bash
# Delete stuck eval and restart
hawk delete <eval-set-id>
hawk eval-set <config.yaml>
```

The sample buffer in S3 allows Inspect to resume from where it left off (unless you use --no-resume).

HTTP Retry Count

Task progress logs include "HTTP retries: X". High retry counts indicate API instability even while tasks complete.

Severity: Retry count × wait time = stuck duration. E.g., 45 retries × 1800s = 22+ hours stuck.
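That back-of-the-envelope estimate, expressed as shell arithmetic (values are the illustrative ones from the text above):

```bash
# Stuck-duration estimate: retries x wait-per-retry (seconds).
# 45 retries and 1800s wait are the example values from the text.
retries=45
wait_s=1800
stuck_s=$(( retries * wait_s ))
echo "stuck for $(( stuck_s / 3600 )) hours ($stuck_s s)"   # stuck for 22 hours (81000 s)
```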

More Details

See docs/debugging-stuck-evals.md for:

  • Sample buffer SQL queries
  • Detailed API testing examples
  • Escalation checklist

