Killer-Skills

sre

A skill for Kubernetes incident debugging: 5 Whys root cause analysis and multi-source correlation.

v1.0.0
About this Skill

sre is a skill for debugging Kubernetes incidents. It applies 5 Whys root cause analysis and correlates logs, events, and metrics for efficient issue resolution, and is built for DevOps agents that need structured incident investigation.

Features

Applies 5 Whys Analysis for root cause identification
Utilizes Read-Only Investigation for observing and analyzing Kubernetes resources
Combines logs, events, and metrics for Multi-Source Correlation
Integrates with k8s skill for cluster access and KUBECONFIG patterns
Supports direct mutations for debugging in dev clusters

Author: ionfury
Updated: 2/25/2026

Installation
> npx killer-skills add ionfury/homelab

Agent Capability Analysis

The sre skill by ionfury is an open-source community integration for Claude and other AI agents, adding structured Kubernetes incident debugging to an agent's capabilities.

Ideal Agent Persona

Perfect for DevOps Agents needing advanced Kubernetes incident debugging capabilities using 5 Whys Analysis and Multi-Source Correlation.

Core Value

Empowers agents to debug Kubernetes incidents by combining logs, events, and metrics for complete analysis, leveraging read-only investigation and cluster access via KUBECONFIG patterns.

Capabilities Granted for sre MCP Server

Debugging Kubernetes incidents using 5 Whys Analysis
Correlating multi-source data for root cause identification
Analyzing cluster access and internal service URLs for issue resolution

Prerequisites & Limits

  • Requires cluster access and KUBECONFIG patterns
  • Read-only investigation required for live/integration environments
  • Direct mutations only permitted in dev cluster for debugging

Cluster access, KUBECONFIG patterns, and internal service URLs are in the k8s skill.

Debugging Kubernetes Incidents

Core Principles

  • 5 Whys Analysis - NEVER stop at symptoms. Ask "why" until you reach the root cause.
  • Read-Only Investigation - Observe and analyze; never modify resources on integration/live. The dev cluster permits direct mutations for debugging (see troubleshooter agent boundaries).
  • Multi-Source Correlation - Combine logs, events, and metrics for a complete picture
  • Research Unknown Services - Check documentation before deep investigation
  • Zero Alert Tolerance - Every firing alert must be addressed immediately: fix the root cause, or as a last resort, create a declarative Silence CR with justification. Never ignore, defer, or dismiss a firing alert.

The 5 Whys Analysis (CRITICAL)

You MUST apply 5 Whys before concluding any investigation. Stopping at symptoms leads to ineffective fixes.

How to Apply

  1. Start with the observed symptom
  2. Ask "Why did this happen?" for each answer
  3. Continue until you reach an actionable root cause (typically 5 levels)

Example

Symptom: Helm install failed with "context deadline exceeded"

Why #1: Why did Helm timeout?
  → Pods never became Ready

Why #2: Why weren't pods Ready?
  → Pods stuck in Pending state

Why #3: Why were pods Pending?
  → PVCs couldn't bind (StorageClass "fast" not found)

Why #4: Why was StorageClass missing?
  → longhorn-storage Kustomization failed to apply

Why #5: Why did the Kustomization fail?
  → numberOfReplicas was integer instead of string

ROOT CAUSE: YAML type coercion issue
FIX: Use properly typed variable for StorageClass parameters

Red Flags You Haven't Reached Root Cause

  • Your "fix" is increasing a timeout or retry count
  • Your "fix" addresses the symptom, not what caused it
  • You can still ask "but why did THAT happen?"
  • Multiple issues share the same underlying cause

BAD:  "Helm timed out → increase timeout to 15m"
GOOD: "Helm timed out → ... → Kustomization type error → fix YAML"

Investigation Phases

Phase 1: Triage

  1. Confirm cluster - Ask user: "Which cluster? (dev/integration/live)"
  2. Assess severity - P1 (down) / P2 (degraded) / P3 (minor) / P4 (cosmetic)
  3. Identify scope - Pod / Deployment / Namespace / Cluster-wide

Phase 2: Data Collection

```bash
# Pod status and events
kubectl get pods -n <namespace>
kubectl describe pod <pod> -n <namespace>

# Logs (current and previous)
kubectl logs <pod> -n <namespace> --tail=100
kubectl logs <pod> -n <namespace> --previous

# Events timeline
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Resource usage
kubectl top pods -n <namespace>
```

Metrics and alerts via kubectl exec (Prometheus is behind OAuth2 Proxy — DNS URLs won't work for API queries):

```bash
# Check firing alerts
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/alerts' | jq '.data.alerts[] | select(.state == "firing")'

# Pod restart metrics
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=increase(kube_pod_container_status_restarts_total[1h])>0' | jq '.data.result'
```

Phase 3: Correlation

  1. Extract timestamps from logs, events, metrics
  2. Identify what happened FIRST (root cause)
  3. Trace the cascade of effects

Phase 4: Root Cause (5 Whys)

Apply 5 Whys analysis. Validate:

  • Temporal: Did it happen BEFORE the symptom?
  • Causal: Does it logically explain the symptom?
  • Evidence: Is there supporting data?
  • Complete: Have you asked "why" enough times?

Phase 5: Remediation

Use AskUserQuestion tool to present fix options when multiple valid approaches exist.

Provide recommendations only (read-only investigation):

  • Immediate: Rollback, scale, restart
  • Permanent: Code/config fixes
  • Prevention: Alerts, quotas, tests
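Because remediation stays read-only, a helper can print the immediate options rather than run them. This `suggest_remediation` function is a hypothetical sketch, not part of the skill; the deployment and namespace names are placeholders.

```shell
#!/bin/sh
# Print, never execute, the standard immediate remediation commands for a
# deployment, keeping the investigation itself read-only.
suggest_remediation() {
  deploy="$1"
  ns="$2"
  echo "# Immediate options for ${deploy} in ${ns}, run manually after review:"
  echo "kubectl rollout undo deployment/${deploy} -n ${ns}       # rollback"
  echo "kubectl scale deployment/${deploy} -n ${ns} --replicas=3 # scale out"
  echo "kubectl rollout restart deployment/${deploy} -n ${ns}    # restart"
}

suggest_remediation my-app my-namespace
```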

Quick Diagnosis

| Symptom | First Check | Common Cause |
|---|---|---|
| ImagePullBackOff | `describe pod` events | Wrong image/registry auth |
| Pending | Events, node capacity | Insufficient resources |
| CrashLoopBackOff | `logs --previous` | App error, missing config |
| OOMKilled | Memory limits | Memory leak, limits too low |
| Unhealthy | Probe config | Slow startup, wrong endpoint |
| Service unreachable | Hubble dropped traffic | Network policy blocking |
| Can't reach database | Hubble + namespace labels | Missing access label |
| Gateway returns 503 | Hubble from istio-gateway | Missing profile label |
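The First Check column can be folded into a small lookup helper: given a symptom, print the command to run first. The `first_check` name and the `<pod>`/`<namespace>` placeholders are illustrative.

```shell
#!/bin/sh
# Map a symptom to its first diagnostic command, following the table above.
first_check() {
  case "$1" in
    ImagePullBackOff|Pending)
      echo "kubectl describe pod <pod> -n <namespace>" ;;
    CrashLoopBackOff)
      echo "kubectl logs <pod> -n <namespace> --previous" ;;
    OOMKilled)
      echo "kubectl describe pod <pod> -n <namespace> | grep -A3 Limits" ;;
    *)
      # Network symptoms: check Hubble for dropped traffic.
      echo "hubble observe --verdict DROPPED --namespace <namespace> --since 5m" ;;
  esac
}

first_check CrashLoopBackOff
```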

Common Failure Chains

Storage failures cascade:

StorageClass missing → PVC Pending → Pod Pending → Helm timeout

Network failures cascade:

DNS failure → Service unreachable → Health check fails → Pod restarted

Network policy failures cascade:

Missing namespace profile label → No ingress allowed → Service unreachable from gateway
Missing access label → Can't reach database → App fails health checks → CrashLoopBackOff

Secret failures cascade:

ExternalSecret fails → Secret missing → Pod CrashLoopBackOff

Network Policy Debugging (Cilium + Hubble)

Network policies are ENFORCED - all traffic is implicitly denied unless allowed.

Check for Blocked Traffic

```bash
# Setup Hubble access (run once per session)
KUBECONFIG=~/.kube/<cluster>.yaml kubectl port-forward -n kube-system svc/hubble-relay 4245:80 &

# See dropped traffic in a namespace
hubble observe --verdict DROPPED --namespace <namespace> --since 5m

# See what's trying to reach a service
hubble observe --to-namespace <namespace> --verdict DROPPED --since 5m

# Check specific traffic flow
hubble observe --from-namespace <source> --to-namespace <dest> --since 5m
```

Common Network Policy Issues

| Symptom | Check | Fix |
|---|---|---|
| Service unreachable from gateway | `kubectl get ns <ns> --show-labels` | Add profile label |
| Can't reach database | Check `access.network-policy.homelab/postgres` label | Add access label |
| Pods can't resolve DNS | Hubble DNS drops (rare - baseline allows) | Check for custom egress blocking |
| Inter-pod communication fails | Hubble intra-namespace drops | Baseline should allow - check for overrides |

Namespace Labels Checklist

```bash
# Check namespace has required labels
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get ns <namespace> -o jsonpath='{.metadata.labels}' | jq

# Required for app namespaces:
# - network-policy.homelab/profile: standard|internal|internal-egress|isolated

# Optional access labels:
# - access.network-policy.homelab/postgres: "true"
# - access.network-policy.homelab/garage-s3: "true"
# - access.network-policy.homelab/kube-api: "true"
```

Emergency: Disable Network Policies

```bash
# Escape hatch - disables enforcement for namespace (triggers alert after 5m)
KUBECONFIG=~/.kube/<cluster>.yaml kubectl label namespace <ns> network-policy.homelab/enforcement=disabled

# Re-enable after fixing
KUBECONFIG=~/.kube/<cluster>.yaml kubectl label namespace <ns> network-policy.homelab/enforcement-
```

See docs/runbooks/network-policy-escape-hatch.md for full procedure.

Kickstarting Stalled HelmReleases

HelmReleases can get stuck in a Stalled state with RetriesExceeded even after the underlying issue is resolved. This happens because:

  1. The HR hit its retry limit (default: 4 attempts)
  2. The failure counter persists even if pods are now healthy
  3. Flux won't auto-retry once Stalled condition is set

Symptoms:

STATUS: Stalled
MESSAGE: Failed to install after 4 attempt(s)
REASON: RetriesExceeded

Diagnosis: Check if the underlying resources are actually healthy:

```bash
# HR shows Failed, but check if pods are running
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get pods -n <namespace> -l app.kubernetes.io/name=<app>

# If pods are Running but HR is Stalled, the HR just needs a reset
```

Fix: Suspend and resume to reset the failure counter:

```bash
KUBECONFIG=~/.kube/<cluster>.yaml flux suspend helmrelease <name> -n flux-system
KUBECONFIG=~/.kube/<cluster>.yaml flux resume helmrelease <name> -n flux-system
```

Common causes of initial failure (that may have self-healed):

  • Missing Secret/ConfigMap (ExternalSecret eventually created it)
  • Missing CRD (operator finished installing)
  • Transient network issues during image pull
  • Resource quota temporarily exceeded

Prevention: Ensure proper dependsOn ordering so prerequisites are ready before HelmRelease installs.
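A dependsOn ordering can look like the following Flux Kustomization sketch; the resource names (`my-app`, `platform`) and the path are illustrative, not taken from the repository:

```yaml
# Illustrative only: the app Kustomization waits for the storage layer
# (which provides its StorageClass) to be Ready before applying.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app              # hypothetical app
  namespace: flux-system
spec:
  dependsOn:
    - name: longhorn-storage  # prerequisite must be Ready first
  interval: 10m
  path: ./kubernetes/apps/my-app
  prune: true
  sourceRef:
    kind: OCIRepository
    name: platform
```

With this in place, Flux holds back the app until `longhorn-storage` reports Ready, avoiding the PVC-Pending failure chain described earlier.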

Promotion Pipeline Debugging

Symptom: "Live cluster not updating after merge"

The OCI artifact promotion pipeline has multiple stages where failures can stall deployment. Walk through each stage in order to find where the pipeline is stuck.

Diagnostic Steps

1. PR merged to main
   └─ Check: Did build-platform-artifact.yaml trigger?
      └─ GitHub Actions → "Build Platform Artifact" workflow
      └─ If missing: Was kubernetes/ modified? (paths filter)

2. OCI artifact built
   └─ Check: Is the artifact in GHCR with integration-* tag?
      └─ flux list artifact oci://ghcr.io/<repo>/platform | grep integration

3. Integration cluster picks up artifact
   └─ Check: Does OCIRepository see the new version?
      └─ KUBECONFIG=~/.kube/integration.yaml kubectl get ocirepository -n flux-system
      └─ Look at .status.artifact.revision — does it match the new RC tag?
      └─ If stale: Check semver constraint (must be ">= 0.0.0-0" to accept RCs)

4. Integration reconciliation succeeds
   └─ Check: Platform Kustomization healthy?
      └─ KUBECONFIG=~/.kube/integration.yaml flux get kustomizations -n flux-system
      └─ If failed: Read Kustomization events for the error

5. Flux Alert fires repository_dispatch
   └─ Check: Did the Alert fire?
      └─ KUBECONFIG=~/.kube/integration.yaml kubectl describe alert validation-success -n flux-system
      └─ Check Provider (GitHub) status:
         KUBECONFIG=~/.kube/integration.yaml kubectl get providers -n flux-system

6. tag-validated-artifact.yaml runs
   └─ Check: Did the workflow trigger?
      └─ GitHub Actions → "Tag Validated Artifact" workflow
      └─ If not triggered: repository_dispatch may not have fired
         (check Provider secret has repo scope)
      └─ If triggered but failed: Check workflow logs for tagging errors

7. Live cluster picks up validated artifact
   └─ Check: Does OCIRepository see the stable semver?
      └─ KUBECONFIG=~/.kube/live.yaml kubectl get ocirepository -n flux-system
      └─ Semver constraint must be ">= 0.0.0" (stable only, no RCs)

Common Failure Modes

| Stage | Symptom | Common Cause |
|---|---|---|
| Build | Workflow did not trigger | kubernetes/ not in changed paths |
| Build | Artifact push failed | GHCR auth issue (GITHUB_TOKEN permissions) |
| Integration | OCIRepository not updating | Semver constraint mismatch (not accepting RCs) |
| Validation | Kustomization failed | Actual config error in the merged PR |
| Promotion | repository_dispatch not received | Provider secret missing repo scope |
| Promotion | Workflow skipped (idempotency guard) | Artifact already tagged as validated |
| Live | OCIRepository not updating | Stable semver tag not created by tag workflow |

Manual Promotion (Emergency)

If the pipeline is stuck and live needs the update:

```bash
# Authenticate to GHCR
echo $GITHUB_TOKEN | docker login ghcr.io -u $GITHUB_USER --password-stdin

# Find the integration artifact
flux list artifact oci://ghcr.io/<repo>/platform | grep integration

# Manually tag as validated + stable semver
flux tag artifact oci://ghcr.io/<repo>/platform:<rc-tag> --tag <stable-semver>
```

See .github/CLAUDE.md for full pipeline architecture and rollback procedures.

Common Confusions

BAD:  Jump to logs without checking events first
GOOD: Events provide context, then investigate logs

BAD:  Look only at current pod state
GOOD: Check --previous logs if pod restarted

BAD:  Assume first error is root cause
GOOD: Apply 5 Whys to find true root cause

BAD:  Investigate without confirming cluster
GOOD: ALWAYS confirm cluster before any kubectl command
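Taken together, the GOOD answers above imply a fixed opening sequence for any investigation. A sketch that prints it (the helper name is made up; `<pod>`/`<namespace>` are for the operator to fill in):

```shell
#!/bin/sh
# Print the recommended investigation order: confirm cluster, events
# before logs, previous logs if restarted, then 5 Whys.
investigation_checklist() {
  cat <<'EOF'
1. Confirm cluster (dev/integration/live) before any kubectl command
2. kubectl get events -n <namespace> --sort-by='.lastTimestamp'
3. kubectl logs <pod> -n <namespace> --tail=100
4. kubectl logs <pod> -n <namespace> --previous   # if the pod restarted
5. Apply 5 Whys before concluding the investigation
EOF
}

investigation_checklist
```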

Keywords

kubernetes, debugging, crashloopbackoff, oomkilled, pending, root cause analysis, 5 whys, incident investigation, pod logs, events, troubleshooting, network policy, hubble, stalled helmrelease, promotion pipeline, live not updating
