Killer-Skills

alerting-oncall

v1.0 · GitHub
About this Skill

Essential for DevOps Automation Agents managing production incident response workflows. alerting-oncall is a skill for configuring effective alerting and on-call management, integrating with monitoring systems like Prometheus and on-call platforms like PagerDuty.

Features

Configures alerting rules and thresholds using Prometheus
Sets up on-call rotations and schedules with PagerDuty
Implements alert routing and escalation using Opsgenie
Reduces alert fatigue through optimized alerting workflows
Manages incident response workflows with Slack and Grafana OnCall

allthingslinux · Updated: 3/6/2026

Quality Score: 65 — Excellent (Top 5%), based on code quality & docs
Installation

Universal install (auto-detects Cursor, Windsurf, and VS Code):
> npx killer-skills add allthingslinux/atl.services/alerting-oncall

Agent Capability Analysis

The alerting-oncall MCP Server by allthingslinux is an open-source community integration for Claude and other AI agents, enabling task automation and capability expansion around alerting and on-call management.

Ideal Agent Persona

Essential for DevOps Automation Agents managing production incident response workflows.

Core Value

Enables automated configuration of alerting rules, thresholds, and on-call rotations using platforms like PagerDuty, Opsgenie, and Grafana OnCall. Reduces alert fatigue through intelligent routing and escalation policies integrated with monitoring systems such as Prometheus and Datadog.

Capabilities Granted for alerting-oncall MCP Server

Configuring alert routing and escalation policies
Setting up on-call rotations and schedules
Implementing alert fatigue reduction strategies
Managing incident response workflows

! Prerequisites & Limits

  • Requires integration with monitoring systems (Prometheus/Datadog)
  • Depends on on-call platform APIs (PagerDuty/Opsgenie)
  • Needs communication channel access (Slack/Teams)
Project

SKILL.md (11.4 KB) · .cursorrules (1.2 KB) · package.json (240 B)

SKILL.md

Alerting & On-Call

Configure effective alerting and on-call management for production systems.

When to Use This Skill

Use this skill when:

  • Setting up alerting rules and thresholds
  • Configuring on-call rotations and schedules
  • Implementing alert routing and escalation
  • Reducing alert fatigue
  • Managing incident response workflows

Prerequisites

  • Monitoring system (Prometheus, Datadog, etc.)
  • On-call platform (PagerDuty, Opsgenie, Grafana OnCall)
  • Communication channels (Slack, email)

Alerting Best Practices

Alert Categories

```yaml
# Severity levels
critical:
  conditions:
    - Service completely down
    - Data loss imminent
    - Security breach
  response: Immediate page, wake people up

high:
  conditions:
    - Service degraded significantly
    - Error rate above SLO
    - Capacity near limit
  response: Page during business hours, notify after hours

medium:
  conditions:
    - Performance degradation
    - Non-critical component failure
    - Warning thresholds exceeded
  response: Notify via Slack, review next business day

low:
  conditions:
    - Informational alerts
    - Capacity planning triggers
    - Routine maintenance needed
  response: Email notification, weekly review
```
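The response column above is essentially a routing decision. A minimal Python sketch of it (the function and channel names are illustrative, not part of any platform API):

```python
# Map a severity level to a notification action, mirroring the
# response policy in the severity table above. Names are illustrative.
def route_alert(severity: str, business_hours: bool) -> str:
    if severity == "critical":
        return "page"                      # immediate page, any hour
    if severity == "high":
        return "page" if business_hours else "notify"
    if severity == "medium":
        return "slack"                     # review next business day
    return "email"                         # low: weekly review

print(route_alert("critical", business_hours=False))  # page
print(route_alert("high", business_hours=False))      # notify
```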

Alert Design Principles

```yaml
# Good alert characteristics
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps

  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior

  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance

  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership
```

Prometheus Alerting

Alert Rules

```yaml
# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"
```
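The HighErrorRate rule fires when the ratio of the 5xx request rate to the total request rate exceeds 5%. The same arithmetic can be sketched in Python (the counter values are made up):

```python
# Compute a 5xx error ratio the way the HighErrorRate rule does:
# rate of 5xx requests divided by rate of all requests.
def error_ratio(counts_by_status: dict) -> float:
    total = sum(counts_by_status.values())
    errors = sum(v for s, v in counts_by_status.items() if s.startswith("5"))
    return errors / total if total else 0.0

window = {"200": 940.0, "404": 10.0, "500": 50.0}  # requests in the window
ratio = error_ratio(window)
print(ratio)         # 0.05
print(ratio > 0.05)  # False: exactly at the threshold does not fire
```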

Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h

    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours

    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'

  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']
```
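The route's `group_by: ['alertname', 'service']` collapses simultaneously firing alerts into one notification per label combination. A rough sketch of that grouping logic (not Alertmanager's actual implementation):

```python
from collections import defaultdict

# Group alerts by (alertname, service), as the route's group_by does:
# one notification is sent per group rather than per alert.
def group_alerts(alerts: list, group_by: list) -> dict:
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(k, "") for k in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "HighLatency", "service": "api", "instance": "a1"}},
    {"labels": {"alertname": "HighLatency", "service": "api", "instance": "a2"}},
    {"labels": {"alertname": "ServiceDown", "service": "db"}},
]
groups = group_alerts(alerts, ["alertname", "service"])
print(len(groups))  # 2 notifications instead of 3 alerts
```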

PagerDuty Integration

Service Configuration

```hcl
# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400  # 4 hours
  acknowledgement_timeout = 600    # 10 minutes

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}
```
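If the incident is never acknowledged, the policy above notifies the primary schedule immediately, escalates to the manager 10 minutes later, and then repeats the sequence `num_loops` more times. A timing sketch under that assumption (PagerDuty's exact loop semantics may differ):

```python
# Sketch of when each escalation target is notified, assuming the
# policy notifies rule 1 at t=0, escalates after each rule's delay,
# and repeats the whole sequence `num_loops` more times unacknowledged.
def notification_times(delays_min: list, num_loops: int) -> list:
    times, t = [], 0
    for _ in range(1 + num_loops):          # first pass + repeats
        for rule, delay in enumerate(delays_min, start=1):
            times.append((t, rule))         # (minute, rule notified)
            t += delay                      # wait before escalating
    return times

# Two rules (10 min, then 15 min), looping twice:
print(notification_times([10, 15], num_loops=2))
# [(0, 1), (10, 2), (25, 1), (35, 2), (50, 1), (60, 2)]
```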

Schedule Configuration

```hcl
resource "pagerduty_schedule" "primary" {
  name      = "Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 604800  # 1 week
    users                        = [for user in pagerduty_user.oncall : user.id]
  }

  # Override layer for holidays
  layer {
    name                         = "Holiday Coverage"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 86400
    users                        = [pagerduty_user.holiday_coverage.id]

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "00:00:00"
      duration_seconds  = 86400
      start_day_of_week = 0  # Sunday
    }
  }
}
```
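The weekly layer boils down to integer division: count the turns elapsed since `rotation_virtual_start`, then take that modulo the number of users. A sketch with hypothetical user names:

```python
from datetime import datetime, timedelta, timezone

# Who is on call at a given moment, for a rotation that started at
# `rotation_start`? Mirrors rotation_turn_length_seconds above.
def on_call(users: list, rotation_start: datetime,
            turn_length: timedelta, at: datetime) -> str:
    turns = int((at - rotation_start) / turn_length)
    return users[turns % len(users)]

start = datetime(2024, 1, 1, tzinfo=timezone.utc)
week = timedelta(seconds=604800)  # 1 week, as in the layer above
team = ["alice", "bob", "carol"]

print(on_call(team, start, week, datetime(2024, 1, 3, tzinfo=timezone.utc)))   # alice
print(on_call(team, start, week, datetime(2024, 1, 10, tzinfo=timezone.utc)))  # bob
print(on_call(team, start, week, datetime(2024, 1, 24, tzinfo=timezone.utc)))  # alice (wrapped)
```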

Grafana OnCall

Integration Setup

```yaml
# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"
```

Escalation Chain

```yaml
# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0

      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m

      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m

      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m
```

Alert Templates

Slack Alert Template

```go
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}
```

PagerDuty Details Template

```go
{{ define "pagerduty.firing" }}
{{ range .Alerts.Firing }}
Alert: {{ .Labels.alertname }}
Service: {{ .Labels.service }}
Instance: {{ .Labels.instance }}
Value: {{ .Annotations.value }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
```

On-Call Best Practices

Rotation Guidelines

```yaml
on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"

  responsibilities:
    - Monitor alerts during shift
    - "Respond within SLA (critical: 5min, high: 15min)"
    - Document incidents
    - Hand off unresolved issues

  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents

  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift
```

Runbook Template

````markdown
# Alert: High Error Rate

## Summary
Error rate has exceeded the threshold of 5% for the service.

## Impact
Users may experience errors when accessing the application.

## Investigation Steps
1. Check service logs: `kubectl logs -l app=myapp -n production`
2. Review recent deployments: `kubectl rollout history deployment/myapp`
3. Check database connectivity: `kubectl exec -it myapp -- nc -zv postgres 5432`
4. Review error traces in APM dashboard

## Remediation
### If caused by recent deployment:
```bash
kubectl rollout undo deployment/myapp -n production
```

### If database related:
```bash
kubectl delete pod -l app=postgres -n production
```

## Escalation
If not resolved within 15 minutes, escalate to:
- Database team: @db-oncall
- Platform team: @platform-oncall
````

Alert Fatigue Reduction

Strategies

```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation

  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows

  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling

  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data
```

Common Issues

Issue: Alert Storm

Problem: Too many alerts firing simultaneously.
Solution: Implement proper grouping and inhibition rules.
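The inhibition half of that solution can be sketched in Python: drop lower-severity alerts whenever a critical alert is firing with the same `service` label, mirroring the `inhibit_rules` idea (a simplification of what Alertmanager actually does):

```python
# Sketch of Alertmanager-style inhibition: drop high-severity alerts
# when a critical alert is firing with the same value for `service`.
def inhibit(alerts: list, equal: tuple = ("service",)) -> list:
    sources = {
        tuple(a["labels"].get(k) for k in equal)
        for a in alerts if a["labels"].get("severity") == "critical"
    }
    return [
        a for a in alerts
        if a["labels"].get("severity") == "critical"
        or tuple(a["labels"].get(k) for k in equal) not in sources
    ]

storm = [
    {"labels": {"alertname": "ServiceDown", "severity": "critical", "service": "api"}},
    {"labels": {"alertname": "HighLatency", "severity": "high", "service": "api"}},
    {"labels": {"alertname": "HighLatency", "severity": "high", "service": "db"}},
]
print(len(inhibit(storm)))  # 2: the api HighLatency alert is suppressed
```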

Issue: Missed Alerts

Problem: Critical alerts not reaching on-call.
Solution: Test escalation policies, verify contact methods.

Issue: False Positives

Problem: Alerts firing without actual issues.
Solution: Tune thresholds, increase evaluation windows.
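Increasing the evaluation window works like the `for:` clause in the Prometheus rules earlier: the alert fires only after the condition has held for several consecutive evaluations. A sketch with made-up samples:

```python
# Fire only after the condition has been true for `for_cycles`
# consecutive evaluations -- the effect of a `for:` duration.
def should_fire(samples: list, threshold: float, for_cycles: int) -> bool:
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= for_cycles:
            return True
    return False

spike = [0.9, 0.1, 0.1, 0.1, 0.1]        # one-sample blip
sustained = [0.1, 0.9, 0.9, 0.9, 0.9]    # real problem

print(should_fire(spike, 0.5, for_cycles=3))      # False: no page for a blip
print(should_fire(sustained, 0.5, for_cycles=3))  # True
```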

Best Practices

  • Define clear severity levels
  • Every alert needs a runbook
  • Test on-call notifications regularly
  • Review and tune alerts weekly
  • Implement proper escalation paths
  • Use alert grouping and inhibition
  • Track alert metrics (MTTR, frequency)
  • Practice incident response regularly

Related Skills

Looking for an alternative to alerting-oncall, or building a community AI agent? Explore these related open-source MCP Servers.

  • widget-generator (by f) — an open-source AI agent skill for creating widget plugins that are injected into prompt feeds on prompts.chat. It supports two rendering modes: standard prompt widgets using default PromptCard styling, and custom render widgets built as full React components.
  • chat-sdk (by lobehub) — a unified TypeScript SDK for building chat bots across multiple platforms, providing a single interface for deploying bot logic.
  • zustand (by lobehub)
  • data-fetching (by lobehub)