Killer-Skills

alerting-oncall

v1.0 · GitHub
About this Skill

Essential for DevOps Automation Agents managing production incident response workflows. alerting-oncall is a skill for configuring effective alerting and on-call management, integrating with monitoring systems like Prometheus and on-call platforms like PagerDuty.

Features

Configures alerting rules and thresholds using Prometheus
Sets up on-call rotations and schedules with PagerDuty
Implements alert routing and escalation using Opsgenie
Reduces alert fatigue through optimized alerting workflows
Manages incident response workflows with Slack and Grafana OnCall

allthingslinux · Updated: 3/6/2026

Quality Score: 65 — Excellent (Top 5%), based on code quality & docs
Installation

Universal install (auto-detects Cursor, Windsurf, and VS Code):
> npx killer-skills add allthingslinux/atl.services/alerting-oncall

Agent Capability Analysis

The alerting-oncall MCP Server by allthingslinux is an open-source community integration for Claude and other AI agents, enabling task automation and capability expansion around alerting and on-call management.

Ideal Agent Persona

Essential for DevOps Automation Agents managing production incident response workflows.

Core Value

Enables automated configuration of alerting rules, thresholds, and on-call rotations using platforms like PagerDuty, Opsgenie, and Grafana OnCall. Reduces alert fatigue through intelligent routing and escalation policies integrated with monitoring systems such as Prometheus and Datadog.

Capabilities Granted for alerting-oncall MCP Server

Configuring alert routing and escalation policies
Setting up on-call rotations and schedules
Implementing alert fatigue reduction strategies
Managing incident response workflows

! Prerequisites & Limits

  • Requires integration with monitoring systems (Prometheus/Datadog)
  • Depends on on-call platform APIs (PagerDuty/Opsgenie)
  • Needs communication channel access (Slack/Teams)
Project

SKILL.md (11.4 KB) · .cursorrules (1.2 KB) · package.json (240 B)

SKILL.md

Alerting & On-Call

Configure effective alerting and on-call management for production systems.

When to Use This Skill

Use this skill when:

  • Setting up alerting rules and thresholds
  • Configuring on-call rotations and schedules
  • Implementing alert routing and escalation
  • Reducing alert fatigue
  • Managing incident response workflows

Prerequisites

  • Monitoring system (Prometheus, Datadog, etc.)
  • On-call platform (PagerDuty, Opsgenie, Grafana OnCall)
  • Communication channels (Slack, email)

Alerting Best Practices

Alert Categories

```yaml
# Severity levels
critical:
  conditions:
    - Service completely down
    - Data loss imminent
    - Security breach
  response: Immediate page, wake people up

high:
  conditions:
    - Service degraded significantly
    - Error rate above SLO
    - Capacity near limit
  response: Page during business hours, notify after hours

medium:
  conditions:
    - Performance degradation
    - Non-critical component failure
    - Warning thresholds exceeded
  response: Notify via Slack, review next business day

low:
  conditions:
    - Informational alerts
    - Capacity planning triggers
    - Routine maintenance needed
  response: Email notification, weekly review
```
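The response column above is essentially a routing decision. A minimal Python sketch of it (the function and channel names are illustrative, not part of any platform API):

```python
# Map a severity level to a notification action, mirroring the
# response policy in the severity table above. Names are illustrative.
def route_alert(severity: str, business_hours: bool) -> str:
    if severity == "critical":
        return "page"                      # immediate page, any hour
    if severity == "high":
        return "page" if business_hours else "notify"
    if severity == "medium":
        return "slack"                     # review next business day
    return "email"                         # low: weekly review

print(route_alert("critical", business_hours=False))  # page
print(route_alert("high", business_hours=False))      # notify
```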

Alert Design Principles

```yaml
# Good alert characteristics
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps

  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior

  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance

  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership
```

Prometheus Alerting

Alert Rules

```yaml
# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"
```
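The HighErrorRate rule fires when the ratio of the 5xx request rate to the total request rate exceeds 5%. The same arithmetic can be sketched in Python (the counter values are made up):

```python
# Compute a 5xx error ratio the way the HighErrorRate rule does:
# rate of 5xx requests divided by rate of all requests.
def error_ratio(counts_by_status: dict) -> float:
    total = sum(counts_by_status.values())
    errors = sum(v for s, v in counts_by_status.items() if s.startswith("5"))
    return errors / total if total else 0.0

window = {"200": 940.0, "404": 10.0, "500": 50.0}  # requests in the window
ratio = error_ratio(window)
print(ratio)         # 0.05
print(ratio > 0.05)  # False: exactly at the threshold does not fire
```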

Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h

    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours

    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'

  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']
```
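The route's `group_by: ['alertname', 'service']` collapses simultaneously firing alerts into one notification per label combination. A rough sketch of that grouping logic (not Alertmanager's actual implementation):

```python
from collections import defaultdict

# Group alerts by (alertname, service), as the route's group_by does:
# one notification is sent per group rather than per alert.
def group_alerts(alerts: list, group_by: list) -> dict:
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(k, "") for k in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "HighLatency", "service": "api", "instance": "a1"}},
    {"labels": {"alertname": "HighLatency", "service": "api", "instance": "a2"}},
    {"labels": {"alertname": "ServiceDown", "service": "db"}},
]
groups = group_alerts(alerts, ["alertname", "service"])
print(len(groups))  # 2 notifications instead of 3 alerts
```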

PagerDuty Integration

Service Configuration

```hcl
# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400  # 4 hours
  acknowledgement_timeout = 600    # 10 minutes

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}
```
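If the incident is never acknowledged, the policy above notifies the primary schedule immediately, escalates to the manager 10 minutes later, and then repeats the sequence `num_loops` more times. A timing sketch under that assumption (PagerDuty's exact loop semantics may differ):

```python
# Sketch of when each escalation target is notified, assuming the
# policy notifies rule 1 at t=0, escalates after each rule's delay,
# and repeats the whole sequence `num_loops` more times unacknowledged.
def notification_times(delays_min: list, num_loops: int) -> list:
    times, t = [], 0
    for _ in range(1 + num_loops):          # first pass + repeats
        for rule, delay in enumerate(delays_min, start=1):
            times.append((t, rule))         # (minute, rule notified)
            t += delay                      # wait before escalating
    return times

# Two rules (10 min, then 15 min), looping twice:
print(notification_times([10, 15], num_loops=2))
# [(0, 1), (10, 2), (25, 1), (35, 2), (50, 1), (60, 2)]
```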

Schedule Configuration

```hcl
resource "pagerduty_schedule" "primary" {
  name      = "Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 604800  # 1 week
    users                        = [for user in pagerduty_user.oncall : user.id]
  }

  # Override layer for holidays
  layer {
    name                         = "Holiday Coverage"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 86400
    users                        = [pagerduty_user.holiday_coverage.id]

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "00:00:00"
      duration_seconds  = 86400
      start_day_of_week = 0  # Sunday
    }
  }
}
```
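The weekly layer boils down to integer division: count the turns elapsed since `rotation_virtual_start`, then take that modulo the number of users. A sketch with hypothetical user names:

```python
from datetime import datetime, timedelta, timezone

# Who is on call at a given moment, for a rotation that started at
# `rotation_start`? Mirrors rotation_turn_length_seconds above.
def on_call(users: list, rotation_start: datetime,
            turn_length: timedelta, at: datetime) -> str:
    turns = int((at - rotation_start) / turn_length)
    return users[turns % len(users)]

start = datetime(2024, 1, 1, tzinfo=timezone.utc)
week = timedelta(seconds=604800)  # 1 week, as in the layer above
team = ["alice", "bob", "carol"]

print(on_call(team, start, week, datetime(2024, 1, 3, tzinfo=timezone.utc)))   # alice
print(on_call(team, start, week, datetime(2024, 1, 10, tzinfo=timezone.utc)))  # bob
print(on_call(team, start, week, datetime(2024, 1, 24, tzinfo=timezone.utc)))  # alice (wrapped)
```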

Grafana OnCall

Integration Setup

```yaml
# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"
```

Escalation Chain

```yaml
# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0

      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m

      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m

      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m
```

Alert Templates

Slack Alert Template

```go
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}
```

PagerDuty Details Template

```go
{{ define "pagerduty.firing" }}
{{ range .Alerts.Firing }}
Alert: {{ .Labels.alertname }}
Service: {{ .Labels.service }}
Instance: {{ .Labels.instance }}
Value: {{ .Annotations.value }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
```

On-Call Best Practices

Rotation Guidelines

```yaml
on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"

  responsibilities:
    - Monitor alerts during shift
    - "Respond within SLA (critical: 5min, high: 15min)"
    - Document incidents
    - Hand off unresolved issues

  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents

  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift
```

Runbook Template

````markdown
# Alert: High Error Rate

## Summary
Error rate has exceeded the threshold of 5% for the service.

## Impact
Users may experience errors when accessing the application.

## Investigation Steps
1. Check service logs: `kubectl logs -l app=myapp -n production`
2. Review recent deployments: `kubectl rollout history deployment/myapp`
3. Check database connectivity: `kubectl exec -it myapp -- nc -zv postgres 5432`
4. Review error traces in APM dashboard

## Remediation
### If caused by recent deployment:
```bash
kubectl rollout undo deployment/myapp -n production
```

### If database related:
```bash
kubectl delete pod -l app=postgres -n production
```

## Escalation
If not resolved within 15 minutes, escalate to:
- Database team: @db-oncall
- Platform team: @platform-oncall
````

Alert Fatigue Reduction

Strategies

```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation

  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows

  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling

  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data
```

Common Issues

Issue: Alert Storm

Problem: Too many alerts firing simultaneously.
Solution: Implement proper grouping and inhibition rules.
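The inhibition half of that solution can be sketched in Python: drop lower-severity alerts whenever a critical alert is firing with the same `service` label, mirroring the `inhibit_rules` idea (a simplification of what Alertmanager actually does):

```python
# Sketch of Alertmanager-style inhibition: drop high-severity alerts
# when a critical alert is firing with the same value for `service`.
def inhibit(alerts: list, equal: tuple = ("service",)) -> list:
    sources = {
        tuple(a["labels"].get(k) for k in equal)
        for a in alerts if a["labels"].get("severity") == "critical"
    }
    return [
        a for a in alerts
        if a["labels"].get("severity") == "critical"
        or tuple(a["labels"].get(k) for k in equal) not in sources
    ]

storm = [
    {"labels": {"alertname": "ServiceDown", "severity": "critical", "service": "api"}},
    {"labels": {"alertname": "HighLatency", "severity": "high", "service": "api"}},
    {"labels": {"alertname": "HighLatency", "severity": "high", "service": "db"}},
]
print(len(inhibit(storm)))  # 2: the api HighLatency alert is suppressed
```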

Issue: Missed Alerts

Problem: Critical alerts not reaching on-call.
Solution: Test escalation policies, verify contact methods.

Issue: False Positives

Problem: Alerts firing without actual issues.
Solution: Tune thresholds, increase evaluation windows.
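Increasing the evaluation window works like the `for:` clause in the Prometheus rules earlier: the alert fires only after the condition has held for several consecutive evaluations. A sketch with made-up samples:

```python
# Fire only after the condition has been true for `for_cycles`
# consecutive evaluations -- the effect of a `for:` duration.
def should_fire(samples: list, threshold: float, for_cycles: int) -> bool:
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= for_cycles:
            return True
    return False

spike = [0.9, 0.1, 0.1, 0.1, 0.1]        # one-sample blip
sustained = [0.1, 0.9, 0.9, 0.9, 0.9]    # real problem

print(should_fire(spike, 0.5, for_cycles=3))      # False: no page for a blip
print(should_fire(sustained, 0.5, for_cycles=3))  # True
```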

Best Practices

  • Define clear severity levels
  • Every alert needs a runbook
  • Test on-call notifications regularly
  • Review and tune alerts weekly
  • Implement proper escalation paths
  • Use alert grouping and inhibition
  • Track alert metrics (MTTR, frequency)
  • Practice incident response regularly

Related Skills

Looking for an alternative to alerting-oncall, or building a community AI agent? Explore these related open-source MCP Servers.

  • widget-generator (by f) — an open-source AI agent skill for creating widget plugins that are injected into prompt feeds on prompts.chat. It supports two rendering modes: standard prompt widgets using default PromptCard styling, and custom render widgets built as full React components.
  • chat-sdk (by lobehub) — a unified TypeScript SDK for building chat bots across multiple platforms, providing a single interface for deploying bot logic.
  • zustand (by lobehub)
  • data-fetching (by lobehub)