
Effective Alerting: Less Noise, More Signal

November 20, 2024
5 min read
#alerting #oncall #observability #sre

The worst kind of alert is the one you've learned to ignore. If your team treats pages like background noise, your alerting system has failed.

The Alert Fatigue Problem

Alert fatigue happens when:

  • Too many alerts fire
  • Alerts fire for non-critical issues
  • The same alerts fire repeatedly without resolution
  • Alerts lack actionable information

The result? Engineers ignore pages. Real incidents get missed.

Principles of Good Alerting

1. Alert on Symptoms, Not Causes

Bad Alert:

Database connection pool at 80% capacity

Good Alert:

API response time P95 > 2s for 5 minutes
User-facing impact detected

The first is a potential cause. The second is a user-impacting symptom.

Why This Matters

Causes are implementation details. You might have ten different causes for slow API responses. But there's only one symptom: users are experiencing degraded service.

Alert on the symptom. Investigate causes during the incident.
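
As a concrete illustration, here is a minimal sketch of the symptom alert above as a Prometheus rule, assuming the API exposes an http_request_duration_seconds histogram (the metric, label names, and alert name are illustrative):

- alert: APILatencyHigh
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket{service="api"}[5m])) by (le)
    ) > 2
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "P95 API latency above 2s for 5 minutes; users are seeing slow responses"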

2. Make Alerts Actionable

Every alert must answer three questions:

  1. What is broken?
  2. Why does it matter?
  3. What should I do about it?

Example Alert Message:

CRITICAL: API Service Degraded

Impact: Users experiencing 2-5s delays on checkout
Affected: 15% of requests (EU region)
Duration: 7 minutes

Runbook: https://wiki/runbooks/api-degradation
Dashboard: https://grafana/api-overview
Logs: https://kibana/api-errors?time=now-15m

Likely causes:
- Database connection pool exhaustion
- Downstream service timeout
- Traffic spike

This tells you everything you need to know before you even open your laptop.
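
In Prometheus terms, most of that context can travel with the alert as annotations on the rule itself. A rough sketch, reusing the runbook and dashboard links from the message above (the expression, metric name, and labels are placeholders):

- alert: APIServiceDegraded
  expr: api:latency_p95_seconds{region="eu"} > 2
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "API Service Degraded"
    impact: "Users experiencing 2-5s delays on checkout"
    runbook_url: "https://wiki/runbooks/api-degradation"
    dashboard: "https://grafana/api-overview"
    logs: "https://kibana/api-errors?time=now-15m"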

3. Use Alert Thresholds Wisely

Thresholds should be based on:

  • Historical data (what's normal?)
  • Business impact (when do users notice?)
  • Trend analysis (is it getting worse?)

Single Threshold (Too Simple):

- alert: HighErrorRate
  expr: error_rate > 5   # PromQL has no % literal; error_rate here is a percentage

Better: Multiple Thresholds

# Warning: Something's off
- alert: ErrorRateElevated
  expr: error_rate > 1          # 1% error rate
  for: 5m
  labels:
    severity: warning

# Critical: Users are impacted
- alert: ErrorRateCritical
  expr: error_rate > 5          # 5% error rate
  for: 2m
  labels:
    severity: critical

4. Implement Alert Suppression

Not every warning needs to wake someone up at 3am.

Severity Levels:

  • Critical: Page immediately (user-impacting, revenue-affecting)
  • Warning: Ticket creation, investigate during business hours
  • Info: Logged, no action required
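
In Alertmanager, these severity levels map naturally onto routes. A minimal sketch, assuming receivers named pagerduty, ticket_system, and a no-op "null" receiver are defined elsewhere in the config:

route:
  receiver: ticket_system       # default: warnings become tickets
  routes:
    - match:
        severity: critical
      receiver: pagerduty       # page immediately
    - match:
        severity: info
      receiver: "null"          # logged only, no notification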

Time-Based Suppression:

# Route warnings to Slack during business hours and to the ticket system
# outside them; critical alerts keep paging via the default route.
# Uses Alertmanager time_intervals (v0.24+); receiver names are examples.
time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "09:00"
            end_time: "18:00"

route:
  receiver: pagerduty              # default: critical alerts page
  routes:
    - match:
        severity: warning
      routes:
        - receiver: slack
          active_time_intervals: [business_hours]
          continue: true
        - receiver: ticket_system
          mute_time_intervals: [business_hours]

Alert Design Patterns

Pattern 1: Rate of Change

Often more useful than absolute values.

# Alert on sudden traffic drop (potential outage)
rate(requests_total[5m]) < 0.5 * rate(requests_total[5m] offset 1h)

Pattern 2: Error Budget Burn Rate

SLO-based alerting catches issues before they violate SLAs.

# Fast burn: error rate running at 5x the budgeted rate
(
  1 - (sum(rate(requests_success[1h])) / sum(rate(requests_total[1h])))
) > (5 * 0.001)  # 99.9% SLO (0.1% error budget), 5x burn rate
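
Wrapped in a rule, this is commonly paired with a shorter window so the alert stops firing soon after the error rate recovers. A sketch under the same 99.9% SLO assumption (metric names as above):

- alert: ErrorBudgetFastBurn
  expr: |
    (1 - sum(rate(requests_success[1h])) / sum(rate(requests_total[1h]))) > (5 * 0.001)
    and
    (1 - sum(rate(requests_success[5m])) / sum(rate(requests_total[5m]))) > (5 * 0.001)
  labels:
    severity: critical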

Pattern 3: Predictive Alerts

Alert before the problem becomes critical.

# Disk will be full in 4 hours at current rate
predict_linear(disk_usage[1h], 4*3600) > 0.95

Alert Tuning Process

Step 1: Collect Data

For two weeks, track:

  • All alerts that fired
  • Whether they required action
  • Time to resolve
  • False positive rate
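
If Prometheus is doing the alerting, its built-in ALERTS series gives you the first item (which alerts fired, and for how long) almost for free. A rough starting query; the 14-day range matches the two-week window:

# Approximate time each alert spent firing over two weeks
# (count of samples in the firing state, sorted noisiest-first)
sort_desc(
  sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[14d]))
)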

Step 2: Categorize

  • Actionable alerts: Real issues that required intervention
  • Non-actionable alerts: Fired, but no action was taken
  • False positives: Fired incorrectly

Step 3: Tune

  • High false positive rate (>20%): Increase threshold or duration
  • Missed incidents: Decrease threshold or add new alert
  • Noisy but important: Add suppression rules, don't delete

Step 4: Document

Every alert needs a runbook. Template:

# Alert: [Name]

## Meaning
What this alert indicates in plain English.

## Impact
What happens if this alert is ignored.

## Investigation Steps
1. Check dashboard X
2. Review logs for Y
3. Verify Z is responding

## Resolution
- If cause A: do B
- If cause C: do D
- If unknown: escalate to team E

Common Alerting Mistakes

Mistake 1: Alerting on Everything

You don't need an alert for every metric. Focus on:

  • User-facing symptoms
  • Capacity/resource exhaustion
  • Security events

Mistake 2: Stateless Alerts

Alerts should be stateful. If an issue auto-resolves, the alert should auto-clear.

Bad:

alert: DiskAlmostFull
expr: disk_usage > 0.8
# Fires on any momentary spike and flaps while usage hovers around 80%

Good:

alert: DiskAlmostFull
expr: disk_usage > 0.8
for: 5m
# Fires only after the condition has held for 5 minutes; resolves automatically when it clears

Mistake 3: Alert Flooding

When one component fails, dozens of alerts fire.

Solution: Alert Dependencies

# Only notify on the root cause: inhibit dependent alerts while the
# database alert is firing (the "cluster" label used for scoping is an example)
inhibit_rules:
  - source_match:
      service: database
      severity: critical
    target_match:
      severity: warning
    equal: ["cluster"]

Alert Testing

Alerts are code. Test them.

Unit Tests

# Promtool-format unit test: verify the alert fires (run with: promtool test rules <file>)
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'error_rate{service="api"}'
        values: '0 0 0 10 10 10'
    alert_rule_test:
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              service: api

Chaos Testing

Periodically trigger alerts intentionally:

  • Verify alert fires
  • Verify runbook is accurate
  • Verify escalation path works
  • Verify resolution procedures
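
A lightweight complement to periodic chaos tests is an always-firing watchdog alert: if it ever stops arriving at the paging or ticketing system, the alerting pipeline itself is broken. A sketch (the alert name and severity label are illustrative):

- alert: AlertingPipelineWatchdog
  expr: vector(1)               # always true, so the alert is always firing
  labels:
    severity: none
  annotations:
    summary: "Always firing; its absence downstream means the alert pipeline is broken"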

Conclusion

Good alerting is hard. It requires:

  • Understanding what actually matters to users
  • Balancing sensitivity with noise
  • Continuous tuning based on feedback
  • Clear documentation

But when done right, alerts become trusted signals instead of background noise.

You respond quickly because you trust that every page matters.

That's the goal.