Ops Notes

Prometheus Alerting Rules: A No-Bullshit Guide to Configuring Them Right

· InfraOps Router · SRE & Observability

Why Your Alerts Keep Waking You Up at 3 AM

Don’t tell me this hasn’t happened to you. It’s 3 AM. Your phone is blowing up. Someone’s pinging you on Slack every 10 seconds.

You drag yourself to the laptop. CPU load high? Oh, that’s just the nightly batch job. Memory almost full? Right, cache warming.

You bump the threshold up by 10%, close the laptop, and go back to bed.

That’s called alert fatigue. And it’s a cancer on your on-call rotation.

I spent three years making every mistake in the book before I figured out how to write Prometheus alerting rules that actually work. Not the kind you copy-paste from Awesome Prometheus Alert Rules (though that repo is a decent starting point). I mean rules that catch real incidents and shut up when nothing’s wrong.

Here’s everything I learned. The hard way.

The Anatomy of an Alerting Rule

Let’s start with the basics. You’ve seen this before:

groups:
  - name: node-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is high"
          description: "Instance {{ $labels.instance }} has been above 80% for more than 5 minutes (current value: {{ $value }}%)"

Looks simple, right? But here’s the thing — most people treat the for clause like an afterthought.

The for Clause: Most People Get This Wrong

Someone on the FAQ asked: “When configuring Prometheus alerting rules, what is the purpose of the for clause?”

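The official answer: the expression has to stay true for that whole duration before the alert fires. Until then the alert sits in the pending state and nothing reaches Alertmanager.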

But in production, I’ve seen two extremes:

Extreme 1: for: 0s or omitted entirely. Result? A network jitter causes CPU to spike for 15 seconds. Alert fires. Alert resolves. Alertmanager sends 2 notifications. Your phone buzzes twice. You hate your life.

Extreme 2: for: 30m. By the time the alert fires, users have already filed 3 complaints. The incident’s been resolved manually. Your alerting system is useless.

Here’s my rule of thumb:

| Metric Type | Recommended for Duration | Why |
| --- | --- | --- |
| CPU usage | 3-5m | Short spikes are normal. Sustained high is the real problem |
| Disk space | 0-1m | Full is full. No waiting needed |
| Request latency (P99) | 2-5m | Avoid triggering on occasional slow requests |
| Error rate | 1-3m | Depends on your business tolerance |
| Node down | 1-2m | Waiting too long impacts MTTR |

My philosophy: I’d rather wait 2 extra minutes than get woken up at 3 AM for a false alarm. But for core business metrics, keep for short.
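
To make the timing concrete, here’s a sketch of the CPU rule from earlier with the pending window spelled out. The group name is made up, and keep_firing_for is my addition, not part of the original rule: it assumes Prometheus 2.42 or newer, and it’s one way to stop a flapping condition from resolving and re-firing.

```yaml
groups:
  - name: for-clause-sketch   # hypothetical group name
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        # The expression must be true on every evaluation for 5m.
        # Until then the alert is "pending" and no notification goes out.
        for: 5m
        # Assumption: Prometheus >= 2.42. Once firing, the alert stays firing
        # for 2m after the expression stops matching, so a brief dip below
        # the threshold does not produce a resolve/fire pair.
        keep_firing_for: 2m
        labels:
          severity: warning
```

Treat the 2m value as a starting point, not a recommendation. It trades slightly slower resolution for fewer notifications.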

Less Is More: The Art of Rule Design

Last year, our team inherited a system with 400+ alerting rules. Know how many were actually useful? Fewer than 20.

The rest were noise.

One rule checked: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10. Looked reasonable. Except that node was running Redis — it was designed to use all available memory. That alert fired 3 times a week. Nobody ever acted on it.

First rule of alert design: Every alert must lead to a specific action. If the person receiving it doesn’t know what to do, delete the rule.
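
Here’s roughly how that memory rule could look once it passes the “specific action” test. This is a sketch under assumptions: the role label is hypothetical (it presumes you attach one to your scrape targets), and the runbook URL is a placeholder, not a real page.

```yaml
groups:
  - name: memory-alerts   # hypothetical group name
    rules:
      - alert: LowAvailableMemory
        # role is a hypothetical target label. Nodes labelled role="redis"
        # are excluded because they are expected to use most of their memory.
        expr: node_memory_MemAvailable_bytes{role!="redis"} / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Available memory below 10% on {{ $labels.instance }}"
          # Placeholder URL: point this at the steps the on-call should follow.
          runbook_url: "https://runbooks.example.com/low-available-memory"
```

If you can’t fill in the runbook link, that’s a strong hint the rule shouldn’t exist.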

Second rule: Don’t create one rule per instance. Aggregate.

Bad:

- alert: HighCpuOnNode1
  expr: node_cpu_...{instance="node1"} > 80
- alert: HighCpuOnNode2
  expr: node_cpu_...{instance="node2"} > 80

Good:

- alert: HighCpuUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

One rule. Covers all nodes. Done.

Alerting Rules vs Recording Rules: Don’t Mix Them Up

Prometheus has two rule types: alerting rules and recording rules.

Recording rules precompute complex queries and store them as new time series. Your alerting rules can then reference those precomputed metrics.

I’ve seen people shove complex PromQL directly into alerting rules. Prometheus chokes every time it evaluates them. Here’s the right way:

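groups:
  # First, the recording rule: evaluate the expensive expression once per cycle
  - name: node-recording-rules
    rules:
      # Named level:metric:operations. The level is "instance" because the query aggregates by instance
      - record: instance:node_cpu_usage:avg_rate5m
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

  # Then, the alerting rule: compare the stored series against the threshold
  - name: node-alerting-rules
    rules:
      - alert: HighCpuUsage
        expr: instance:node_cpu_usage:avg_rate5m > 80
        for: 5m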
        for: 5m

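Now the rate() query runs once per evaluation cycle, in the recording rule, and every alert or dashboard that needs the result reads the stored series instead of recomputing it. With a single alert the saving is small. With a dozen rules and a few dashboards sharing the same expression, the performance difference is noticeable.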

Common Pitfalls in Alert Rule Configuration

Pitfall 1: Label Abuse

Some people put severity: critical on everything. If everything is critical, nothing is.

Here’s my classification:

| Level | Meaning | Response Time |
| --- | --- | --- |
| critical | Service unavailable or data loss | Within 15 minutes |
| warning | Degraded service or risk | Within 1 hour |
| info | Needs attention, not urgent | Next business day |

Pitfall 2: Empty Annotations

This drives me crazy:

annotations:
  summary: "CPU usage is high"

High how? 80%? 95%? The person on-call has to go check Prometheus themselves. Always include {{ $value }}:

annotations:
  summary: "CPU usage is high ({{ $value }}%)"

Pitfall 3: Forgetting Alertmanager

Prometheus generates alerts. Alertmanager sends notifications. I’ve seen people configure perfect alerting rules and wonder why they never got notified.

Alertmanager routing is the glue:

route:
  receiver: 'team-email'
  routes:
    # matchers replaces the deprecated match field (Alertmanager 0.22+)
    - matchers:
        - severity="critical"
      receiver: 'team-pager'
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
  - name: 'team-pager'
    webhook_configs:
      - url: 'http://pagerduty-webhook'

Production-Ready Alert Configuration

Here’s a config we’ve been running for 8 months. False alarm rate: under 5%.

groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been unreachable for more than 2 minutes"

      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage > 85% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
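          # $value is already a percentage here, so format it directly;
          # humanizePercentage expects a 0-1 ratio and would print 87 as 8700%
          description: 'Current value: {{ printf "%.1f" $value }}%'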

  - name: app-alerts
    rules:
      - alert: HighErrorRate
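        # Sum both sides per instance. Dividing the raw series matches each 5xx series
        # against itself (same labels), which yields 100% whenever any 5xx occurs.
        expr: sum by(instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(instance) (rate(http_requests_total[5m])) * 100 > 5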
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% on {{ $labels.instance }}"
          # $value is already a percentage, so humanizePercentage would inflate it by 100x
          description: 'Current error rate: {{ printf "%.1f" $value }}%'
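
One way to check that a rule like NodeDown behaves the way you think before it pages anyone is a promtool unit test. The sketch below assumes the rules above are saved as alerts.yml next to the test file. The instance name node1:9100 and the test file name are made up for the example.

```yaml
# alerts_test.yml (hypothetical file name)
# Run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml              # assumption: the rules above live in this file

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: up for two minutes, then down from 2m onward
      - series: 'up{job="node-exporter", instance="node1:9100"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # At 3m the target has been down for only 1m: still pending, nothing firing
      - eval_time: 3m
        alertname: NodeDown
        exp_alerts: []
      # At 5m the 2m "for" window has passed, so the alert is firing
      - eval_time: 5m
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: node1:9100
            exp_annotations:
              summary: "Node node1:9100 is down"
              description: "Node node1:9100 has been unreachable for more than 2 minutes"
```

The 3m check is the useful one: it shows the for clause is doing its job and a short outage stays quiet.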

FAQ: Hard Facts for Hard Questions

Q: When configuring Prometheus alerting rules, what is the purpose of the for clause?

A: It requires the expression to stay true for the specified duration before the alert fires. Until then the alert is pending and no notification is sent. This prevents transient spikes from triggering false alarms. Set it to 1-5 minutes depending on your metric’s volatility.

Q: How do I set up alerts in Prometheus?

A: Three steps:

  1. Create a rules file (e.g., alerts.yml) with alert definitions
  2. Reference it in Prometheus config under rule_files
  3. Point Prometheus at Alertmanager in the alerting section of its config, then configure Alertmanager with receivers (email, Slack, PagerDuty, etc.)
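
On the Prometheus side, the steps above might look like the fragment below. It’s a minimal sketch: the alertmanager hostname is an assumption about your environment, and alerts.yml is just the example file name from step 1.

```yaml
# prometheus.yml (fragment)
rule_files:
  - "alerts.yml"                  # the rules file from step 1

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093" # hypothetical host; 9093 is Alertmanager's default port
```

Before reloading, promtool check rules alerts.yml will catch YAML and PromQL syntax errors. If the alerting block is missing, rules still evaluate and show up in the Prometheus UI, but nothing is ever sent.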

Q: What is an alerting rule in Prometheus?

A: A YAML configuration that defines when to trigger an alert based on a PromQL expression. Each rule has a name, expression, optional for duration, labels, and annotations.

Q: How to create alert rules?

A: Write a .yml file with this structure:

groups:
  - name: example
    rules:
      - alert: AlertName
        expr: promql_expression
        for: duration
        labels:
          severity: level
        annotations:
          summary: "description"

Then load it in your Prometheus config.

Best Practices Summary

| Practice | Description | Priority |
| --- | --- | --- |
| Set for wisely | 1-5m based on metric volatility | High |
| Use recording rules | Precompute complex queries | High |
| Keep rules lean | Only alerts that lead to action | High |
| Include current values | Use {{ $value }} in annotations | Medium |
| Label severity properly | Differentiate critical/warning/info | Medium |
| Prevent alert storms | Use Alertmanager grouping and inhibition | Medium |
| Review quarterly | Remove stale rules | Low |
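
Grouping and inhibition are the two items in that table I haven’t shown, so here’s a rough sketch of both in an Alertmanager config. The timing values are illustrative, not tuned, and the inhibition rule assumes your NodeDown and warning alerts share the same instance label.

```yaml
# alertmanager.yml (fragment)
route:
  receiver: 'team-email'
  # Alerts sharing these labels are batched into a single notification
  group_by: ['alertname', 'instance']
  group_wait: 30s        # wait for related alerts before the first notification
  group_interval: 5m     # minimum gap between updates for the same group
  repeat_interval: 4h    # re-send a still-firing alert at most this often

inhibit_rules:
  # While NodeDown is firing for an instance, mute warning-level alerts
  # from that same instance: they are symptoms, not separate problems.
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity="warning"
    equal: ['instance']

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'   # also needs SMTP settings under global, omitted here
```

Grouping cuts the number of messages. Inhibition cuts the number of distinct problems you’re told about. You usually want both.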

Final Thoughts

Alerting rules aren’t hard to write. But writing good ones? That takes experience.

The goal isn’t to have the most alerts. It’s to have the right alerts. Alerts that tell you exactly what’s broken and what to do about it.

Start today. Delete the noise. Fix your for clauses. Add {{ $value }} to your annotations.

Your phone — and your sleep schedule — will thank you.