Why Your Alerts Keep Waking You Up at 3 AM
Don’t tell me this hasn’t happened to you. It’s 3 AM. Your phone is blowing up. Someone’s pinging you on Slack every 10 seconds.
You drag yourself to the laptop. CPU load high? Oh, that’s just the nightly batch job. Memory almost full? Right, cache warming.
You bump the threshold up by 10%, close the laptop, and go back to bed.
That’s called alert fatigue. And it’s a cancer on your on-call rotation.
I spent three years making every mistake in the book before I figured out how to write Prometheus alerting rules that actually work. Not the kind you copy-paste from Awesome Prometheus Alert Rules (though that repo is a decent starting point). I mean rules that catch real incidents and shut up when nothing’s wrong.
Here’s everything I learned. The hard way.
The Anatomy of an Alerting Rule
Let’s start with the basics. You’ve seen this before:
groups:
  - name: node-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is high"
          description: "Instance {{ $labels.instance }} has been above 80% for more than 5 minutes (current value: {{ $value }}%)"
Looks simple, right? But here’s the thing — most people treat the for clause like an afterthought.
The for Clause: Most People Get This Wrong
A common FAQ question: “When configuring Prometheus alerting rules, what is the purpose of the for clause?”
The official answer: the expression has to stay true for the whole duration before the alert fires. Until then, the alert sits in the pending state.
But in production, I’ve seen two extremes:
Extreme 1: for: 0s or omitted entirely. Result? A brief network blip causes CPU to spike for 15 seconds. Alert fires. Alert resolves. Alertmanager sends a firing notification and, with send_resolved enabled, a resolved one. Your phone buzzes twice. You hate your life.
Extreme 2: for: 30m. By the time the alert fires, users have already filed 3 complaints. The incident’s been resolved manually. Your alerting system is useless.
Here’s my rule of thumb:
| Metric Type | Recommended for Duration | Why |
|---|---|---|
| CPU usage | 3-5m | Short spikes are normal. Sustained high is the real problem |
| Disk space | 0-1m | Full is full. No waiting needed |
| Request latency (P99) | 2-5m | Avoid triggering on occasional slow requests |
| Error rate | 1-3m | Depends on your business tolerance |
| Node down | 1-2m | Waiting too long impacts MTTR |
My philosophy: I’d rather wait 2 extra minutes than get woken up at 3 AM for a false alarm. But for core business metrics, keep for short.
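To make one row of the table concrete, here is a minimal sketch of a P99 latency rule with a 3m for clause. The histogram name http_request_duration_seconds_bucket, the 0.5s threshold, and the group name are assumptions for illustration, not values from the setup described here. keep_firing_for is optional and only exists in newer Prometheus releases (2.42 and later).

```yaml
groups:
  - name: latency-alerts   # hypothetical group name
    rules:
      - alert: HighP99Latency
        # P99 per job over a 5m window; metric name and threshold are assumed.
        expr: histogram_quantile(0.99, sum by(job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        # The expression must stay true for 3m before the alert moves from pending to firing.
        for: 3m
        # Optional: keep firing for 5m after the expression clears, to reduce flapping.
        keep_firing_for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'P99 latency on {{ $labels.job }} is {{ printf "%.2f" $value }}s'
```

A single slow request will not hold the P99 above the threshold for three minutes straight, so it never gets past pending.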
Less Is More: The Art of Rule Design
Last year, our team inherited a system with 400+ alerting rules. Know how many were actually useful? Fewer than 20.
The rest were noise.
One rule checked: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10. Looked reasonable. Except that node was running Redis — it was designed to use all available memory. That alert fired 3 times a week. Nobody ever acted on it.
First rule of alert design: Every alert must lead to a specific action. If the person receiving it doesn’t know what to do, delete the rule.
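One way to apply this to the Redis case is sketched below: exclude hosts that are expected to run full, and attach a runbook link so the recipient knows the next step. The role label, its cache value, the alert name, and the runbook URL are all hypothetical. Your targets need some label that marks cache nodes for this to work.

```yaml
- alert: LowAvailableMemory
  # Skip nodes labeled role="cache" (hypothetical label) that are meant to use all memory.
  expr: node_memory_MemAvailable_bytes{role!="cache"} / node_memory_MemTotal_bytes{role!="cache"} * 100 < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: 'Only {{ printf "%.1f" $value }}% memory available on {{ $labels.instance }}'
    # Hypothetical URL: point at whatever document tells on-call what to do.
    runbook_url: "https://runbooks.example.com/low-available-memory"
```

If you can't write the runbook, that's a strong hint the rule shouldn't exist.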
Second rule: Don’t create one rule per instance. Aggregate.
Bad:
- alert: HighCpuOnNode1
  expr: node_cpu_...{instance="node1"} > 80
- alert: HighCpuOnNode2
  expr: node_cpu_...{instance="node2"} > 80
Good:
- alert: HighCpuUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
One rule. Covers all nodes. Done.
Alerting Rules vs Recording Rules: Don’t Mix Them Up
Prometheus has two rule types: alerting rules and recording rules.
Recording rules precompute complex queries and store them as new time series. Your alerting rules can then reference those precomputed metrics.
I’ve seen people shove complex PromQL directly into alerting rules, often the same expensive expression copied into several rules and dashboards. Prometheus has to recompute it for every one of them. Here’s the right way:
# First, define recording rules
groups:
  - name: node-recording-rules
    rules:
      # Name follows level:metric:operations; the level is "instance" because the result keeps that label
      - record: instance:node_cpu_usage:avg_rate5m
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  # Then, define alerting rules
  - name: node-alerting-rules
    rules:
      - alert: HighCpuUsage
        expr: instance:node_cpu_usage:avg_rate5m > 80
        for: 5m
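Now the alert just reads a precomputed series. The recording rule still runs rate() once per evaluation interval, but every alert and dashboard that needs the value reuses that one result instead of recomputing it. With expensive expressions shared across many rules, the performance difference is noticeable.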
Common Pitfalls in Alert Rule Configuration
Pitfall 1: Label Abuse
Some people put severity: critical on everything. If everything is critical, nothing is.
Here’s my classification:
| Level | Meaning | Response Time |
|---|---|---|
| critical | Service unavailable or data loss | Within 15 minutes |
| warning | Degraded service or risk | Within 1 hour |
| info | Needs attention, not urgent | Next business day |
Pitfall 2: Empty Annotations
This drives me crazy:
annotations:
  summary: "CPU usage is high"
High how? 80%? 95%? The person on-call has to go check Prometheus themselves. Always include {{ $value }}:
annotations:
  summary: "CPU usage is high ({{ $value }}%)"
Pitfall 3: Forgetting Alertmanager
Prometheus generates alerts. Alertmanager sends notifications. I’ve seen people configure perfect alerting rules and wonder why they never got notified.
Alertmanager routing is the glue:
route:
  receiver: 'team-email'
  routes:
    # "match" is deprecated in favor of "matchers" since Alertmanager 0.22
    - matchers:
        - severity="critical"
      receiver: 'team-pager'
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
  - name: 'team-pager'
    webhook_configs:
      - url: 'http://pagerduty-webhook'
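Routing also decides how many notifications a single incident produces. Here is a hedged sketch of grouping and inhibition that reuses the team-email receiver from above. The timing values and the labels to group by are assumptions to tune for your own setup, not measured recommendations.

```yaml
route:
  receiver: 'team-email'
  # Bundle alerts that share these labels into one notification.
  group_by: ['alertname', 'job']
  group_wait: 30s        # wait before sending the first notification for a new group
  group_interval: 5m     # wait before notifying about new alerts added to the group
  repeat_interval: 4h    # wait before re-sending a notification that is still firing
inhibit_rules:
  # While a critical alert fires for an instance, mute warnings for that same instance.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
```

With this in place, a node that goes down produces one page, not a page plus a pile of warnings about its disk and CPU.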
Production-Ready Alert Configuration
Here’s a config we’ve been running for 8 months. False alarm rate: under 5%.
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been unreachable for more than 2 minutes"
      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage > 85% on {{ $labels.instance }} ({{ $labels.mountpoint }})"
          # $value is already on a 0-100 scale; humanizePercentage expects a 0-1 ratio and would print 8700%
          description: 'Current value: {{ printf "%.1f" $value }}%'
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by instance so the division matches on the same label set.
        # Without sum by(), each 5xx series is divided by itself and the ratio is always 100%.
        expr: sum by(instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(instance) (rate(http_requests_total[5m])) * 100 > 5
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5% on {{ $labels.instance }}"
          description: 'Current error rate: {{ printf "%.1f" $value }}%'
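A config like this can be checked before it pages anyone. Below is a sketch of a promtool unit test for the NodeDown rule, assuming the rules above are saved as alerts.yml next to the test file. The file names and the node1:9100 instance are made up for the example.

```yaml
# alerts_test.yml (hypothetical file name)
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Target is up for two samples, then down from the 2m mark onward.
      - series: 'up{job="node-exporter", instance="node1:9100"}'
        values: '1 1 0 0 0 0 0'
    alert_rule_test:
      # At 3m the target has been down for 1m: still pending, so nothing is firing.
      - eval_time: 3m
        alertname: NodeDown
        exp_alerts: []
      # At 5m the for: 2m window has passed, so the alert is firing.
      - eval_time: 5m
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: node1:9100
            exp_annotations:
              summary: "Node node1:9100 is down"
              description: "Node node1:9100 has been unreachable for more than 2 minutes"
```

Run it with promtool test rules alerts_test.yml. The same approach works for checking that a short spike does not fire an alert.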
FAQ: Hard Facts for Hard Questions
Q: When configuring Prometheus alerting rules, what is the purpose of the for clause?
A: It requires the expression to stay true for the specified duration before the alert fires; until then the alert is pending. This prevents transient spikes from triggering false alarms. Set it to 1-5 minutes depending on your metric’s volatility.
Q: How to setup alerts in Prometheus?
A: Three steps:
- Create a rules file (e.g., alerts.yml) with alert definitions
- Reference it in Prometheus config under rule_files
- Configure Alertmanager with receivers (email, Slack, PagerDuty, etc.) and point Prometheus at it in the alerting section
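As a rough sketch, the Prometheus side of those steps looks like the fragment below. The file name alerts.yml and the localhost:9093 address are placeholders; 9093 is Alertmanager's default port, but your host will differ.

```yaml
# prometheus.yml (fragment)
rule_files:
  - "alerts.yml"               # example file name for the rules file
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093" # assumed Alertmanager address
```

Before reloading Prometheus, promtool check rules alerts.yml will catch syntax errors in the rules file.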
Q: What is the alert rule in Prometheus?
A: A YAML configuration that defines when to trigger an alert based on a PromQL expression. Each rule has a name, expression, optional for duration, labels, and annotations.
Q: How to create alert rules?
A: Write a .yml file with this structure:
groups:
  - name: example
    rules:
      - alert: AlertName
        expr: promql_expression
        for: duration
        labels:
          severity: level
        annotations:
          summary: "description"
Then load it in your Prometheus config.
Best Practices Summary
| Practice | Description | Priority |
|---|---|---|
| Set for wisely | 1-5m based on metric volatility | High |
| Use recording rules | Precompute complex queries | High |
| Keep rules lean | Only alerts that lead to action | High |
| Include current values | Use {{ $value }} in annotations | Medium |
| Label severity properly | Differentiate critical/warning/info | Medium |
| Prevent alert storms | Use Alertmanager grouping and inhibition | Medium |
| Review quarterly | Remove stale rules | Low |
Final Thoughts
Alerting rules aren’t hard to write. But writing good ones? That takes experience.
The goal isn’t to have the most alerts. It’s to have the right alerts. Alerts that tell you exactly what’s broken and what to do about it.
Start today. Delete the noise. Fix your for clauses. Add {{ $value }} to your annotations.
Your phone — and your sleep schedule — will thank you.