The Hard Truth About Correlation Rules
Look, I’ve seen SOC teams treat Splunk ES like a firehose. Configure 200 rules, get 5,000 alerts a day, and call it “security monitoring.” I took over a client’s environment last year — they had 248 correlation searches running, and 70% of their alerts were false positives.
That’s not security. That’s noise.
Today I’m sharing what actually works. Not the vendor docs regurgitated, but the stuff you learn after burning weekends at 3 AM on a rule that silently failed for two weeks.
Step 1: Data First, Rules Second
Here’s the mistake everyone makes: they jump straight into Content Management and start writing SPL without checking their data.
I ask three questions before writing a single rule:
- Do your logs cover the MITRE ATT&CK tactics you care about?
- Are your timestamps accurate? (This one will screw you)
- Are your field extractions clean?
The Data Source Check
Go to Settings > Data inputs > Intelligence Downloads in Splunk Enterprise. Filter on mitre. If you haven’t configured threat intelligence feeds here, you’re flying blind.
We run a | datamodel check first. If your data doesn’t pass CIM compliance, stop everything and fix that. Rules built on bad data are worse than no rules — they give you false confidence.
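For the timestamp question, one quick sanity check is to compare when an event claims it happened with when it was actually indexed. Here's a minimal Python sketch, assuming you've exported events with Splunk's `_time` and `_indextime` values as epoch seconds. The 300-second tolerance, the function name, and the sample hosts are all illustrative assumptions, not Splunk defaults.

```python
from typing import Iterable

# Hypothetical tolerance: flag events indexed more than 5 minutes
# away from their claimed event time (in either direction).
MAX_LAG_SECONDS = 300


def find_skewed_events(events: Iterable[dict], max_lag: int = MAX_LAG_SECONDS) -> list[dict]:
    """Return events whose index time and event time disagree by more than max_lag."""
    skewed = []
    for event in events:
        lag = event["_indextime"] - event["_time"]
        if abs(lag) > max_lag:
            skewed.append({**event, "lag_seconds": lag})
    return skewed


# Made-up sample data for illustration only
sample = [
    {"host": "web01", "_time": 1_700_000_000, "_indextime": 1_700_000_020},
    {"host": "fw02", "_time": 1_700_000_000, "_indextime": 1_700_003_600},  # indexed an hour late
    {"host": "db03", "_time": 1_700_007_200, "_indextime": 1_700_000_000},  # event "from the future"
]

for event in find_skewed_events(sample):
    print(event["host"], event["lag_seconds"])
```

A lag that sits at a clean multiple of an hour usually points at a timezone problem rather than a slow forwarder, which is why the sketch keeps the sign of the lag instead of only its size.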
Step 2: Writing Your First Correlation Search
From Splunk ES, hit Configure > Content > Content Management. Filter Type to Correlation Search.
Real Example: Brute Force Detection
We needed to detect SSH brute force. Here’s the rule:
index=linux_secure sourcetype=linux_secure "Failed password"
| bucket span=5m _time
| stats count by src_ip, dest_ip, _time
| where count > 10
| eval severity = if(count > 50, "critical", "high")
| `notable`
Looks clean, right? But here’s the problem: bucket span=5m with a threshold of 10 generated massive false positives in our environment. We switched to span=10m and added a lookup to exclude jump boxes.
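If you want to reason about the bucket size and threshold before touching production, the same logic is easy to replay offline. This is a rough Python sketch of the rule above, not what Splunk runs internally. It assumes failed logins exported as `(epoch_seconds, src_ip, dest_ip)` tuples, and the `JUMP_BOXES` set is a hypothetical stand-in for the jump box lookup.

```python
from collections import Counter

SPAN_SECONDS = 600          # 10-minute buckets, like span=10m
THRESHOLD = 10              # alert when count > 10
JUMP_BOXES = {"10.0.0.5"}   # hypothetical stand-in for the jump box lookup


def detect_brute_force(events):
    """events: iterable of (epoch_seconds, src_ip, dest_ip) failed-login tuples."""
    counts = Counter()
    for ts, src_ip, dest_ip in events:
        if src_ip in JUMP_BOXES:
            continue  # excluded sources never count toward the threshold
        bucket = ts - (ts % SPAN_SECONDS)  # snap to the start of the bucket
        counts[(src_ip, dest_ip, bucket)] += 1

    alerts = []
    for (src_ip, dest_ip, bucket), count in sorted(counts.items()):
        if count > THRESHOLD:
            severity = "critical" if count > 50 else "high"
            alerts.append({"src_ip": src_ip, "dest_ip": dest_ip,
                           "bucket": bucket, "count": count, "severity": severity})
    return alerts


# Made-up data: 12 failures from an attacker and 30 from a jump box, same bucket
events = [(1_700_000_400 + i, "203.0.113.9", "10.0.1.20") for i in range(12)]
events += [(1_700_000_400 + i, "10.0.0.5", "10.0.1.20") for i in range(30)]

print(detect_brute_force(events))
```

Replaying a day of exported events through something like this with different `SPAN_SECONDS` and `THRESHOLD` values gives you a feel for alert volume before the rule ever hits the search head.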
Configuration Details
In Content Management, you need to set:
- Cron Schedule: We use */5 * * * * for most rules. Real-time search for high-frequency rules
- Throttling: This is critical. Without it, one attacker generates 500 alerts. We throttle by src_ip for 1 hour
- Risk Score: Low=20, Medium=50, High=80, Critical=100. That’s our standard
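To see why throttling matters so much, here's a small Python sketch of the idea: once an alert fires for a key, repeats for that key are suppressed until the window expires. The `Throttle` class is hypothetical and only illustrates the behavior; Splunk handles this for you once throttling is configured on the search. The `RISK_SCORES` mapping just restates our standard from the list above.

```python
THROTTLE_SECONDS = 3600  # 1 hour, matching our throttle window
RISK_SCORES = {"low": 20, "medium": 50, "high": 80, "critical": 100}


class Throttle:
    """Suppress repeat alerts for the same key inside the throttle window."""

    def __init__(self, window: int = THROTTLE_SECONDS):
        self.window = window
        self._last_fired = {}  # key -> epoch seconds of the last alert let through

    def allow(self, key: str, now: int) -> bool:
        last = self._last_fired.get(key)
        if last is not None and now - last < self.window:
            return False  # still inside the window: suppress
        self._last_fired[key] = now
        return True


throttle = Throttle()
# 500 hits from one attacker, one per second
fired = [t for t in range(500) if throttle.allow("203.0.113.9", now=t)]
print(len(fired), RISK_SCORES["high"])  # prints: 1 80
```

Keying on src_ip means a second attacker still alerts immediately, while the first one stays quiet for the rest of the hour.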
Step 3: Tuning — The Real Work Begins
Handling False Positives
Last year, a rule flagged “DNS tunneling” every 15 minutes. Turned out a dev team was using dig for testing. Three days to find the root cause.
My rule of thumb: never delete a rule, add exceptions.
index=dns sourcetype=dns
| search query_type=TXT AND query_length>500
| search NOT [search index=asset_lookup category=dev_server | fields ip | rename ip AS src_ip]
| `notable`
Keeps the detection, kills the noise.
Performance Optimization
Honestly, Splunk correlation search performance can be brutal. One rule on 100GB/day of data brought our search head to its knees.
Here’s what works:
- Use tstats instead of stats: 10x faster if you’re using data models
- Filter early: Put index= and sourcetype= first in your search
- Avoid subsearches: They’re a disaster at scale
| Optimization | Performance Gain | Best For |
|---|---|---|
| Use tstats | 10-20x | Data model ready |
| Early filtering | 3-5x | All scenarios |
| Avoid subsearches | 2-3x | Medium data volume |
| Summary index | 5-10x | Historical analysis |
Step 4: Mapping to MITRE ATT&CK
This is where most teams drop the ball. In Content Management, every Correlation Search can map to MITRE tactics and techniques.
Our rule: every rule maps to at least one MITRE technique. When an alert fires, the analyst knows exactly what phase of the attack they’re looking at.
Our “lateral movement” rule maps to TA0008 (Lateral Movement) and T1021 (Remote Services).
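The "every rule maps to at least one technique" policy is easy to enforce with a small check in whatever pipeline manages your rule inventory. A minimal Python sketch, assuming you keep an exported inventory as a dictionary; the rule names and structure here are hypothetical, and only the lateral movement mapping comes from our actual setup.

```python
# Hypothetical rule inventory; names and layout are illustrative.
RULES = {
    "lateral_movement": {"tactics": ["TA0008"], "techniques": ["T1021"]},
    "ssh_brute_force": {"tactics": [], "techniques": []},  # not mapped yet
}


def unmapped_rules(rules: dict) -> list[str]:
    """Return the names of rules with no MITRE technique attached."""
    return sorted(name for name, mapping in rules.items()
                  if not mapping.get("techniques"))


print(unmapped_rules(RULES))  # prints: ['ssh_brute_force']
```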
Step 5: Alert Response Automation
Rules fire. Alerts pop. Then what?
We integrated with a SOAR platform. When a rule triggers:
- Auto-create an incident ticket
- Extract key fields (src_ip, dest_ip, user)
- Query threat intelligence feeds
- If high severity, page the on-call engineer
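The steps above can be sketched as a single handler. This is a Python outline under loud assumptions: `create_ticket`, `lookup_intel`, and `page_on_call` are hypothetical callables standing in for whatever your SOAR platform exposes, and treating "critical" as page-worthy alongside "high" is my assumption too. Field extraction runs first here simply because the ticket needs those fields.

```python
def handle_notable(alert: dict, create_ticket, lookup_intel, page_on_call) -> dict:
    """Run the response steps; the three callables are hypothetical integrations."""
    # Pull out only the key fields the analyst needs
    fields = {k: alert.get(k) for k in ("src_ip", "dest_ip", "user")}
    ticket_id = create_ticket(alert["rule"], fields)
    intel = lookup_intel(fields["src_ip"])
    paged = False
    # Assumption: critical alerts page as well as high ones
    if alert.get("severity") in ("high", "critical"):
        page_on_call(ticket_id)
        paged = True
    return {"ticket_id": ticket_id, "fields": fields, "intel": intel, "paged": paged}


# Stub integrations so the sketch runs on its own
pages = []
result = handle_notable(
    {"rule": "ssh_brute_force", "severity": "high", "src_ip": "203.0.113.9",
     "dest_ip": "10.0.1.20", "user": "root"},
    create_ticket=lambda rule, fields: f"INC-{rule}",
    lookup_intel=lambda ip: {"ip": ip, "known_bad": False},
    page_on_call=pages.append,
)
print(result["ticket_id"], result["paged"])
```

Passing the integrations in as arguments keeps the flow testable with stubs before it's wired to anything that can actually wake someone up.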
Pro tip: use notable with custom parameters:
| `notable` urgency=high owner=soc_team
Common Pitfalls
Time Window Trap
I’ve seen people set earliest=-30d on a correlation search. Three hours later, it’s still running. Keep time windows under 1 hour unless you have a specific reason not to.
Field Extraction Trap
Using a field in eval that doesn’t exist? The eval quietly returns null and your rule fails silently. Add | fields + src_ip, dest_ip, user, action at the end of a test run and check that each field actually comes back populated.
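If you export test results, a few lines of Python can do that field check for you instead of eyeballing it. A minimal sketch, assuming results arrive as a list of dictionaries (for example, from a JSON export); the function name and sample rows are made up for illustration.

```python
REQUIRED_FIELDS = ("src_ip", "dest_ip", "user", "action")


def missing_field_report(results: list[dict], required=REQUIRED_FIELDS) -> dict:
    """Count, per required field, how many result rows lack a usable value."""
    report = {}
    for field in required:
        missing = sum(1 for row in results if row.get(field) in (None, ""))
        if missing:
            report[field] = missing
    return report


# Made-up rows: one clean, one with an empty user, one with no user at all
rows = [
    {"src_ip": "203.0.113.9", "dest_ip": "10.0.1.20", "user": "root", "action": "failure"},
    {"src_ip": "203.0.113.9", "dest_ip": "10.0.1.20", "user": "", "action": "failure"},
    {"src_ip": "198.51.100.4", "dest_ip": "10.0.1.21", "action": "failure"},
]
print(missing_field_report(rows))  # prints: {'user': 2}
```

An empty report means every required field was populated in every row; anything else tells you which extraction to go fix first.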
Alert Flood Trap
No throttling configured. One attacker generates 500 alerts. Throttling is not optional.
FAQ
Q: What’s the difference between a correlation search and a regular alert?
A: Correlation searches combine multiple events across time and sources. A regular alert fires on a single event. “10 failed logins in 5 minutes” is correlation. “One failed login” is not.
Q: How do I test a new rule?
A: Validate the logic with | stats in a regular search first. Then deploy to a test Content Management environment. We run every new rule for 24 hours in staging before production.
Q: Too many rules killing performance?
A: Prioritize tstats, kill subsearches, and use summary indexes for historical data. Also, audit your rules quarterly and delete the ones that never fire.
Bottom Line
Configuring Splunk correlation rules isn’t a weekend project. It’s data prep, rule writing, tuning, automation, and iteration. Figure two months minimum to get it right.
But the payoff is real. We dropped false positives from 70% to 15%. Alert handling time went from 45 minutes to 8.
Good rules aren’t written. They’re tuned.