Splunk Behavioral Detection
Below are tangible, actionable examples of how to implement a behavioral/baseline-based detection strategy for identifying potential brute force attempts on VPN portals in Splunk. These approaches often leverage Splunk Enterprise Security (ES) features such as correlation searches, Risk-Based Alerting (RBA), and sometimes the Splunk Machine Learning Toolkit (MLTK) for more sophisticated analysis.
1. Behavior/Baseline Example: Per-User Historical Baseline
Goal: Detect when a specific user’s authentication failures exceed their normal historical pattern.
A. Building a Baseline
- Aggregate Historical Data
- Create a scheduled search (or a summary index) to calculate the average (and possibly standard deviation) of daily or hourly authentication failures for each user over a historical period (e.g., 30 days).
- Schedule a search that populates a summary index once a day, storing the daily average failures (daily_avg) and standard deviation (daily_stdev) per user in a separate index (e.g., baseline_vpn_fails).
- Refine the Baseline
- Depending on environment size and activity, you might want to gather more data points (e.g., multiple weeks or months).
- You could also store aggregated metrics by hour of day (peak vs. off hours) if you notice time-of-day usage patterns.
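The baseline arithmetic itself is small. The following is a minimal Python sketch of it, not an SPL search: it assumes failed logins have already been exported as (user, day) pairs, and the function name build_baseline is hypothetical. In Splunk the same numbers would come from the scheduled search described above.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(events):
    """Compute a per-user baseline of daily authentication failures.

    events: iterable of (user, day) tuples, one tuple per failed login.
    Returns {user: (daily_avg, daily_stdev)}.
    """
    # Count failures per (user, day) bucket.
    per_day = defaultdict(lambda: defaultdict(int))
    for user, day in events:
        per_day[user][day] += 1

    baseline = {}
    for user, days in per_day.items():
        # Days on which a user had no failures produce no events,
        # so they are absent from these counts.
        counts = list(days.values())
        # Sample standard deviation needs two or more points; use 0.0 otherwise.
        daily_stdev = stdev(counts) if len(counts) > 1 else 0.0
        baseline[user] = (mean(counts), daily_stdev)
    return baseline
```

Users with only one day of history get a standard deviation of 0.0 here, which makes any later threshold very tight for them; collecting more history before alerting on such users is one way to handle that.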
B. Comparing Real-Time Data Against Baseline
- Create a Correlation Search (or Scheduled Search) in Splunk ES:
- Runs periodically (e.g., every 5 minutes) to check real-time failures against the user’s baseline.
- Set Alert or Create a Notable
- If the user’s 5-minute failure count is greater than (average + 3 × std dev), you can create an alert or a notable event.
- Risk-Based Alerting (RBA)
- Instead of immediately triggering a high-priority alert, you can assign a “risk score” to the user if they exceed their baseline. If that risk score pushes them above a certain threshold within a larger time window, create a high-priority notable in ES.
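The comparison step can be sketched as follows. This is an illustrative Python version of the average + 3 × standard deviation rule, assuming the baseline was computed over the same window length as the live count (otherwise the two numbers are not comparable). The names exceeds_baseline and flag_users are hypothetical.

```python
def exceeds_baseline(fail_count, avg, std, k=3.0):
    """Return True when fail_count is strictly above avg + k * std."""
    return fail_count > avg + k * std


def flag_users(window_counts, baseline, k=3.0):
    """Return the users whose current failure count exceeds their baseline.

    window_counts: {user: failures in the current window}
    baseline:      {user: (avg, std)} for the same window length
    """
    flagged = []
    for user, count in window_counts.items():
        if user not in baseline:
            # No history for this user: skipped in this sketch.
            # A real search might handle new users with a separate rule.
            continue
        avg, std = baseline[user]
        if exceeds_baseline(count, avg, std, k):
            flagged.append(user)
    return flagged
```

In an RBA setup, each flagged user would receive risk points rather than an immediate alert.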
2. IP Range Baseline (Geo-Based or Network Segment)
Goal: Detect anomalies for specific IP ranges (e.g., a known office network or typical user geo-locations).
- Identify Normal Access Patterns
- For each distinct src_ip (or netblock like 10.0.0.0/8), gather historical metrics for the number of authentication failures and successes.
- If you use geo-IP data, you can baseline how many attempts typically come from certain countries or regions.
- Baseline Search
- Similar to the user-based approach, create a summary of average failures from each IP range or geo-location.
- Real-Time Comparison
- In a subsequent correlation search, if an IP from a normally quiet location suddenly has a spike in failures, generate a notable.
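Grouping by netblock and checking for a spike can be sketched in Python with the standard ipaddress module. The /24 prefix, the factor of 3, and the minimum count of 5 are illustrative assumptions rather than recommended values, and the function names are hypothetical.

```python
import ipaddress
from collections import Counter


def netblock_of(src_ip, prefix=24):
    """Return the network that contains src_ip, e.g. '10.1.2.0/24'."""
    return str(ipaddress.ip_network(f"{src_ip}/{prefix}", strict=False))


def failures_by_netblock(src_ips, prefix=24):
    """Count failed attempts per netblock; src_ips has one entry per failure."""
    return Counter(netblock_of(ip, prefix) for ip in src_ips)


def spiking_netblocks(current, baseline_avg, factor=3.0, min_count=5):
    """Return netblocks whose current failures are well above their average.

    current:      {netblock: failures in the current window}
    baseline_avg: {netblock: average failures per window}; unseen blocks count as 0
    """
    return sorted(
        block
        for block, count in current.items()
        if count >= min_count and count > factor * baseline_avg.get(block, 0.0)
    )
```

The min_count floor keeps a normally silent netblock from being flagged on one or two stray failures.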
3. Machine Learning Toolkit Example
Goal: Use a fully data-driven approach to detect outliers in authentication events without manually configuring thresholds.
- Enable the Machine Learning Toolkit (MLTK)
- Install MLTK from Splunkbase if it isn’t already.
- Build an MLTK Experiment
- In Splunk Web, go to Machine Learning Toolkit > Experiments.
- Use the “Detect Numeric Outliers” experiment type or a similar outlier-detection assistant (available types vary by MLTK version).
- Feed it your historical authentication failures, grouped by user/IP over time.
- Train Your Model
- Select your fields (e.g., user, fail_count) and timeframe to build the training set.
- The MLTK will generate a model that flags unusual spikes in fail_count per user.
- Apply the Model in Real-Time
- After training, you can schedule a search that uses the apply command to run the model on new data and look for outliers.
- If isOutlier = 1, the count of failures for that user is anomalous relative to historical patterns.
- Generate Alerts / Notables
- Convert this search into an alert or correlation search in Splunk ES.
- Add risk-based logic: for each outlier event, add X points of risk to that user. If the user’s risk exceeds a certain threshold, generate a notable event.
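To make the idea of a data-driven outlier flag concrete, here is a minimal Python sketch of a generic median-absolute-deviation check. It is not the MLTK's implementation and does not use the fit or apply commands; it only illustrates how a 0/1 flag comparable to isOutlier can be derived from a series of fail_count values. The 3.5 cutoff is a commonly used default for this kind of score and is an assumption here.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return a 0/1 outlier flag per value using the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: flag anything that differs from it.
        return [int(v != med) for v in values]
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data is roughly normal.
    return [int(0.6745 * abs(v - med) / mad > threshold) for v in values]
```

Because it is built on medians, a single large spike does not inflate the threshold the way it would with a mean and standard deviation.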
4. Risk-Based Alerting (RBA) Workflows in Splunk ES
- Create a Risk Object
- In Splunk ES, each user or IP can be a “risk object.”
- For example, if a user’s current login failure is above baseline, add 20 risk points to that user. If it’s significantly above baseline (e.g., 5x normal), add 50 points.
- Use a Risk Threshold
- A separate correlation search monitors risk scores across all risk objects.
- If a user’s risk score > 100 in the last 24 hours, create a notable event: “Possible Compromise of [User].”
- Advantages
- A single borderline brute-force spike does not have to trigger a critical alert on its own.
- Multiple suspicious events (e.g., large spike in failures + logins from unusual country) quickly elevate the user’s risk, triggering a more severe notable.
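The scoring and threshold logic above can be sketched in Python as follows. The point values (20 and 50), the 5x multiplier, the 24-hour window, and the threshold of 100 come from the examples in this section; the function names and the (risk_object, timestamp, points) event shape are hypothetical and do not reflect how Splunk ES stores risk events.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def risk_points(fail_count, baseline_avg):
    """Return 50 points at 5x baseline or more, 20 points above baseline, else 0."""
    if baseline_avg > 0 and fail_count >= 5 * baseline_avg:
        return 50
    if fail_count > baseline_avg:
        return 20
    return 0


def open_notables(risk_events, now, window=timedelta(hours=24), threshold=100):
    """Sum risk per object inside the window and return those above the threshold.

    risk_events: iterable of (risk_object, timestamp, points) tuples
    Returns {risk_object: total_points} for objects whose total exceeds threshold.
    """
    totals = defaultdict(int)
    for risk_object, when, points in risk_events:
        # Only events inside the look-back window contribute.
        if now - window <= when <= now:
            totals[risk_object] += points
    return {obj: score for obj, score in totals.items() if score > threshold}
```

A user with two severe spikes and one mild one inside 24 hours accumulates 120 points and crosses the threshold, while any one of those events alone would not.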
5. Actionable Next Steps
- Start Simple
- Begin by collecting historical data and building summary indexes for daily or weekly authentication failures per user/IP.
- Set Up a Scheduled Search or Correlation Search
- Compare real-time data to your baselines.
- Generate notables if significantly above normal levels.
- Iterate and Refine
- Add contextual data (GeoIP, user groups, known business hours) to reduce false positives.
- Move from static thresholds (e.g., average + 3×stdev) to a machine learning model if you have enough data and want more precision.
- Adopt RBA
- If you have Splunk Enterprise Security, leverage risk-based alerting to avoid “alert fatigue.”
- Combine multiple suspicious behaviors into a single high-severity notable.
Conclusion
By comparing current user/IP activity to a known historical baseline, you gain stronger detection coverage for brute force attempts than simple threshold-based rules. Splunk offers multiple ways to achieve this, from correlation searches in Enterprise Security to fully data-driven anomaly detection using the Machine Learning Toolkit. Over time, you can integrate these detections with Risk-Based Alerting to prioritize your most critical security events.