How to Turn Security Pipelines Into Gold Mines

TL;DR

This is part 2 of our 3-part series on managing pipelines and infrastructure for high-fidelity detections. Part 1 outlined strategic considerations, and this post drills into functional requirements. Part 3 will wrap up with common pitfalls and how to avoid them.   

Pipeline Ins and Outs

So you’ve considered cost, priority logs, data formats, volumes, and time frames. Now it’s time to evaluate specific requirements. Using pipeline functionality to streamline SecOps is a great example of the “shift left” engineering philosophy at work. And while routing, filtering, and transformation seem basic, these challenges have dogged security teams for two decades. 

“One of the most notorious and painful problems that has amazing staying power is of course that of data collection.” 

– Anton Chuvakin, leading security and log management expert, from “20 Years of SIEM”

So where to start? How can you filter noisy components from critical high-volume logs to reduce data volume and ensure data quality? What routing functionality do you need to move different logs to appropriate locations for processing and storage? What formats are your logs in: standardized, lightweight formats like JSON, or nonstandard formats from custom systems? 

Maybe your use cases require a combination of filtering, routing, and transformation capabilities. How do they need to interact to achieve your goals? Developing evaluation criteria around these questions will narrow down which vendors and technical approaches can power your strategy. As that evaluation comes together, you’ll want to build specific tactical processes into your plan. 

Routing The Way to Insight

Routing functionality helps balance performance and cost by letting you manage the flow of data so each stream lands on the infrastructure best suited to it.

Part one covered how different time frames require different analysis and storage capabilities. Routing makes these scenarios possible. Sending high value security data to hot storage for immediate analysis ensures rapid detection of emerging threats. Routing moderate value data to warm storage supports correlations to pinpoint more complex attacks. Routing low value data to cold storage ensures you’re prepared to fulfill compliance reporting obligations with minimum costs. 

Other routing scenarios depend on the environment and use cases. If you have high priority logs that spike at specific times, routing to different processing clusters can help balance the load and maintain performance. For international organizations, routing different regional data sources to separate destinations enables nuanced analysis based on local threat models. It also streamlines compliance reporting for different jurisdictions.
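
To make the tiering concrete, here is a minimal routing sketch in Python. The log types, tiers, and destination buckets are illustrative assumptions rather than any particular vendor’s configuration; most pipelines express the same mapping in their own routing rules.

```python
# A minimal routing sketch. Log-type names, tiers, and destinations
# below are illustrative assumptions, not a specific product's config.

ROUTING_TABLE = {
    "AWS.CloudTrail":     "hot",   # high value: immediate detections
    "Okta.SystemLog":     "hot",
    "AWS.VPCFlow":        "warm",  # moderate value: correlation over days/weeks
    "AWS.S3ServerAccess": "cold",  # low value: compliance retention
}

DESTINATIONS = {
    "hot":  "s3://acme-siem-hot",      # hypothetical bucket names
    "warm": "s3://acme-siem-warm",
    "cold": "s3://acme-siem-archive",
}

def route(event: dict) -> str:
    """Return the destination for an event based on its log type."""
    tier = ROUTING_TABLE.get(event.get("log_type", ""), "cold")  # default to cheap storage
    return DESTINATIONS[tier]

print(route({"log_type": "AWS.CloudTrail"}))  # -> s3://acme-siem-hot
```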

Panning for Gold–And Filtering the Junk

Filtering is all about dropping the noise and amplifying the signals. Perhaps a more familiar metaphor than gold panning: the proverbial needle in a haystack. How do you make it easier to find that needle? Collect less hay. The benefits of filtering include:

  • Cost Management: the most obvious benefit, but worth stating plainly: filtering out unnecessary data means reduced infrastructure costs downstream. 
  • Targeted Analysis: by focusing analysis on high-value security data and dropping irrelevant logs, your team can write more targeted and often simpler detection and query logic to pinpoint interesting behaviors.
  • Reliability: logs can be unpredictable, with fields potentially exceeding per-log size limits. By checking for expected structure before accepting data (see the sketch after this list), you can avoid parsing errors that break analysis logic. 
  • Speed: focused data sets mean faster results for detection rules and queries, which in turn means more agile and effective response and investigation workflows.
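
As an illustration of the reliability point above, here is a minimal structure check a pipeline might run before accepting an event. The required fields and size limit are assumptions; tune them to your own schema and your SIEM’s documented limits.

```python
# A minimal pre-acceptance structure check. REQUIRED_FIELDS and
# MAX_EVENT_BYTES are illustrative assumptions.
import json

REQUIRED_FIELDS = {"timestamp", "source_ip", "event_type"}
MAX_EVENT_BYTES = 256 * 1024  # hypothetical per-event size limit

def accept(raw: str) -> bool:
    """Return True only if the raw log parses and matches the expected shape."""
    if len(raw.encode("utf-8")) > MAX_EVENT_BYTES:
        return False                       # oversized events break downstream limits
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return False                       # unparseable: drop or dead-letter it
    return REQUIRED_FIELDS.issubset(event)  # missing fields break detection logic

print(accept('{"timestamp": "2024-01-01T00:00:00Z", "source_ip": "10.0.0.1", "event_type": "login"}'))  # True
```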

Where to Filter? 

You have a few options when determining where to filter. Each comes with its own technical implications and pros and cons:  

  • At the Source: many sources let you define rules for when they generate an event or alert and where it’s sent. Choosing this approach reduces noise at the point of generation, and because less data leaves the source, it decreases egress costs and conserves network bandwidth. However, it prevents you from applying filter logic uniformly across sources, and manually configuring filters one by one becomes a heavy operational burden at scale.  
  • In Transit: middleware can centralize and filter logs, with options spanning open source utilities like Fluentd and Vector, commercial software like Cribl, and native agents from legacy SIEMs like Splunk (a generic filter sketch follows this list). This approach offers more centralized control over filters and their logic, but it adds network overhead and complexity, which can impact performance. Open source options are “free” but still require implementation and maintenance, and SIEM agents cost extra and must be configured and managed per source, adding to ops overhead.
  • At the SIEM: most SIEMs offer native filtering with intuitive inclusion and exclusion logic. Managing filtering in the SIEM alongside tasks like log source configuration, detection management, investigations, and reporting further centralizes security administration tasks for more control and scalability. This approach requires little incremental cost over what you’re already paying for the SIEM’s detection, search, and response capabilities.
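
For the in-transit option, the drop logic itself is simple regardless of which tool hosts it. Below is a generic sketch in Python; real middleware expresses the same idea in its own configuration language (Fluentd filter blocks, Vector’s VRL, Cribl pipelines), and the field names here are assumptions.

```python
# A generic in-transit drop filter. Event field names and the noisy
# event types are illustrative assumptions.

NOISY_EVENT_TYPES = {"heartbeat", "keepalive", "health_check"}

def should_forward(event: dict) -> bool:
    """Drop obvious noise before it reaches the SIEM."""
    if event.get("event_type") in NOISY_EVENT_TYPES:
        return False
    if event.get("log_level") == "DEBUG":   # diagnostic chatter, rarely security-relevant
        return False
    return True

batch = [
    {"event_type": "login_failed", "log_level": "INFO"},
    {"event_type": "heartbeat", "log_level": "INFO"},
]
forwarded = [e for e in batch if should_forward(e)]
print(len(forwarded))  # -> 1
```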

What to Filter? 

Specific requirements for which logs and log components to filter depend on your threat model, but it’s a safe bet to start with high volume logs. Many SIEMs offer canned reports on volume by source, which can help identify logs that would benefit from filtering. Sources like cloud network infrastructure and key management services generate very high volumes of data, and not all of it is critical.

Some other filtering considerations are listed below:

  • Development vs. Test vs. Production Environments: disruptions to production environments are more severe than disruptions to test or development environments, so focusing on production logs while filtering out development and test logs makes sense in many cases.
  • Read vs. Write Events: read events may be good candidates for filtering since they often just indicate routine access, whereas write events are tied more directly to nefarious activity like modifying, deleting, or staging data. Some systems, like AWS CloudTrail, also offer “data events” that represent activity like downloads, transfers, or deletions, which are generally more valuable. Source-level filtering in CloudTrail determines which events are recorded from the API (see the sketch after this list); middleware and SIEM capabilities offer similar functionality.  
  • Network Traffic: instead of collecting logs on all network traffic flows, you might filter to focus only on traffic that was rejected, as that’s more indicative of an adversary’s reconnaissance attempts. Use this approach carefully: an adversary who successfully exploited something earlier in the attack chain will generate traffic that looks like normal, accepted flows.
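
As an example of the source-level option in the read vs. write bullet, CloudTrail’s event selectors can restrict what a trail records before anything is shipped. Here is a sketch using boto3’s put_event_selectors; the trail name is hypothetical, and you should weigh compliance requirements before narrowing a trail this way.

```python
# Source-level filtering sketch: record only write (change) management
# events on an existing CloudTrail trail. The trail name is hypothetical.
import boto3

cloudtrail = boto3.client("cloudtrail")

cloudtrail.put_event_selectors(
    TrailName="security-trail",  # hypothetical trail name
    EventSelectors=[
        {
            # Read-only calls like Describe*/List* are dropped at the source.
            "ReadWriteType": "WriteOnly",
            "IncludeManagementEvents": True,
        }
    ],
)
```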

Below are more specific examples of logs that are generally OK to filter: 

  • Expected internet web traffic to web apps: routine activity like web browsing doesn’t typically indicate high risk, and other controls (e.g. firewalls, IDS) already monitor it.
  • Network logs between internal hosts or services: these represent interactions between trusted hosts and services within the network and don’t typically add much value.
  • DEBUG-level or diagnostic log data from host-based apps: these focus on troubleshooting system performance but rarely contain security-relevant information useful for detections.
  • Decrypt API Calls: these primarily track legitimate operations, and other security measures (encryption key management, MFA, access controls, etc.) typically already surface unauthorized access.

The Alchemy of Transformation

Transformation increases the security value of your data by normalizing it into a standard format that aligns with your analysis logic. The result: faster detections and streamlined investigations. 

Below is a rundown of various transformation processes and how they apply to common security workflows (a condensed sketch follows the list): 

  • Copy functions promote a field buried within a nested hierarchy to its own identifier in the schema. Many logs nest important pieces of information deep within a JSON object. A copy transformation can promote that field to an indicator, so you can write simpler detections and queries against it.  
  • Rename functions are commonly used to change a field’s name from cryptic terminology to something more concise and widely understood. With clearer and more consistent naming, you’ll be less reliant on tribal knowledge, helping maintain performance if the one person who actually knows what the original field means leaves the team. 
  • Concatenate functions combine multiple fields into one. A common scenario for concatenation is combining multiple individual fields into a key value that serves as a unique identifier for a specific event. 
  • Split functions turn one field into multiple fields. For example, it’s very common for logs to combine the port and IP into a single “address” field. By splitting these apart upfront, your detections can immediately analyze incoming logs for specific ports or IPs. They don’t need logic to separate out that data before analysis. It’s already done.
  • Label functions add metadata to a specific log event, assigning a key attribute that carries important context for downstream processes. For example, labeling events as development, test, or production informs the severity and prioritization of resulting alerts.
  • Mask functions protect sensitive information from exposure within your SIEM. Masking keeps regulated data in a log hidden from unauthorized access, which may be required for compliance use cases. 
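
Here is the condensed sketch promised above, applying a few of these transforms to a single event. The field names and masking rule are assumptions for illustration; in practice these transforms are usually defined declaratively in the schema or pipeline configuration rather than in ad hoc code.

```python
# Rename, split, label, and mask transforms applied to one event dict.
# Field names and the masking rule are illustrative assumptions.

def transform(event: dict) -> dict:
    out = dict(event)

    # Rename: cryptic vendor field -> widely understood name
    if "src_ep_addr" in out:
        out["source_address"] = out.pop("src_ep_addr")

    # Split: "ip:port" into separate fields detections can target directly
    if ":" in out.get("source_address", ""):
        ip, port = out["source_address"].rsplit(":", 1)
        out["source_ip"], out["source_port"] = ip, int(port)

    # Label: tag the environment so alert severity can be tuned downstream
    out["environment"] = "production" if out.get("host", "").startswith("prod-") else "non-production"

    # Mask: keep only the last four digits of a regulated identifier
    if "card_number" in out:
        out["card_number"] = "*" * 12 + out["card_number"][-4:]

    return out

print(transform({"src_ep_addr": "10.1.2.3:4444", "host": "prod-web-01", "card_number": "4111111111111111"}))
```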

That covers the how for transformations. Another consideration is when, which goes back to the previously discussed time frames:

  • Schema On Write generally makes the most sense for high value security data headed for hot storage and immediate detections. By parsing and transforming data upfront, it’s already aligned to the logic in your detections and search queries. The drawback is that it requires more processing power upfront. 
  • Schema On Read can make more sense for lower value security data destined for historical analysis. Since it’s unlikely this data will be needed for immediate use cases, parsing and normalizing it upfront may waste processing resources (a toy contrast of the two approaches follows).
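
Here is the toy contrast mentioned above, showing where the parsing cost is paid in each approach. The helpers and field names are assumptions.

```python
# Schema on write parses at ingest; schema on read parses at query time.
import json

def ingest_schema_on_write(raw: str, hot_store: list) -> None:
    """Pay the parsing cost at ingest so detections query structured fields directly."""
    hot_store.append(json.loads(raw))

def query_schema_on_read(raw_archive: list, user: str) -> list:
    """Leave cold data raw; parse only when a historical question is asked."""
    return [e for e in (json.loads(raw) for raw in raw_archive) if e.get("user") == user]

hot, archive = [], ['{"user": "alice", "action": "delete_key"}']
ingest_schema_on_write('{"user": "bob", "action": "assume_role"}', hot)
print(query_schema_on_read(archive, "alice"))  # -> [{'user': 'alice', 'action': 'delete_key'}]
```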

The option to choose schema on read or schema on write for different threat scenarios and time frames gives you the flexibility to keep detections effective while keeping TCO low. 

Running Pipeline Functionality To Ground


Now that we’ve covered detailed tactical considerations, let’s put on our security engineer hat and dig into a couple example scenarios.

Scenario A: Filtering Noise to Stop Classification Errors

Situation: I want to be alerted when an unexpected port opens on a server, as this event could indicate a malware exploit, unauthorized access, or another security threat. Netstat is configured to monitor the server, and Fluent Bit collects, aggregates, and forwards the logs, which include netstat’s command headers. Noisy data in those headers is causing classification failures: a property my netstat schema expects to be an integer is instead receiving a string.

Action: The netstat command headers aren’t relevant to my detections, so they’re a perfect candidate for filtering. I set up two Raw Event Filters to drop the header lines that were being mis-captured as address values by my schema, before they consume processing bandwidth downstream.  
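
The filters themselves are configured in the pipeline, but their predicate logic amounts to something like the sketch below. The two header patterns reflect typical Linux netstat output and may differ on your platform.

```python
# Drop netstat command headers before they hit the schema parser.
# Header prefixes reflect typical Linux netstat output (an assumption).

NETSTAT_HEADER_PREFIXES = (
    "Active Internet connections",        # first header line
    "Proto Recv-Q Send-Q Local Address",  # column header line
)

def keep_raw_event(raw_line: str) -> bool:
    """Return False for netstat header lines, True for real connection rows."""
    return not raw_line.lstrip().startswith(NETSTAT_HEADER_PREFIXES)

print(keep_raw_event("Proto Recv-Q Send-Q Local Address Foreign Address State"))  # False
print(keep_raw_event("tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN"))                      # True
```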

Result: The classification failures caused by the noisy netstat command headers are resolved. By filtering out the noise, my detections on unexpected port activity work as expected, and I can validate that I’ll be alerted in the event of an exploit. Overall data quality in my SIEM has improved. 

Scenario B: Transforming Logs for Simpler Detections & Queries

Situation: I’m still monitoring for unexpected ports opening. Inbound logs frequently have an “address” property that includes information for both port and IP address. With this combined field, my detection rules must include logic to separate out the port information, and having to perform this analysis for each detection is beginning to slow down performance. 

Action: I add a Starlark script to my schema that performs a split transformation on the incoming address property at ingestion time. This separates the combined “address” field into distinct “port” and “IP” fields to support more granular analysis.
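
Starlark shares Python’s core syntax, so a minimal version of that split reads much like plain Python. The transform(event) entry point and field names below are assumptions for illustration, not a specific product’s API.

```python
# Split a combined "address" field into "ip" and "port".
# transform(event) and the field names are illustrative assumptions.

def transform(event):
    address = event.get("address", "")
    if ":" in address:
        ip, port = address.rsplit(":", 1)
        event["ip"] = ip
        event["port"] = int(port)
    return event

print(transform({"address": "192.168.1.20:8443"}))
# -> {'address': '192.168.1.20:8443', 'ip': '192.168.1.20', 'port': 8443}
```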

Result: Now my detection logic can be simplified to just identify whether an open port exists, without extra steps to identify the port attribute within the combined address field. This transformation improves the speed and overall performance of my detections, making it easier to monitor for unexpected port openings. It also streamlines query logic to search for activity related to the port.  


Scenario C: Filtering Common Ports to Reduce Costs

Situation: Port 47760 is used by Zeek, which runs on my servers to monitor them, so that port is expected to be open on every production server with Zeek installed. Due to budget constraints, I want to filter these expected open-port events out. 

Action: I set up a Normalized Event Filter that excludes log data on port 47760. Since this information is no longer included in ingested data, I’m also able to simplify my detection by removing the port from the list of expected ports in the detection logic.
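
The exclusion lives in the pipeline configuration, but its logic boils down to a predicate like the one below; the normalized field name “port” is an assumption.

```python
# Exclude normalized events for the port Zeek is known to use.
# The "port" field name is an assumption about the normalized schema.

ZEEK_MANAGEMENT_PORT = 47760  # expected on every production server running Zeek

def keep_normalized_event(event: dict) -> bool:
    """Drop open-port events for the known Zeek port; keep everything else."""
    return event.get("port") != ZEEK_MANAGEMENT_PORT

print(keep_normalized_event({"host": "prod-web-01", "port": 47760}))  # False
print(keep_normalized_event({"host": "prod-web-01", "port": 6667}))   # True
```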

Result: Filtering out this low-value data has reduced costs by decreasing the overall volume of data ingested. Simplifying the detection logic by excluding this expected port also improves the fidelity and speed of my detections and alerts, making it easier to identify and respond to potential threats on other, less commonly used ports.

Striking It Rich with Security Insight

Between your priority logs and the combinations of routing, filtering, and transformation capabilities your use cases require, there are practically infinite security pipeline scenarios. 

If data is the new gold, striking it rich with security insight requires some mining. So regardless of what your project requires, the pipeline processes and functionality you implement will go a long way in determining the overall success of your SecOps program. 

If you’re interested in how Panther can power your high-scale ingestion pipeline workflows with parsing, normalization, filtering, and transformations, request a demo.
