Join us on November 21st for
a webinar dedicated to all things PantherFlow!
—
Avoiding the complexities of SQL while retaining its powerful search capabilities is exactly why we created PantherFlow, Panther’s piped query language. Building PantherFlow is a big step for us, so we wanted to spend some time talking about how we got here and why we think PantherFlow is great.
Yes, PantherFlow is indeed another vendor-specific query language, and we’ll explain why it was the best decision to improve Panther’s search experience. We’ll cover our key design goals for developing an intuitive query syntax that simplifies threat investigation, why piped query languages are so effective, and the flexibility of modeling security data as semi-structured JSON. Let’s get into it.
Before PantherFlow there were two ways to interact with your data in Panther: SQL or an interactive Search feature. Each has pros and cons. With SQL you have the raw power to do exactly what you want. However, this comes with little hand-holding. For example, you must provide an ORDER BY
or your data will not be sorted. And without a LIMIT
you could wait minutes for your query to complete. These are only two common examples of SQL’s complexities—there are many more. And of course, you actually need to know how to write SQL, which is not a common skill in the security space.
With our Search feature you get a lot of bells and whistles:
However, there is one major drawback: you can only perform basic filtering operations. You cannot transform, aggregate, join, or make function calls.
This created a gap for users who found SQL too cumbersome, yet considered the Search interface too limiting. The core challenge became: “How can we provide users with the expressive power they desire without inheriting SQL’s complexities?”
Ultimately we decided the best way to resolve this was by embedding our own pipelined query language into Panther’s Search experience. PantherFlow was the result.
Our experience with other query languages showed us that using pipelined query languages results in a more intuitive and approachable experience. In piped search, your data logically flows through the pipeline of operators; the output from one is the input to the next. This allows for complex queries to be built up from short, digestible statements that flow from top to bottom. Reading and writing queries in this way is more accessible to users of all skill levels.
To demonstrate this, we’ll look at the same query written in both PantherFlow and SQL, starting with PantherFlow:
aws_cloudtrail
| where p_event_time > time.ago(1d)
and eventName == 'ConsoleLogin'
and userRole == 'admin'
| extend username = strings.split(email, '@')[0]
| join kind:leftouter user_data=(username_info)
on $left.username == $right.username
| where user_data.job_family == 'executive'
| summarize agg.count() by username
Code language: plaintext (plaintext)
In PantherFlow, each step in the query is applied in order. This makes it simple to understand what the query is doing by breaking it down into discrete, logical steps.
In the above query, the first step references the source of the data: the aws_cloudtrail
table. We then apply some filters with where
to scope the date range and events. We extract the username into a new field using extend
, which we use to join
with some enrichment data. Afterward, we filter our search further by job family. Finally, we reduce the results to a count of actions by username
.
Now let’s look at the SQL for the same query:
SELECT
email.split('@')[0] AS username,
COUNT(*)
FROM aws_cloudtrail
LEFT OUTER JOIN username_info ON username
WHERE p_event_time > current_timestamp - interval '1 day'
AND eventName = 'ConsoleLogin'
AND userRole = 'admin'
AND username_info.job_family = 'executive'
GROUP BY username
Code language: SQL (Structured Query Language) (sql)
With SQL, the query is written in a different order than the steps that are logically applied. To start, the data source is listed in the FROM
clause in the middle of the query. Furthermore, the results of the aggregation are governed by both the SELECT
clause at the top and the GROUP BY
clause at the bottom. Also, the first two conditions in the WHERE
clause are applied before the join, and the final condition is applied after.
The next image visualizes the difference in execution order between PantherFlow and SQL. On the left, each step in the PantherFlow query is colored in the order of the ROYGBIV rainbow. On the right, these same colors are applied to the corresponding steps in the SQL query.
Of course, any query language you don’t know is going to be foreign, but what makes PantherFlow and other pipelined languages approachable is the way the query is structured: a logical ordering, starting with the data source and applying operators like filters, joins, and transformations as sequential building blocks. This way of structuring queries makes them intuitive to write and easy to read.
We built PantherFlow to wrap our existing SQL data lake, but one of our key design goals was to allow users to easily query multiple tables, even those with different structures. In SQL this is typically quite a challenge, but in PantherFlow the union
operator lets you query many tables at once simply by listing the names of the tables, even those with wildly different schemas.
To make this work, PantherFlow presents a semi-structured data model. In contrast to a SQL query—where data is retrieved from structured tables and returned as a tabular result—PantherFlow retrieves semi-structured data and returns semi-structured data. As a result, PantherFlow doesn’t require data to have a fixed schema.
This flexibility powers the union
operator mentioned above, but it also maps more naturally to security data which is almost ubiquitously JSON. In fact, in PantherFlow you can even use JSON data as input to your query with the datatable
operator.
As we look to expand PantherFlow to more areas of our product, its semi-structured nature gives us the flexibility to apply PantherFlow to places where schemas may not be practical or available.
We know, PantherFlow is “yet another vendor-specific query language.” There have been many popping up recently including: ES|QL, PQL, and TQL. This begs the question, are pipelined query languages just in vogue or is the industry coming to some kind of shared recognition that they are excellent tools to interact with and explore data? We feel that pipelined query languages’ power, flexibility, and ease of use put them firmly in the latter camp and that they have staying power.
With all these pipelined query languages coming onto the scene, could we have adopted an existing one instead of building our own? We tried but nothing met our use case. We looked into several including:
None of these had everything we were looking for, namely: a Go toolkit for processing queries and generating SQL, support for semi-structured data, and ease-of-use; PQL, which is very similar to PantherFlow, was not available at the time we started development.
That’s not to say we did not take inspiration from these sources. In fact, many of PantherFlow’s operators and overall syntax were heavily inspired by KQL.
Despite all of the effort that goes into designing and developing our own language, building from the ground up allows us to take it in the direction that we think is best for threat hunting and incident response. Let’s walk through just a few examples of how PantherFlow improves search.
When designing PantherFlow we wanted to have one—and only one—obvious way of doing things. For example, unlike some languages, we do not allow for operator aliases, such as using both sort
and order
to do the same thing. While some might argue that aliases give more flexibility to users we believe that this increases the mental overhead when reading unfamiliar code.
It’s really important to us that users have a way to quickly perform text searching across their logs for those times when you just don’t know what field a value might be in. For this use case we created the search
operator which lets you perform a raw text search against logs without specifying an explicit column.
In SQL, null handling is a perennial source of confusion, for example the expression foo = NULL
will always evaluate to False; instead, you must remember to say foo IS NOT NULL
. In contrast, PantherFlow treats null
as just another value that you can compare with, so foo == null
does exactly what you think it should.
Some languages use a mixture of both infix operators (e.g. foo contains 'str'
) and function calls. In our experience, this increases the mental overhead of writing queries because you need to remember whether the thing you are trying to do is a function call or an infix. In PantherFlow, the only infix operations are in
and between
, everything else is a function call. To lower this mental overhead even more, we group PantherFlow’s functions into namespaces like agg
for aggregations and strings
for string functions.
There are many more examples of PantheFlow’s flexibility and expressive syntax. So now that you know why we created our own pipelined search language and some of our major design goals and decisions, continue reading in our next blog How to Write Queries in PantherFlow. We’ll deep dive into the specifics of PantherFlow queries, with plenty of code examples for you to digest!
We’re also hosting a webinar on November 21st to showcase PantherFlow. We’ll cover the what and why of PantherFlow, including a demonstration on how to use it to investigate a threat scenario involving an account compromise, data exfiltration, and privilege escalation. Don’t miss out! Register now.