AI Detection Engineering: How AI Helps Write, Test, and Tune Detection Rules

A new TTP lands in a threat intel feed Monday morning. By Friday, your team has a draft Sigma rule, a unit test suite, two rounds of tuning against staging logs, and a pull request waiting for review. The rule is good. The problem is that twelve more landed this week, and the backlog from last month is still open.

This is the gap detection engineering keeps running into. The detection-as-code foundation (rules in Python or YAML, tested with unit tests, version-controlled in Git, deployed through CI/CD) solves the deployment problem cleanly. It doesn't solve the authoring problem: 88% of SOC teams saw an increase in alerts in 2025, and 64% still report detection, triage, and investigation as heavily manual work.

AI is starting to compress the slowest parts of that cycle: drafting rules from threat intel, generating tests that catch evasion attempts and false positive storms before deployment, and clustering production false positives so tuning targets the source instead of the symptom. This article covers where AI pulls its weight in each of those phases. It also covers where humans still need to own the call.

Key Takeaways:

  • AI can translate threat intelligence into draft detection rules in Python or YAML, but those rules need human validation against real telemetry before production.

  • Testing benefits from AI at three layers: auto-generated unit tests, adversarial evasion cases, and historical log replay to catch false positive storms before deployment.

  • ML clustering of false positives and AI-suggested filter conditions speed up production tuning, but only when analysts provide structured feedback on why alerts are false positives.

  • The detection-as-code foundation (Git, CI/CD, code-based rules) is the prerequisite; without it, AI-generated rules have no path into production.

What AI detection engineering actually means

AI detection engineering applies LLMs and generative AI to the work of writing, testing, deploying, and maintaining detection rules. You're still authoring Sigma rules, Python functions, KQL queries, and YAML configs, still mapping to MITRE ATT&CK. AI speeds up parts of that cycle: generating rule code from natural language, translating rules across query languages, suggesting tuning adjustments, and building test cases from rule logic.

Earlier ML-in-security approaches focused on anomaly scoring and behavior analysis. Newer generative systems can draft detection logic directly. LLM-based conversion workflows can generate and convert detection logic in ways earlier ML systems could not.

Why this is the moment AI starts pulling its weight in detection work

AI use cases in detection engineering have crossed from speculative to practical. We see this in customer workflows daily: rule drafting, test generation, false-positive clustering, and cross-language conversion are all working in production today. The research community has caught up on documenting the same patterns, but the shift is already happening on the ground.

Detection-as-code workflows also align with broader operational pressure. The average organization now runs 45 cybersecurity tools. Maintaining rules across all of them by hand stopped scaling.

Writing detection rules with AI

AI helps most when you need to get from an idea to a draft rule quickly. The question is where that speed actually shows up. The next three workflows break rule creation into drafting from threat hypotheses, generating code from natural language, and converting logic across platforms.

1. Translating threat hypotheses into detection logic

Fine-tuned LLMs can turn threat intelligence into draft detection logic faster than manual authoring alone. A common pattern shows up in practice: AI performs best when grounded in concrete schemas, examples, and validated rule structures. It's less reliable when asked to produce generic production-ready logic without that context.

Common failure modes include incorrect log source, wrong EventID usage, overly generic rules, and incorrect ATT&CK references. As James Nettesheim, CISO at Block, puts it, "We still want a human in the loop overall. We're extremely bullish on adopting agentic coding and analysis."

One published pipeline shows the approach end-to-end: ingest a threat intelligence blog, extract attack patterns, map them to MITRE ATT&CK techniques, and generate Sigma rules in YAML with correct ATT&CK tags. The key design choice is fine-tuning on real-world Sigma rules rather than prompting a generic model, which directly addresses failure modes like incorrect log sources and wrong EventIDs.
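The failure modes above (wrong log source, malformed ATT&CK references) are cheap to catch mechanically before a human ever reviews a draft. The sketch below is a minimal pre-review lint for a Sigma-style rule that has already been parsed into a dictionary. The allowlist `KNOWN_LOGSOURCES` and the function name `lint_draft` are assumptions for illustration, not part of Sigma or of any published pipeline, and the tag check only validates format, not whether the technique ID exists.

```python
import re

# Assumption: log sources your pipeline actually ingests (illustrative values).
KNOWN_LOGSOURCES = {
    ("windows", "process_creation"),
    ("windows", "security"),
    ("aws", "cloudtrail"),
}
# Sigma-style technique tags look like attack.t1059 or attack.t1059.001.
ATTACK_TAG = re.compile(r"^attack\.t\d{4}(\.\d{3})?$")
REQUIRED_KEYS = ("title", "logsource", "detection", "tags")


def lint_draft(rule: dict) -> list[str]:
    """Return a list of problems found in an AI-drafted, Sigma-style rule."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in rule]

    src = rule.get("logsource", {})
    key = (src.get("product"), src.get("category") or src.get("service"))
    if key not in KNOWN_LOGSOURCES:
        problems.append(f"unknown logsource: {key}")

    if not any(ATTACK_TAG.match(t) for t in rule.get("tags", [])):
        problems.append("no well-formed ATT&CK technique tag")

    if "condition" not in rule.get("detection", {}):
        problems.append("detection has no condition")
    return problems
```

An empty list means the draft is structurally plausible, nothing more. Whether the logic matches real attacker behavior is still a human call.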

The practical appeal is the analyst effort this workflow saves when turning a security blog into a batch of Sigma rules.

2. Generating rules in Python, SQL, or YAML from natural language

Natural-language prompting works best when the model is constrained by your actual rule schema and runtime. Foundation models are strong at code generation, which is an advantage when detection rules are written in Python.

Panther AI uses Detection Builder to generate detection code and add test cases from natural language prompts. A documented example shows a prompt specifying the behavior to detect, log source, threshold, time window, exclusions, and severity, producing a deployable Python rule and YAML config. That approach aligns with Panther's public description of its collaboration with Block's security team to use AI and natural-language workflows for creating detections.

Prompt quality matters as much as model quality. The surrounding schema, required methods, and expected attributes determine whether generated code aligns with the rule runtime and available fields. In practice, specifying the log schema is a major quality lever because it tells the LLM which fields the rule() function can actually reference.
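To make the schema point concrete, here is a minimal sketch of the kind of Python rule such a prompt might produce. It assumes a runtime that calls `rule(event)` once per log event, and it uses plain dictionaries in place of a real event object. The field names follow AWS CloudTrail console sign-in events. This is an illustration, not actual Detection Builder output.

```python
# Assumption: the runtime calls rule(event) per event and title(event) when
# an alert is created. Plain dicts stand in for the platform's event object.

def rule(event: dict) -> bool:
    """Fire on a successful AWS console login that did not use MFA."""
    if event.get("eventName") != "ConsoleLogin":
        return False
    # CloudTrail can send null for these objects, so fall back to {}.
    response = event.get("responseElements") or {}
    extra = event.get("additionalEventData") or {}
    succeeded = response.get("ConsoleLogin") == "Success"
    no_mfa = extra.get("MFAUsed") == "No"
    return succeeded and no_mfa


def title(event: dict) -> str:
    """Build a human-readable alert title from fields in the event."""
    identity = event.get("userIdentity") or {}
    return f"Console login without MFA by {identity.get('arn', '<unknown principal>')}"
```

Every field the function reads has to exist in the log schema given to the model. A rule that references a field the schema lacks never fires, and nothing in review will flag it unless someone checks.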

3. Converting rules across query languages without a full rewrite

Cross-language conversion is more reliable when the LLM handles intent and deterministic tooling handles syntax. One pattern in this space separates intent generation from syntax translation.

SigmAIQ, a pySigma wrapper and LangChain toolkit, takes this approach: the LLM defines the detection intent and pySigma's deterministic conversion backends handle the target-platform syntax. Splitting the two steps reduces hallucinated query structure, which is the main failure mode when asking an LLM to translate query languages directly.
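The split between intent and syntax can be shown with a toy example. In the sketch below, `Intent` stands in for the structured output you would ask an LLM to produce, and two deterministic renderers turn it into query fragments. The renderers and their output dialects are simplified assumptions for illustration. They are not pySigma backends and do no escaping of special characters.

```python
from dataclasses import dataclass

ALLOWED_OPS = {"equals", "contains"}


@dataclass(frozen=True)
class Intent:
    """Structured detection intent (hypothetical shape for LLM output)."""
    field: str
    op: str
    value: str

    def __post_init__(self):
        # Reject anything outside the small vocabulary the renderers know.
        if self.op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {self.op}")


def to_spl_like(intent: Intent) -> str:
    """Render a Splunk-style field match (simplified, no escaping)."""
    pattern = intent.value if intent.op == "equals" else f"*{intent.value}*"
    return f'{intent.field}="{pattern}"'


def to_sql_like(intent: Intent) -> str:
    """Render a SQL-style predicate (simplified, no escaping)."""
    if intent.op == "equals":
        return f"{intent.field} = '{intent.value}'"
    return f"{intent.field} LIKE '%{intent.value}%'"
```

Because the model only fills in `field`, `op`, and `value`, it has no opportunity to invent query syntax. Anything it emits outside the allowed vocabulary fails validation instead of reaching a SIEM.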

Testing detection rules before they ship

Testing tells you whether a rule is safe to deploy, not just whether the logic looks plausible in review. These three layers cover logic correctness, evasion resistance, and production alert volume before rollout.

1. Auto-generating unit tests from the rule itself

AI can generate positive and negative test cases directly from rule logic, which closes the gap between "it looks right" and "it works in production."

Positive cases confirm the rule fires on malicious samples; negative cases confirm it doesn't fire on benign ones. Panther's MCP integration with Cursor supports detection engineering workflows that can include unit testing, while Cursor rules configuration can encode organizational standards into the AI's behavior. Human effort shifts from test authoring to test validation.
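A minimal sketch of what generated positive and negative cases look like in practice, assuming a rule exposed as a plain `rule(event)` function. The rule, the case names, and the `run_cases` helper are all hypothetical. The point is the shape a reviewer validates: one event, one expected outcome, one readable name.

```python
def rule(event: dict) -> bool:
    """Hypothetical rule under test: any activity by the AWS root account."""
    return (event.get("userIdentity") or {}).get("type") == "Root"


# (name, event, expected result). Positive cases expect True, negative False.
TEST_CASES = [
    ("root API call fires", {"userIdentity": {"type": "Root"}}, True),
    ("IAM user does not fire", {"userIdentity": {"type": "IAMUser"}}, False),
    ("event without identity does not fire", {"eventName": "GetObject"}, False),
    ("null identity does not fire", {"userIdentity": None}, False),
]


def run_cases(rule_fn, cases):
    """Return the names of cases whose actual result differs from expected."""
    return [name for name, event, expected in cases if rule_fn(event) is not expected]
```

An empty result from `run_cases(rule, TEST_CASES)` means every case passed. Reviewing a table like this is much faster than writing it, which is where the shift from test authoring to test validation comes from.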

2. Building adversarial test cases that try to break your detection

Adversarial test generation helps you find how a rule can fail before an attacker does. LLMs can generate adversarial log messages designed to evade your detection rules, producing contextually varied evasion attempts tailored to specific threat scenarios.

End-to-end pipeline validation is a useful extension of the same idea. One documented implementation runs adversarial tests against a staging environment, with a correlation layer matching resulting alerts against registered tests using time window, actor identity, and rule ID. The correlation step is what keeps test-generated alerts out of production queues.
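The correlation step described above can be sketched in a few lines. The version below assumes each adversarial test is registered with the rule it targets, the actor it runs as, and a start time. The class names, the field names, and the ten-minute default window are illustrative choices, not details of the documented implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RegisteredTest:
    rule_id: str
    actor: str
    started_at: datetime


@dataclass(frozen=True)
class Alert:
    rule_id: str
    actor: str
    created_at: datetime


def is_test_alert(alert, tests, window=timedelta(minutes=10)) -> bool:
    """True if the alert matches a registered test on rule ID, actor, and time."""
    return any(
        t.rule_id == alert.rule_id
        and t.actor == alert.actor
        and timedelta(0) <= alert.created_at - t.started_at <= window
        for t in tests
    )
```

Matching on all three attributes matters. An alert for the same rule from a different actor, or from the test actor outside the window, should still reach analysts.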

3. Replaying historical logs to confirm a rule fires (and doesn't over-fire)

Historical replay is the fastest way to see whether a new rule will flood analysts with false positives in your environment. New rules tested only against synthetic data miss patterns unique to your production environment: scheduled jobs, automation accounts, patch cycles.

Historical replay runs new rules against real production logs to measure alert volume before deployment. Without replay, those same environment-specific patterns can trigger a flood of false positives that buries genuine threats.
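As a rough sketch, replay amounts to running the rule over stored events and counting what would have fired. The example assumes each historical event carries a `date` string, and that your team has agreed on a per-day alert budget for the rule. The rule, the field names, and the budget are hypothetical.

```python
from collections import Counter


def rule(event: dict) -> bool:
    """Hypothetical rule under test: any login by a service account."""
    return event["actor"].startswith("svc-") and event["action"] == "login"


def replay(rule_fn, events, daily_budget):
    """Run rule_fn over historical events.

    Returns the would-be alert count per day and the sorted list of days
    on which the count exceeds daily_budget.
    """
    per_day = Counter(e["date"] for e in events if rule_fn(e))
    over_budget = sorted(day for day, n in per_day.items() if n > daily_budget)
    return per_day, over_budget
```

Days that blow the budget usually point at the environment-specific patterns synthetic data misses, such as a backup job that logs in every night.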

Tuning detection rules in production with AI

Production tuning is where AI can save the most analyst time because this is where repeated false positives absorb the most effort. The next three sections cover the tuning loop: grouping repeated false positives, proposing suppressions or filters, and learning from analyst feedback.

False positives still dominate production detection work. Recent survey data puts the figure at 73%, and customer data from our own platform shows the same pattern. Docker cut false positives by 85% and Snyk by 70% after moving to detection-as-code with AI-assisted tuning.

1. Clustering false positives to find the real source

Clustering makes repeated false positives easier to tune systematically. It groups false positive alerts by the attributes they share, so you spot categories of repeated noise instead of investigating each alert one by one.

A specific subnet, service account, or scheduled task generating the same alert class repeatedly appears as a dense cluster rather than undifferentiated volume, which lets you address the source at the rule or filter level instead of handling alerts one at a time.
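The simplest version of this idea is a group-by on shared attributes. The sketch below is not an ML clustering algorithm, and the attribute names are assumptions about what your alerts carry. It is enough to show how a dense cluster stands out from background volume.

```python
from collections import Counter


def top_clusters(false_positives, keys=("rule_id", "actor", "source_subnet"), n=3):
    """Group false positives by shared attribute values.

    Returns up to n (attribute_values, count) pairs, largest group first.
    """
    counts = Counter(tuple(fp.get(k) for k in keys) for fp in false_positives)
    return counts.most_common(n)
```

If one service account on one subnet accounts for most of a rule's false positives, that tuple lands at the top of the list, and a single filter at the rule level addresses all of them.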

2. Suggesting filter conditions and suppression logic

After an analyst finishes investigating, AI agents can suggest rule changes based on what the analyst found. Use tiered suppression: reduce alert severity rather than eliminating alerts entirely, and review samples of suppressed alerts on a defined schedule.

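A suppression that wrongly hides true positives won't surface through normal analyst workflows, because suppressed alerts never reach a queue. Scheduled sampling is the only way those misses get noticed.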
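A minimal sketch of tiered suppression, under the assumption that alerts are dictionaries with `id` and `severity` fields and that a filter is any function returning True for alerts it covers. The severity scale, the 5% default sample, and the function names are illustrative. The hash-based sampler is one way to make the review sample reproducible.

```python
import hashlib

SEVERITY_ORDER = ["INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def downgrade(severity: str, steps: int = 1) -> str:
    """Lower severity by `steps` levels, never below the lowest level."""
    idx = SEVERITY_ORDER.index(severity)
    return SEVERITY_ORDER[max(0, idx - steps)]


def in_review_sample(alert_id: str, percent: int = 5) -> bool:
    """Deterministically pick roughly `percent`% of alerts for scheduled review."""
    first_byte = hashlib.sha256(alert_id.encode()).digest()[0]
    return first_byte * 100 // 256 < percent


def apply_tiered_suppression(alert: dict, matches_filter) -> dict:
    """Downgrade instead of dropping, and flag a sample for human review."""
    if not matches_filter(alert):
        return alert
    return {
        **alert,
        "severity": downgrade(alert["severity"]),
        "suppressed": True,
        "review": in_review_sample(alert["id"]),
    }
```

Because the alert is downgraded rather than deleted, a bad filter costs visibility, not evidence. The sampled alerts give reviewers a fixed, repeatable slice to check on schedule.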

3. Closing the loop with feedback from analyst dispositions

Structured analyst feedback is what turns tuning from one-off cleanup into a repeatable improvement loop. Capturing analyst dispositions and feeding them back into detection tuning measurably reduces false positive rates. One peer-reviewed study of SOCs handling 4,000 alerts daily found that adding a feedback loop cut investigation time from 12.5 to 9.8 minutes per alert and dropped the false positive rate from 65% to 50%.

Capturing "closed as false positive" without preserving why produces a lower-quality training signal, so the tooling has to make structured feedback easy.
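One way to make structured feedback the default is to reject a false positive disposition that has no reason attached. The sketch below assumes a small, fixed vocabulary of reasons. The categories and field names are hypothetical, so swap in whatever your analysts actually see.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FPReason(Enum):
    """Illustrative reasons an alert was closed as a false positive."""
    KNOWN_AUTOMATION = "known_automation"
    APPROVED_CHANGE = "approved_change"
    RULE_TOO_BROAD = "rule_too_broad"
    BAD_ENRICHMENT = "bad_enrichment"


@dataclass(frozen=True)
class Disposition:
    alert_id: str
    rule_id: str
    is_false_positive: bool
    reason: Optional[FPReason] = None

    def __post_init__(self):
        # "Closed as false positive" with no reason is the weak signal to avoid.
        if self.is_false_positive and self.reason is None:
            raise ValueError("a false positive disposition needs a reason")


def reasons_by_rule(dispositions):
    """Count false positive reasons per rule to prioritize tuning work."""
    return Counter((d.rule_id, d.reason) for d in dispositions if d.is_false_positive)
```

A rule dominated by `RULE_TOO_BROAD` needs its logic rewritten, while one dominated by `KNOWN_AUTOMATION` needs an exclusion. Without the reason, those two cases look identical.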

Where AI fits into a detection-as-code workflow

AI-assisted rule generation only works cleanly in production when your pipeline already treats detection rules like software. This section covers the control points that matter most: Git, pull requests, CI, and the interfaces AI tools use to connect to those systems.

If your rules don't already live in Git, deploy through CI/CD, and exist as code, AI detection engineering has no foundation to build on. Many SOCs adopt AI/ML tools with little or no customization, and the outcomes show it. Integration into existing operations is what separates the teams getting value from the teams getting demos.

As Stephen Gubenia, Head of Detection Engineering for Threat Response at Cisco Meraki, has discussed, successful AI adoption in security operations still depends on strong processes, training, and operational foundations.

Bringing AI into Git, pull requests, and CI

Human approval is the control point that makes AI-assisted rule generation safe in production. The safety pattern showing up across teams is consistent: AI authors, tests, and submits detection rules, but a human approves the pull request before deployment.

That human approval gate is what keeps AI-assisted rule generation compatible with production change control.
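In CI terms, the gate reduces to a small policy check. The sketch below assumes your Git platform can report a pull request's author, approvers, and test status. The `PullRequest` shape and the bot account name are made up for illustration.

```python
from dataclasses import dataclass, field

# Assumption: accounts used by AI agents or automation in your Git platform.
BOT_ACCOUNTS = {"detection-bot"}


@dataclass
class PullRequest:
    author: str
    tests_passed: bool
    approvers: list = field(default_factory=list)


def can_deploy(pr: PullRequest) -> bool:
    """Allow deployment only with passing tests and a human approval.

    An approval counts only if it comes from someone who is neither a bot
    nor the author of the change.
    """
    human_approvers = [
        a for a in pr.approvers if a not in BOT_ACCOUNTS and a != pr.author
    ]
    return pr.tests_passed and len(human_approvers) >= 1
```

Excluding the author prevents a human from self-approving their own change. Excluding bot accounts prevents an agent from approving what another agent wrote.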

Agentic detection engineering with MCP servers

The Model Context Protocol (MCP) is an open standard for connecting AI tools to external systems through a single interface, replacing custom API code for each SIEM and Git platform. It gives AI tools a standard way to connect to the systems detection engineers already use.

An MCP integration lets analysts write detection rules, investigate alerts, and query logs from AI agents, with organizational standards baked in, so generated output conforms to team conventions.
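Under the hood, MCP messages are JSON-RPC 2.0, and a client invokes a server-side tool with a `tools/call` request. The sketch below only builds that request as a string. The tool name `query_logs` and its arguments are hypothetical, since each MCP server defines its own tools. In practice an MCP client library handles this serialization and the transport for you.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry an ID the response echoes


def tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Hypothetical tool and arguments, for illustration only.
request = tool_call("query_logs", {"log_type": "cloudtrail", "last_hours": 24})
```

The value of the standard is that the same request shape works whether the tool behind it searches logs, opens a pull request, or fetches an alert.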

The limits of AI in detection engineering, and where humans stay in the loop

AI-generated detection rules can look structurally valid while embedding incorrect assumptions about attacker behavior or log field semantics that only surface during a real incident. LLMs can produce confident but incorrect outputs: fabricated indicators of compromise, incorrect ATT&CK mappings, and false correlations between unrelated events.

An AI system has no knowledge of your business cycles, operational patterns, or organizational norms. Service accounts that routinely trigger credential-based detections by design require exception handling that depends on institutional knowledge no LLM can access.

AI performance can be weaker for cloud infrastructure detections, so peer review is especially important for those rules. The harder failure mode is quiet drift: AI agents that gradually shift what they suppress, deprioritize, or auto-resolve, without an obvious signal to the analyst.

Automation bias makes that kind of drift easy to miss, and in detection work, it's the reason analysts need to audit what AI is suppressing, not just what it escalates. We build Panther AI with this in mind: every enrichment, decision, and suppression action is auditable, and human approval is required for write operations.
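Auditing what gets suppressed can start with one number per week. The sketch below assumes you can export each week's AI actions as a list of labels, and it flags any week whose suppression rate exceeds the average of the preceding weeks by more than a set margin. The labels, the four-week baseline, and the ten-point threshold are arbitrary starting values, not recommendations.

```python
def suppression_rate(actions):
    """Fraction of actions in one period that were suppressions."""
    if not actions:
        return 0.0
    return sum(a == "suppress" for a in actions) / len(actions)


def drift_weeks(weekly_actions, baseline_weeks=4, max_increase=0.10):
    """Return indexes of weeks whose suppression rate exceeds the mean of
    the preceding `baseline_weeks` weeks by more than `max_increase`."""
    rates = [suppression_rate(week) for week in weekly_actions]
    flagged = []
    for i in range(baseline_weeks, len(rates)):
        baseline = sum(rates[i - baseline_weeks:i]) / baseline_weeks
        if rates[i] - baseline > max_increase:
            flagged.append(i)
    return flagged
```

A trailing baseline catches jumps but adapts to slow drift, so it will miss a rate that creeps up a little each week. Comparing against a fixed reference period as well covers that case.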

The division of labor is straightforward. AI writes first drafts, suggests tuning, and generates tests. Humans validate against real telemetry, own environment-specific exceptions, lead threat hunting for novel TTPs, and approve every production deployment.

Putting AI-assisted detection engineering into practice

AI detection engineering works when you treat AI output like a junior engineer's pull request: useful, often surprisingly good, but requiring review before it touches production. The teams getting real value already have detection-as-code infrastructure and use AI to compress the slowest parts of their workflow, from first-draft rules to unit test generation to false positive clustering, while keeping humans in the approval loop.

Panther AI maps directly to this workflow: AI assistance for writing and tuning detection rules, with human-in-the-loop approval for write operations before changes are applied. If you're looking to compress detection engineering cycles without sacrificing rigor, Panther is worth exploring.
