AI Security Operations Center (SOC) Evaluation: 28 Questions to Ask Before You Trust a Vendor

Michelle Dufty

Jun 14, 2026

Every AI SOC vendor pitch sounds the same. Autonomous triage. Massive noise reduction. Analyst hours returned. The demos are polished, the case studies are hand-picked, and the accuracy numbers come from the vendor's own lab.

Meanwhile, nearly half of all security alerts go uninvestigated, and satisfaction with AI/ML ranks last among SOC technologies. "AI SOC Agents" sit at the Innovation Trigger of the 2025 Hype Cycle for Security Operations, and more than 40% of agentic AI projects are predicted to be canceled by the end of 2027.

No one has built a credible benchmark for AI SOC accuracy in production yet. The evaluation is on you, and the cost of getting it wrong shows up as missed threats, wasted budget, and data locked into a platform you'll spend years trying to leave. As George Werbacher, Head of Security Operations at Live Oak Bank, puts it: "I am a very big advocate for AI, but I think that there is that difference between what's hype and then what's real."

This article gives you 28 questions, organized by domain, that separate what an AI SOC platform actually does from what the demo shows.

Key Takeaways:

AI SOC evaluation spans six domains: data foundations, investigation reasoning, detection workflow fit, accuracy and honest limitations, human oversight, and total cost of ownership. Skip any one, and you have a blind spot.
Transparency, explainability, and interpretability are three different things. Ask which one a vendor actually provides.
Three lock-in clocks tick from day one: your data format, your detection logic, and your behavioral baseline (the model of normal activity the AI has learned for your environment).
Set scoring thresholds and weights before reviewing any vendor responses. Setting them after lets you rationalize a favorite.

Why AI SOC Evaluation Needs a Practitioner's Lens

Teams are buying AI SOC tools faster than they're evaluating them. Satisfaction with AI/ML ranks last among SOC technologies, even as adoption accelerates. An evaluation grounded in your own environment closes that gap before you sign.

Questions About the Data Foundation Underneath the AI

Your data foundation determines whether an AI agent can reason over the right evidence without creating new security and portability risks. Start with what the agent can ingest and how that data is parsed. Then look at what happens to it once you hand it over.

What Log Sources, Coverage, and Schema the Agent Depends On

Start with what the agent can actually ingest today:

Which log sources do you natively support today, not on a roadmap? Require the vendor to distinguish parsers they own from community-contributed ones.
What's your process for adding a new log source, and what's the typical turnaround time or SLA?
Does your AI require data to be copied into your proprietary data lake, or can it operate on data where it resides? Panther takes the latter approach, keeping data in your own Snowflake or Databricks instance so the AI agent reasons over data you control. That emphasis on local context matches what Alessio Faiella, Director of Security Engineering and Security Operations at ThoughtSpot, says: "For an AI mechanism to help you, it also has to understand your environment."

How Your Data Is Used, Stored, and Protected

Your telemetry is valuable. Know exactly how it's handled:

Is our security telemetry used to train your AI models? If so, is there an opt-out, and is it verifiable? Telemetry used for model training becomes part of a system other customers can influence. If your data trains the model and another tenant's adversarial activity also trains it, the boundary between your security posture and theirs blurs. Insist on a verifiable opt-out, not a checkbox in a settings page.
How is cross-tenant data isolation enforced architecturally? "Logical separation" without documentation or independent testing is a red flag. If they hedge on specifics, that tells you something.
In which geographic regions is our data stored, processed, and used for AI inference? Make sure the answer covers inference, not just where files sit.

Questions About How the Agent Investigates and Reasons

Investigation quality depends on reasoning you can inspect. The first set of questions covers what the agent does during triage. The second covers whether the analyst can actually see and trust that reasoning in the moment.

What the Agent Actually Does During Triage

These questions reveal whether the AI shows its work:

Can you show me a complete audit log of every data source queried for a specific alert, including sources queried where nothing was found? An AI that only surfaces positive findings can hide its misses.
When the AI closes an alert as benign, what specific data points drove that conclusion, and are they visible in the analyst interface? An explanation buried in a side panel or a separate dashboard gets ignored. If an analyst has to leave the alert to find out why the AI made a call, they won't. The reasoning has to live where the work happens.
Does the system explain triage decisions in terms of ATT&CK techniques and observable behaviors, or in terms of model feature weights? "T1059.001 PowerShell execution with encoded command on a server with no prior PowerShell baseline" is actionable. "Feature weight 0.73" isn't.
How does your system distinguish between administrator activity and attacker activity when both match the same ATT&CK technique? Administrator activity is one of the most common sources of false positives. The same PowerShell execution, the same credential dump, the same lateral movement pattern can come from a sysadmin running a Tuesday script or an attacker living off the land. If the AI can't tell them apart using user context, asset criticality, and behavioral baseline, you'll either drown in false positives or miss real attacks.
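
One way to picture the audit trail the first question asks for is a record per data source consulted, kept even when the query comes back empty. The sketch below is illustrative only: the `SourceQuery` fields, the `summarize_audit` helper, and the source names are assumptions, not any vendor's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceQuery:
    """One data source the agent consulted for an alert (hypothetical schema)."""
    source: str         # illustrative name, e.g. "cloudtrail"
    query: str          # the query text the agent ran
    rows_returned: int  # 0 means the source was checked and nothing was found


def summarize_audit(queries: list[SourceQuery], required: set[str]) -> dict[str, list[str]]:
    """Split an alert's audit trail into hits, empty results, and never-queried sources."""
    queried = {q.source for q in queries}
    hits = {q.source for q in queries if q.rows_returned > 0}
    return {
        "hits": sorted(hits),
        "empty": sorted(queried - hits),            # queried, nothing found
        "not_queried": sorted(required - queried),  # expected, never consulted
    }
```

The "empty" and "not_queried" buckets are the ones to look for in a demo: an interface that only ever shows the "hits" bucket cannot tell you what the agent failed to check.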

How Evidence and Reasoning Are Surfaced to Analysts

Analysts need reasoning in context before they can trust or override it:

Does the AI's explanation incorporate asset criticality, user role, and environmental baseline? Context-blind AI produces context-blind explanations.
Can analysts provide feedback on AI decisions, and how is that feedback incorporated? Without a documented feedback loop, the AI won't adapt to your environment.

For example, Cresta's security team cut triage time by at least 50% using Panther, and analysts can review the system's visible evidence chain, including enrichments, detection logic, related alerts, and pivot queries, before any decision is finalized.

Questions About Detection Logic and Workflow Fit

If the AI doesn't fit how your team already writes and ships detection rules, it adds friction instead of leverage. Start with rule compatibility, then look at how AI-generated rules move through the same review and deployment process your humans use.

How the Agent Works With Your Existing Detection Rules

Detection-as-code maturity varies. These questions tell you how well a vendor fits yours:

Does the platform support Sigma rule ingestion natively, or is manual conversion required? Sigma is an open detection format that your team may already have rules in. Ask which pySigma backends are supported, and whether the platform also supports detection-as-code in Python or YAML.
Can rules be exported back to Sigma format, or are they locked into a proprietary schema?
Can AI-suggested rules enter the existing detection-as-code pipeline (Git PR, CI validation, peer review) without bypassing controls?
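
For teams newer to detection-as-code, a rule in this style is just a small function plus test cases that CI can run on every pull request. The sketch below is generic and does not follow any specific platform's rule API; the event field names (`process_name`, `command_line`) and the `TEST_CASES` layout are assumptions made for illustration.

```python
def rule(event: dict) -> bool:
    """Flag PowerShell launched with an encoded command (ATT&CK T1059.001)."""
    image = str(event.get("process_name", "")).lower()
    args = str(event.get("command_line", "")).lower().split()
    is_powershell = image in {"powershell.exe", "pwsh.exe"}
    # PowerShell accepts abbreviated prefixes of -EncodedCommand, plus the -ec alias.
    has_encoded_flag = any(
        a == "-ec" or (len(a) >= 2 and "-encodedcommand".startswith(a))
        for a in args
    )
    return is_powershell and has_encoded_flag


# Test cases live next to the rule so CI can validate it before merge.
TEST_CASES = [
    ({"process_name": "powershell.exe",
      "command_line": "powershell.exe -NoProfile -enc SQBFAFgA"}, True),
    ({"process_name": "powershell.exe",
      "command_line": "powershell.exe -File backup.ps1"}, False),
    ({"process_name": "cmd.exe", "command_line": "cmd.exe /c dir"}, False),
]


def run_tests() -> bool:
    """Return True only if every test case matches its expected result."""
    return all(rule(event) is expected for event, expected in TEST_CASES)
```

Whether a rule was written by a person or suggested by an AI, it should pass through the same `run_tests`-style gate and the same peer review before deployment.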

How the Agent Fits Into Detection Engineering and CI/CD

Your detection pipeline includes review, maintenance, and debugging, not just deployment:

Which CI/CD systems does the platform integrate with? GitHub Actions, GitLab CI/CD, and Jenkins are common choices among practitioners, and some teams also use Azure DevOps.
Does the platform track rule staleness and support scheduled review workflows? Detection engineering doesn't end at deployment. Rules drift as your environment changes, telemetry shifts, and attacker techniques evolve. A platform that helps you write rules but ignores the maintenance side leaves the hardest work to your team.
Are AI-generated rules distinguishable from human-authored rules in version control? You need to know who, or what, wrote a rule when you're debugging it at 2 AM.
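
Both the staleness and the authorship questions come down to metadata kept alongside each rule. As a minimal sketch, assuming each rule carries an `origin` label and a `last_reviewed` date (both field names, the labels, and the 180-day review window are hypothetical choices, not a standard):

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RuleMeta:
    """Metadata stored with a detection rule in version control (hypothetical)."""
    rule_id: str
    origin: str          # "human" or "ai_generated" (illustrative labels)
    last_reviewed: date


def stale_rules(rules: list[RuleMeta], today: date, max_age_days: int = 180) -> list[str]:
    """Return IDs of rules whose last review is older than the review window."""
    cutoff = today - timedelta(days=max_age_days)
    return [r.rule_id for r in rules if r.last_reviewed < cutoff]


def ai_authored(rules: list[RuleMeta]) -> list[str]:
    """Return IDs of rules labeled as AI-generated."""
    return [r.rule_id for r in rules if r.origin == "ai_generated"]
```

A scheduled CI job could run a check like `stale_rules` and open review tickets, which is one way to test whether a vendor's "scheduled review workflow" is more than a dashboard.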

Questions About Accuracy, Tuning, and Honest Limitations

Accuracy claims only matter if you can see how they were measured and where they fail. Two angles matter here: how the vendor reports performance, and how willing they are to document what the system can't do.

How Accuracy Is Measured and Reported

Aggregate accuracy numbers hide important details:

Can you report true positive and false positive rates separately? A system can achieve high "accuracy" by correctly dismissing benign alerts while missing real threats.
Is a parallel-run mode supported where AI verdicts are compared against analyst conclusions on the same alerts? Run this for 30 to 60 days before you hand over the keys.
How quickly does analyst feedback affect future verdicts on similar alerts?

Docker's security team achieved an 85% false positive reduction while tripling ingestion, but those results are environment-specific. A parallel run in your environment matters.
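
To see why a single accuracy number can mislead, it helps to compute the rates from a parallel run yourself. The sketch below treats the analyst's conclusion as ground truth, which is an assumption (analysts also make mistakes); the function name and input shape are hypothetical.

```python
def parallel_run_rates(pairs):
    """Compare AI verdicts with analyst conclusions on the same alerts.

    pairs: iterable of (ai_says_malicious, analyst_says_malicious) booleans.
    The analyst conclusion is treated as ground truth (an assumption).
    """
    tp = fp = tn = fn = 0
    for ai, analyst in pairs:
        if analyst and ai:
            tp += 1
        elif analyst and not ai:
            fn += 1  # real threat the AI dismissed
        elif ai:
            fp += 1  # benign alert the AI escalated
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
    }
```

With made-up numbers, suppose a run of 100 alerts contains 96 benign alerts (95 closed correctly, 1 escalated) and 4 real threats (1 caught, 3 missed). Accuracy comes out at 0.96, yet the true positive rate is only 0.25: the system missed three of four threats while still looking "96% accurate."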

What the Vendor Admits the Agent Cannot Do

Honest limitations tell you more than polished benchmark slides:

Can you provide a documented case where your system made an incorrect triage decision, and walk through how it was identified and corrected? Any system running at real scale will get things wrong. A vendor who can't show you a failure case is either not measuring or not sharing.
What does the system do when telemetry is incomplete or a required log source is unavailable? Confident verdicts on incomplete evidence produce misses that are harder to catch than an analyst saying "I'm not sure."
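
The behavior to look for on incomplete telemetry can be stated in a few lines: if a source the investigation depends on is missing, the verdict is downgraded rather than asserted. This is a sketch of that policy only, with hypothetical names and verdict labels, not a description of how any product implements it.

```python
def triage_verdict(model_verdict: str,
                   required_sources: set[str],
                   available_sources: set[str]) -> dict:
    """Downgrade to 'inconclusive' when required telemetry is missing."""
    missing = sorted(required_sources - available_sources)
    if missing:
        # Do not pass through a confident verdict built on partial evidence.
        return {"verdict": "inconclusive", "escalate": True, "missing_sources": missing}
    return {"verdict": model_verdict, "escalate": False, "missing_sources": []}
```

During a proof-of-concept, you can test this directly by disabling a log source and checking whether verdicts on affected alerts change or stay confidently the same.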

Questions About Human Oversight, Guardrails, and Response Actions

Human oversight determines how much autonomy you can safely allow. Set the boundary by what the agent can do without approval, and what it needs sign-off for.

Where Humans Stay in the Loop

Approval boundaries should scale with the risk of each action. As Matt Muller, Field CISO at Tines, says, "AI assisted humans are going to be the ones who are most successful. AI with guard rails is going to be, I think, the path forward for the foreseeable future."

Can approval requirements be configured per action type? Read-only enrichment and account disablement shouldn't have the same approval gate. A well-designed system pauses execution before sensitive actions like updating alert status or creating detection rules, then requires analyst sign-off. Every action gets logged.
Is there a configurable confidence threshold below which the AI must escalate to a human? Ask what the default threshold is, whether you can change it, and what the escalation path looks like.
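
The two questions above combine into a simple gate: a per-action policy plus a confidence floor, with every decision logged. The sketch below is a hedged illustration; the action names, the `Gate` enum, and the 0.8 default threshold are all assumptions rather than any vendor's defaults.

```python
from enum import Enum


class Gate(Enum):
    AUTO = "auto"
    NEEDS_APPROVAL = "needs_approval"


# Hypothetical per-action policy; action names are illustrative only.
ACTION_POLICY = {
    "enrich_ip_reputation": Gate.AUTO,
    "update_alert_status": Gate.NEEDS_APPROVAL,
    "create_detection_rule": Gate.NEEDS_APPROVAL,
    "disable_account": Gate.NEEDS_APPROVAL,
}

audit_log: list[dict] = []


def propose(action: str, confidence: float, threshold: float = 0.8) -> Gate:
    """Decide whether a proposed action may run without sign-off, and log it.

    Low-confidence verdicts and unknown actions both fail closed to a human.
    """
    if confidence < threshold:
        gate = Gate.NEEDS_APPROVAL
    else:
        gate = ACTION_POLICY.get(action, Gate.NEEDS_APPROVAL)
    audit_log.append({"action": action, "confidence": confidence, "gate": gate.value})
    return gate
```

The design choice worth probing with a vendor is the default for actions not in the policy: failing closed (as here) is safer than failing open.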

What Autonomous Actions the Agent Can Take

You need the exact list of unsupervised actions before you decide how much autonomy to allow:

What is the complete list of actions the AI can take without human approval? Get an exhaustive list. If the answer describes categories instead of individual actions, push back.
Which autonomous actions are reversible, and which are not? Disabling an account can be undone. Quarantining a host can be undone. But deleting evidence, sending an external notification, or triggering a downstream remediation in a third-party system often can't be. Any action that creates downstream effects you can't unwind should require human approval by default, with the approval logged for audit.

Questions About Cost, Lock-In, and Long-Term Viability

Long-term cost depends on both pricing mechanics and portability risk. The pricing question shows up on the invoice. The portability question shows up when you try to leave.

What the Total Cost of Ownership Actually Includes

The sticker price rarely tells the full story:

What is the pricing metric, and how does our projected bill change if log volume doubles in 18 months? Cloud log volumes grow faster than budgets. A multi-year contract priced for today's ingestion can double or triple in cost before it expires. Ask specifically whether AI/ML compute costs are included in the platform license or metered separately.
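
The volume question is easy to turn into arithmetic before you talk to the vendor. The sketch below assumes pure per-GB ingestion pricing and steady exponential growth that doubles every 18 months; the function name, the 30-day month, and the example figures are illustrative assumptions, and separately metered AI/ML compute is not modeled.

```python
def projected_monthly_bill(current_gb_per_day: float,
                           price_per_gb: float,
                           months_ahead: float,
                           doubling_months: float = 18.0,
                           days_per_month: int = 30) -> float:
    """Project a monthly ingestion bill under steady exponential log growth."""
    growth = 2 ** (months_ahead / doubling_months)
    return current_gb_per_day * growth * days_per_month * price_per_gb
```

For example, at a hypothetical 100 GB/day and $0.50 per GB, the bill is $1,500 per month today, $3,000 at month 18, and $6,000 at month 36 of a three-year contract. Ask the vendor to run the same projection with their actual pricing tiers.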

Data Portability and Vendor Lock-In Risks

Portability risk compounds over time, even when the initial deployment looks straightforward.

Beyond the pricing question, probe three portability risks:

Data format at exit: Will your data be exported in open formats like Parquet and JSON, or proprietary ones?
Re-baselining cost: What's the timeline to rebuild the AI behavioral baseline if you migrate after two or more years?
Acquisition protection: What contractual protections exist if the vendor is acquired or sunsets the product?

Turning 28 Questions Into a Repeatable Vendor Scorecard

A repeatable scorecard makes sure every vendor gets graded against the same bar. Classify requirements as mandatory or optional, and weight each by organizational priority. Lock in your scoring thresholds and domain weights before you look at any vendor responses. Use a five-point scale that distinguishes current capability from roadmap claims.
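
A scorecard like this can be as simple as a weighted sum with a hard floor on mandatory domains. The sketch below is one possible shape, not a prescribed method: the weights, the minimum score of 3 for mandatory domains, and the convention of scoring only current capability (roadmap items score as if absent) are all assumptions to adjust before you see any vendor responses.

```python
def score_vendor(weights: dict[str, float],
                 scores: dict[str, int],
                 mandatory: set[str],
                 min_mandatory: int = 3) -> dict:
    """Weighted 1-5 scorecard with a pass/fail floor on mandatory domains.

    scores should reflect current capability only, not roadmap claims.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    failed = sorted(d for d in mandatory if scores[d] < min_mandatory)
    weighted = sum(weights[d] * scores[d] for d in weights)
    return {
        "weighted_score": round(weighted, 2),
        "disqualified": bool(failed),
        "failed_mandatory": failed,
    }


# Illustrative weights for the six domains; fix these before reviewing vendors.
WEIGHTS = {
    "data_foundation": 0.20,
    "investigation": 0.20,
    "detection_workflow": 0.15,
    "accuracy": 0.20,
    "human_oversight": 0.15,
    "cost_and_lock_in": 0.10,
}
```

Keeping the weights in version control, with a commit dated before the first vendor response arrives, gives you a record that the bar was set in advance.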

Run a 30-to-60-day proof-of-concept against pre-defined success criteria (MTTD reduction, false positive rate, analyst hours saved) with baselines set before the POC starts. One practitioner-first pattern is an architecture grounded in a security data lake you own, detection rules you can read and version-control, and human-in-the-loop controls you can audit. Panther follows that pattern.

Whether you choose Panther or another platform, these 28 questions help you separate what's real from what's a demo.