At roughly 30 minutes per investigation, a Security Operations Center (SOC) analyst can meaningfully review about 15 alerts in an eight-hour shift. Most teams face far more than that: SOC teams receive an average of 4,484 alerts per day, with 67% going ignored due to alert fatigue and false positive volume.
That gap between what analysts can review and what hits the queue is where real threats hide. Machine learning is one of the few tools that can close part of it, but the category is crowded with inflated benchmarks and marketing that treats "AI-powered" like a complete explanation.
This article breaks down how machine learning actually works in security operations, where it delivers measurable results, where it falls short, and what to look for when evaluating tools that claim ML capabilities.
Key Takeaways:
Machine learning, AI, and LLM-based agents are distinct technologies with different strengths and failure modes; understanding the differences helps you evaluate security tools honestly rather than following buzzwords.
The core applications delivering real value today are anomaly detection and user behavior analytics (UEBA), malware and phishing classification, and AI-assisted alert triage; production deployments that blend supervised and unsupervised ML consistently outperform either approach alone.
Measurable benefits are real but bounded: organizations extensively using security AI and automation resolve breaches 80 days faster and save an average of $1.9 million per breach; however, model drift, explainability gaps, and data quality issues remain persistent constraints.
When evaluating ML-driven security tools, prioritize transparency into how models reach conclusions, human-in-the-loop controls for sensitive actions, and data ingestion quality over raw accuracy benchmarks.
How Machine Learning Is Used in Security
When vendors say "machine learning," they're usually referring to one of several distinct technical approaches rather than a single unified capability: models that score and classify security data, language models that help analysts work faster, and agents that can take multi-step action.
Machine learning vs. AI vs. LLM-based agents
Machine learning is the oldest and most operationally mature category. ML models operate on structured data (logs, network flows, process telemetry) and produce outputs like anomaly scores, classification labels, or risk rankings. AI was already embedded across the SOC workflow well before LLMs arrived, with SIEMs using ML for event correlation and anomaly detection.
Large language models (LLMs) provide natural language understanding and generation. In production SOC workflows, LLMs commonly support alert investigation and triage after alerts fire.
LLM-based agents add an execution layer. An agent can call APIs, run scripts, query external tools, and take action based on multi-step reasoning, making decisions, adapting to new information, and executing steps without waiting for human approval at each one. That autonomy is what separates agents from traditional automation, and it introduces both speed benefits and new risk categories that Panther addresses through human-in-the-loop controls.
Types of machine learning used in security
Four ML types show up most frequently in security tooling, each suited to different problems.
Supervised learning trains on labeled examples (known malware, known phishing URLs) and classifies new inputs. Even traditional ML like Random Forest achieves strong results in production classification benchmarks. The tradeoff: supervised models degrade against novel attack variants not represented in training data.
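To make that concrete, here is a minimal sketch of supervised classification: a Random Forest fit on labeled URL features. The feature names, values, and tiny dataset are illustrative assumptions, not drawn from any real benchmark.

```python
# Minimal sketch: supervised classification with a Random Forest.
# Feature names and values are illustrative, not from a real dataset.
from sklearn.ensemble import RandomForestClassifier

# Each row is a hand-built feature vector for one URL:
# [url_length, num_subdomains, has_ip_literal, num_special_chars]
X = [
    [74, 3, 1, 12],   # labeled phishing
    [21, 1, 0, 1],    # labeled benign
    [102, 4, 1, 19],  # labeled phishing
    [33, 1, 0, 2],    # labeled benign
]
y = [1, 0, 1, 0]      # 1 = phishing, 0 = benign

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Score a new URL's features. The model only knows patterns present in the
# training labels, which is why novel variants slip through.
print(clf.predict_proba([[88, 2, 1, 15]]))
```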
Unsupervised learning finds patterns without pre-tagged examples, building a representation of "normal" and flagging deviations. This is the engine behind behavioral baselining in UEBA and network detection tools. Unsupervised ML has to learn the local context of what is normal for a given environment, which takes time and tuning.
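A minimal sketch of the unsupervised side, using an Isolation Forest to learn a baseline from observed activity and score new events against it. The features (logins per hour, kilobytes uploaded, distinct hosts touched) and values are illustrative; real baselines are learned per entity over weeks.

```python
# Minimal sketch: unsupervised baselining with an Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

# One row per observed hour of activity during the baselining window.
normal_activity = np.array([
    [3, 120, 2],
    [5, 340, 3],
    [2, 80, 1],
    [4, 200, 2],
    [6, 400, 3],
])

model = IsolationForest(contamination=0.01, random_state=42)
model.fit(normal_activity)

# Activity far outside the learned baseline scores as anomalous.
new_activity = np.array([[45, 9800, 27]])
print(model.decision_function(new_activity))  # lower = more anomalous
print(model.predict(new_activity))            # -1 flags an outlier
```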
Reinforcement learning trains agents through trial-and-error interaction with an environment. It's more active in red team research than in production SOC tooling today.
Deep learning uses multi-layer neural networks to learn directly from raw data. Autoencoders in particular have become a technique practitioners can build themselves for signature-free anomaly detection across logs and network traffic.
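As a rough illustration, a small autoencoder trained only on normal telemetry flags events it reconstructs poorly. The architecture, dimensions, and training loop below are a minimal sketch under those assumptions, not a production design.

```python
# Minimal sketch: an autoencoder for signature-free anomaly detection.
# Trained only on "normal" feature vectors; high reconstruction error on a
# new event suggests behavior the model has not seen before.
import torch
import torch.nn as nn

class LogAutoencoder(nn.Module):
    def __init__(self, n_features=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 8), nn.ReLU(), nn.Linear(8, 4))
        self.decoder = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = LogAutoencoder()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

normal_events = torch.randn(512, 16)  # stand-in for normalized log features

for _ in range(50):  # train to reconstruct normal traffic
    optimizer.zero_grad()
    loss = loss_fn(model(normal_events), normal_events)
    loss.backward()
    optimizer.step()

# At inference, per-event reconstruction error becomes the anomaly score.
new_event = torch.randn(1, 16)
print(loss_fn(model(new_event), new_event).item())
```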
Where Machine Learning Delivers Real Value
The most useful applications reduce analyst workload without hiding too much of the reasoning. Behavior-based detection, content classification, and alert triage are where ML earns its keep.
Anomaly detection and user behavior analytics (UEBA)
UEBA uses machine learning to build behavioral baselines for users and entities, then flags deviations that may indicate compromise. Credential-based attacks leave no malware signature. Compromised credentials were an initial access vector in 22% of confirmed breaches last year, and rule-based SIEM correlation alone can't detect them.
UEBA works best when it blends supervised and unsupervised ML. Pure unsupervised anomaly detection generates excessive alert volumes, while blended approaches reduce analyst workload. At enterprise scale, unusual activity is often part of the normal operating environment, and chasing every anomaly is a waste of time.
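In practice, the blend can be as simple as requiring agreement between the two signals before anything reaches an analyst. The thresholds in this sketch are illustrative assumptions.

```python
# Minimal sketch: blend an unsupervised anomaly score with a supervised risk
# score so only activity that is both unusual and consistent with known attack
# patterns gets escalated. Thresholds are illustrative assumptions.
def blended_alert(anomaly_score: float, supervised_risk: float,
                  anomaly_threshold: float = 0.7, risk_threshold: float = 0.6) -> bool:
    if anomaly_score < anomaly_threshold:
        return False   # within the entity's learned baseline
    if supervised_risk < risk_threshold:
        return False   # unusual, but matches no known-bad pattern; suppress
    return True        # escalate for human review

print(blended_alert(anomaly_score=0.82, supervised_risk=0.75))  # True: escalate
print(blended_alert(anomaly_score=0.82, supervised_risk=0.20))  # False: suppress
```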
Malware, phishing, and spam classification
ML-based classification and signature-based detection each have strengths and limitations, and neither consistently outperforms the other across all scenarios.
For phishing, production-scale results are strong. Logistic regression with TF-IDF has achieved 95.41% accuracy on phishing classification with sub-200ms response time. Google's RETVec vectorizer reportedly improved Gmail spam detection by 38%, including cases of character-level obfuscation.
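The recipe behind numbers like these is straightforward to sketch: TF-IDF features feeding a logistic regression classifier. The toy corpus below is illustrative; accuracy figures like the one cited above require large, well-labeled datasets.

```python
# Minimal sketch: TF-IDF + logistic regression for phishing classification.
# The tiny corpus and labels are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "Your account has been suspended, verify your password immediately",
    "Quarterly report attached for review before Thursday's meeting",
    "Unusual sign-in detected, click here to confirm your identity",
    "Lunch order for the offsite is due by noon tomorrow",
]
labels = [1, 0, 1, 0]  # 1 = phishing, 0 = legitimate

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(emails, labels)

# Score a new message; the output is the probability of each class.
print(clf.predict_proba(["Confirm your password to avoid account suspension"]))
```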
Alert triage and investigation support
ML-driven triage addresses the structural imbalance between alert volume and investigative capacity. The pipeline typically works in stages: enrich alerts with context, auto-close confirmed false positives, escalate high-confidence cases, and assist investigation on escalated alerts.
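A minimal sketch of that staged flow, with hypothetical function names and thresholds standing in for a real scoring model and enrichment pipeline:

```python
# Minimal sketch of staged triage: enrich, auto-close confident false
# positives, escalate high-confidence threats, queue the rest for assisted
# investigation. Names and thresholds are illustrative, not a product's API.
def triage(alert: dict, score_alert, enrich) -> str:
    alert = enrich(alert)            # add asset owner, geo, threat intel context
    confidence = score_alert(alert)  # model's probability of a true positive

    if confidence < 0.05:
        return "auto-close"              # confirmed false positive, logged for audit
    if confidence > 0.90:
        return "escalate"                # route to an analyst immediately
    return "assisted-investigation"      # human reviews with ML-gathered evidence

example = {"rule": "impossible_travel", "user": "jdoe", "src_ip": "203.0.113.7"}
print(triage(example, score_alert=lambda a: 0.93, enrich=lambda a: a))
```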
The volume reduction from ML-driven triage is substantial when it's done well. Panther customers routinely see it: Snyk reduced alert volume by 70%, Docker cut false positives by 85% year-over-year, and Cresta triages alerts 50% faster with AI.
The strongest real-world numbers come from an AI SOC pilot run with strict guardrails: enforced citations, human approval gates, full audit logging. The results: mean time to discovery improved by 26-36%, MTTR improved by 22%, and false positives dropped by 16 percentage points.
Panther, a cloud-focused security monitoring and AI SOC platform, takes a similar approach with its AI SOC Agent, which triages alerts by reviewing detection logic and per-detection runbooks, gathering evidence across connected data sources, and surfacing a full-context explanation for the analyst to review.
The Benefits Security Teams Actually See
Three operational outcomes matter most: fewer false positives, faster investigations, and broader coverage without matching headcount growth.
Fewer false positives on high-volume log sources
False positive management is expensive, costing organizations an average of $1.27 million per year.
Docker's security team faced this exact challenge with high-volume cloud logs across AWS, GCP, and Azure. Using correlation rules and Python-based detection logic, they achieved an 85% false positive reduction year-over-year while tripling ingestion and maintaining 100% visibility. Analysts stopped chasing false positives and started working the real threats.
Faster investigations and compressed dwell time
Organizations deploying security AI and automation extensively detect and contain incidents 98 days faster than non-adopters.
For individual analysts, the impact is more direct. LLM-assisted triage reduced ticket completion time by approximately 40% on average in peer-reviewed testing.
Coverage that scales without scaling headcount
Teams of 3-10 people cannot hire their way to full coverage. Automation lets a small team cover ground that would otherwise require additional hires, and the savings compound over three-year planning cycles. 53% of security teams plan to integrate AI and ML for cloud threat detection, though these capabilities still require staff capable of managing the tooling.
Where Machine Learning Falls Short
Drift, weak explainability, and bad data pipelines are the three failure modes that show up most in production.
Model drift and the ongoing tuning burden
Models trained on last quarter's data may not reflect this quarter's environment. Adversarial perturbations and concept drift are active areas of research in ML-based network intrusion detection. In security, drift carries an adversarial dimension. There's a meaningful difference between noisy data from human error and adversarial concept drift, where attackers actively work to disturb the learning system.
An attacker who understands your retraining cadence can deliberately engineer drift to corrupt the model.
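Catching drift before it corrupts detections usually means monitoring feature distributions between the training window and live traffic. A minimal sketch, comparing one illustrative feature with a two-sample Kolmogorov-Smirnov test; the feature, threshold, and synthetic data are assumptions.

```python
# Minimal sketch: compare a feature's live distribution against the training
# window with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_window = rng.normal(loc=200, scale=40, size=5000)  # e.g., bytes per session at train time
current_window = rng.normal(loc=310, scale=60, size=5000)   # the same feature observed this week

statistic, p_value = ks_2samp(training_window, current_window)
if p_value < 0.01:
    print(f"Drift detected (KS={statistic:.3f}); review recent labels and schedule retraining")
```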
The black-box explainability problem
The models that perform best are often the hardest to explain, and in a SOC, that's disqualifying. Analysts can't trust a flag they can't interrogate, and the decisions they make on those flags can have significant consequences for their organization. ML adoption in SOCs is limited largely by this explainability gap.
An analyst who cannot see why a model flagged a particular event cannot determine whether that flag is trustworthy, especially under time pressure during incident response.
Data quality is the real ceiling
Widespread methodological pitfalls show up even in top-conference ML security papers: sampling bias, label quality issues, and temporal leakage that inflates benchmark accuracy. Security data is inherently imbalanced (legitimate traffic vastly outnumbers malicious events), labeling requires expensive expert judgment, and mislabeled data degrades model training without producing obvious error signals.
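Temporal leakage in particular has a simple structural fix: split by time, never by random shuffle. A minimal sketch with illustrative columns:

```python
# Minimal sketch: split security events by time instead of a random shuffle,
# so the model is never evaluated on events older than its training data.
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "feature": range(10),
    "label": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
}).sort_values("timestamp")

cutoff = int(len(events) * 0.8)
train = events.iloc[:cutoff]   # strictly the past
test = events.iloc[cutoff:]    # strictly the future

# A random split here would leak "future" events into training and inflate
# benchmark accuracy, one of the pitfalls described above.
print(len(train), len(test))
```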
As Josh Liburdi, Security Engineer at Brex, puts it, "Every security team, at least every SecOps team, really lives or dies by their data and the quality of their data."
What to Look For When Evaluating ML-Driven Security Tools
Tool evaluation should focus less on benchmark claims and more on whether the system will hold up in your environment. Three traits matter more than any benchmark: explainability, approval controls, and data quality.
Transparency about how models reach conclusions
You can't trust a flag without seeing what drove it. Panther AI surfaces the full reasoning trail for every alert decision: enrichments, detection logic, pivot queries, and evidence. During a POC with any ML tool, ask the vendor to walk through a specific alert and identify which data features drove the score.
If they cannot demonstrate feature-level attribution (through SHAP, LIME, or an equivalent mechanism), the tool is operationally a black box.
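For tree-based models, per-alert attribution is a few lines with SHAP. The sketch below uses illustrative feature names and data, and accounts for the fact that SHAP's output shape varies across versions.

```python
# Minimal sketch: per-alert feature attribution with SHAP for a tree model.
# Feature names and data are illustrative assumptions.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

feature_names = ["logins_last_hour", "kb_uploaded", "new_country", "off_hours"]
X = np.array([[3, 120, 0, 0], [5, 340, 0, 1], [45, 9800, 1, 1], [2, 80, 0, 0]])
y = np.array([0, 0, 1, 0])

clf = RandomForestClassifier(random_state=42).fit(X, y)

explainer = shap.TreeExplainer(clf)
sv = explainer.shap_values(np.array([[40, 8700, 1, 1]]))

# Depending on the SHAP version this is a list (one array per class) or a
# single 3D array; take the positive-class attribution either way.
attributions = sv[1][0] if isinstance(sv, list) else sv[0, :, 1]

# The largest absolute contributions are what an analyst should be shown.
print(dict(zip(feature_names, attributions)))
```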
Human-in-the-loop controls for sensitive actions
Automation has to be balanced with human oversight, especially for high-impact or ambiguous security actions. As Stephen Gubenia, Head of Detection Engineering for Threat Response at Cisco Meraki, puts it, "You have to have that human in the loop early and often."
Panther AI enforces this with a tool approval feature that requires explicit user approval before the AI executes sensitive actions like updating alert status or creating detection rules, with all decisions logged in audit trails. Whatever tool you evaluate, ask: which actions are fully automatic, at what confidence thresholds, and can you configure approval requirements per action type?
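As a generic illustration of the pattern (not any specific product's implementation), an approval gate can be as simple as refusing to execute sensitive actions without a recorded human decision, with every outcome audit-logged.

```python
# Generic sketch of a human-in-the-loop approval gate. Action names and the
# audit format are illustrative assumptions.
SENSITIVE_ACTIONS = {"update_alert_status", "create_detection_rule", "disable_user"}

def audit_log(action, params, outcome, actor="agent"):
    print({"action": action, "params": params, "outcome": outcome, "actor": actor})

def execute(action: str, params: dict, approved_by: str | None = None) -> str:
    if action in SENSITIVE_ACTIONS and approved_by is None:
        audit_log(action, params, outcome="pending_approval")
        return "queued for human approval"
    audit_log(action, params, outcome="executed", actor=approved_by or "agent")
    return "executed"

print(execute("enrich_alert", {"alert_id": "a-123"}))          # low risk: runs automatically
print(execute("update_alert_status", {"alert_id": "a-123"}))   # sensitive: waits for a human
print(execute("update_alert_status", {"alert_id": "a-123"}, approved_by="analyst@example.com"))
```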
Data structure and ingestion quality
ML effectiveness is multiplicative with data quality. Good models on bad data produce bad results, which is why Panther's architecture treats ingestion quality, schema fidelity, and data lineage as first-class concerns, not afterthoughts.
During evaluation, ask: what log sources are required versus optional? What happens to detection accuracy when required sources are missing? What data was the model trained on, and can you fine-tune on your own environment?
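One concrete ingestion-quality control is schema validation at the pipeline's edge, so malformed events are quarantined before they ever reach detections or models. A minimal sketch with illustrative field names:

```python
# Minimal sketch: quarantine events that fail schema checks before they reach
# detection logic or models. Field names are illustrative assumptions.
REQUIRED_FIELDS = {"timestamp", "event_type", "actor", "source_ip"}

def validate(event: dict) -> list[str]:
    """Return the list of required fields missing from the event."""
    return sorted(REQUIRED_FIELDS - event.keys())

event = {"timestamp": "2024-06-01T12:00:00Z", "event_type": "login", "actor": "jdoe"}
missing = validate(event)
if missing:
    # Quarantined events become an ingestion-quality metric, not silent drops.
    print(f"Quarantine event, missing fields: {missing}")
```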
Making Machine Learning Work Without the Hype
Machine learning delivers genuine results when deployed with clear-eyed expectations. The tools that deliver measurable results share three traits: transparent reasoning, human oversight, and high-quality data pipelines.
Most teams aren't there yet. 40% of SOCs use AI/ML tools without making them a defined part of operations, and 42% rely on out-of-the-box settings with no customization. The gap between "deployed" and "operationalized" is where the real work happens.
For lean security teams, the goal is not to automate everything. The goal is to automate the right things (false positive filtering, alert enrichment, routine triage) so your team can focus on work that requires human judgment: threat hunting, detection engineering, and incident response.
Panther supports that approach through detection-as-code and the AI SOC analyst, which shows its reasoning for analyst review. That's the principle underneath every capability: ML that keeps humans in control of consequential decisions.
See it in action
Most AI closes the alert. Panther closes the loop.
