Data Lake Architecture: Components, Design, and Best Practices

The average cloud-native security team pulls logs from ten or more sources. Each source has its own format, its own retention window, and its own query interface. During an investigation, analysts spend more time navigating between consoles than analyzing what actually happened.

A data lake gives you one place to store all of it, raw and queryable, without forcing every log source through a rigid transformation pipeline first. But "dump everything in S3" isn't an architecture. Without the right ingestion, processing, cataloging, and governance layers, you end up with a data swamp that's expensive to store and impossible to search.

This guide covers how data lake architecture actually works: the five core layers, the patterns that map to security workflows, and the design choices that keep your lake queryable as it scales.

Key Takeaways:

  • Data lakes store raw, multi-format data using schema-on-read, while data warehouses enforce structure at ingestion, with different trade-offs for security teams.

  • Five architectural layers form a functional data lake: ingestion, storage, processing, cataloging, and governance. Skipping any one accelerates the slide from data lake to data swamp.

  • The medallion architecture (Bronze, Silver, Gold) maps directly to security workflows: raw evidence preservation, normalized event search, and pre-computed detection tables.

  • Governance, open formats, and compute-storage decoupling are foundational decisions that determine whether your data lake scales predictably or locks you into a single vendor's pricing model.

What Is Data Lake Architecture?

Data lake architecture is the design framework for a centralized repository that stores structured, semi-structured, and unstructured data in its native format using flat object storage. Unlike traditional databases, a data lake applies structure at query time, a pattern called schema-on-read, rather than requiring data to conform to a predefined schema before it's written.

In practice, firewall syslogs, JSON-formatted CloudTrail events, endpoint telemetry, and identity provider logs all flow into a single queryable store without forcing every source through a rigid transformation pipeline first.
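To make schema-on-read concrete, here is a minimal PySpark sketch that applies structure only when the query runs; the bucket path and field names are illustrative placeholders, not a prescribed layout.

```python
# Schema-on-read: the raw JSON was written as-is, and structure is applied
# only by the query that reads it. Path and field names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Raw CloudTrail-style JSON sits in object storage exactly as it arrived.
raw = spark.read.json("s3://example-security-lake/raw/cloudtrail/")

# The "schema" is just the projection this particular query chooses to apply.
console_logins = (
    raw.where(raw.eventName == "ConsoleLogin")
       .select("eventTime", "sourceIPAddress", "userIdentity.arn")
)
console_logins.show(truncate=False)
```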

How Data Lakes Differ From Data Warehouses

The difference comes down to when structure gets enforced. A data warehouse uses schema-on-write: data must conform to a predefined schema before it's loaded. A data lake defers that structure until query time, storing raw data as-is.

  • Schema-on-write functions as a validation gate: malformed data is rejected before persistence, but you can't ingest a new log source until you've defined its schema.

  • Schema-on-read gives you immediate ingestion flexibility but introduces schema evolution risk.

If a source changes a numeric field to a string without warning, every downstream pipeline that depends on that column breaks. This is one of the most common failure modes in production data lakes, and it's why schema validation at ingestion matters even in a schema-on-read architecture.
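One pragmatic mitigation is a lightweight type check at ingestion. The sketch below is a minimal Python illustration; the expected-type map, field names, and quarantine routing are assumptions for the example, not a specific product's API.

```python
# Minimal ingest-time validation for a schema-on-read lake: catch type drift
# before it lands in the bronze zone. Field names and routing are illustrative.
EXPECTED_TYPES = {"bytes_sent": int, "src_port": int, "src_ip": str}

def drifted_fields(event: dict) -> list:
    """Return fields whose type no longer matches what downstream pipelines expect."""
    return [
        field
        for field, expected in EXPECTED_TYPES.items()
        if field in event and not isinstance(event[field], expected)
    ]

def route(event: dict) -> str:
    """Send clean events to bronze and drifted events to a quarantine prefix."""
    return "quarantine/" if drifted_fields(event) else "bronze/"
```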

| Dimension | Data Lake | Data Warehouse |
| --- | --- | --- |
| Schema approach | Schema-on-read: applied at query time | Schema-on-write: enforced at ingestion |
| Data types | Structured, semi-structured, unstructured | Primarily structured/tabular |
| ACID compliance | Requires open table formats (Delta Lake, Iceberg, Hudi) | Native |
| Cost structure | Low-cost object storage; compute billed separately | Higher storage cost; scaling increases with volume |

Core Components of a Data Lake

A data lake works only when its core layers work together. Skip any one of them, and you'll spend more time fighting your infrastructure than fighting attackers.

1. Data Ingestion Layer

The ingestion layer determines how quickly and reliably data reaches the lake. For security workloads, streaming ingestion is typically non-negotiable because threat detection requires low-latency data availability.

Apache Kafka (or Amazon MSK) is a common backbone for real-time security data. Amazon Kinesis Data Streams provides a serverless alternative with millisecond availability. Apache Flink sits alongside these as a stream processing engine whose checkpointing enables exactly-once processing, helping avoid duplicate events and missed alerts.
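As a rough illustration of streaming ingestion, the sketch below consumes a security log topic with the kafka-python client and commits offsets only after the write succeeds; the broker address, topic name, and write_to_bronze helper are hypothetical placeholders.

```python
# Streaming-ingestion sketch with kafka-python. Broker, topic, and the
# write_to_bronze helper are hypothetical placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "security.cloudtrail.raw",
    bootstrap_servers="kafka.example.internal:9092",
    group_id="lake-ingest",
    enable_auto_commit=False,  # commit only after a successful write
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    write_to_bronze(record.value)  # assumed helper that lands the raw event in object storage
    # Committing after the write means a crash re-delivers the event rather than dropping it.
    consumer.commit()
```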

2. Storage Layer

Object storage (Amazon S3, Azure ADLS Gen2, or GCS) is the foundation. What matters most here is compute-storage decoupling: storage scales independently of processing power, and multiple engines can query the same data.

Apache Parquet is the standard format for analytical workloads. Columnar storage enables predicate pushdown so queries read only the columns they need, and the format supports column-level encryption.

Open table formats (Apache Iceberg, Delta Lake, Apache Hudi) add ACID transactions, time travel, and schema evolution on top of these files, capabilities that forensic investigations depend on, particularly reconstructing what data looked like at a specific point in time.
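For example, a point-in-time lookback against an Iceberg table might look like the sketch below; the table name, columns, and timestamp are placeholders, and the exact time-travel syntax varies by Spark and Iceberg version.

```python
# Forensic lookback sketch: query an Iceberg table as it existed at a prior
# point in time. Table name, columns, and timestamp are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("forensic-lookback").getOrCreate()

snapshot = spark.sql("""
    SELECT event_time, user_arn, source_ip, action
    FROM lake.security.iam_events
    FOR TIMESTAMP AS OF TIMESTAMP '2025-06-01 00:00:00'
    WHERE user_arn = 'arn:aws:iam::123456789012:user/suspect'
""")
snapshot.show(truncate=False)
```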

3. Processing and Compute Layer

Three engines cover your primary workload patterns:

  • Apache Spark: batch-first with micro-batch streaming; the native runtime for Delta Lake

  • Apache Flink: streaming-first with sub-second event-driven processing; suited for threat detection

  • Trino/Presto: interactive SQL across heterogeneous sources in a single query
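To illustrate the interactive-SQL pattern, here is a minimal query through the Trino Python client; the host, catalog, and table names are assumptions for the example.

```python
# Interactive Trino query sketch. Host, catalog, schema, and table are placeholders.
import trino

conn = trino.dbapi.connect(host="trino.example.internal", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT user_arn, count(*) AS failed_logins
    FROM iceberg.security.auth_events
    WHERE outcome = 'FAILURE'
    GROUP BY user_arn
    ORDER BY failed_logins DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)
```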

4. Data Cataloging and Metadata Management

A data catalog makes your lake searchable. Without one, the lake is just storage. As Spotify's Chris Witter puts it, effective detection and response depends on being able to search and use stored data.

AWS Glue Data Catalog provides schema management with inference, versioning, and fine-grained access control through Lake Formation integration. For classification, Apache Atlas catalogs, classifies, and governs data assets with tags like PII, SENSITIVE, and EXPIRES_ON that automatically propagate through lineage, meaning PII tags follow data through downstream derived datasets.
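As a small illustration of catalog-driven discovery, the boto3 sketch below pulls a table's schema from the Glue Data Catalog; the database and table names are placeholders.

```python
# Look up a table's columns in the AWS Glue Data Catalog with boto3.
# Database and table names are hypothetical.
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="security_lake", Name="cloudtrail_bronze")

for col in table["Table"]["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"], col.get("Comment", ""))
```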

5. Governance, Security, and Access Control

Because multiple engines can access the same storage, access control must be enforced at the storage layer through IAM policies and Lake Formation permissions, not exclusively at the query engine. Any process with valid object storage credentials can bypass query-layer controls entirely. Governance needs to be part of the design from the start, not a phase-two project.
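As one concrete illustration, a column-scoped Lake Formation grant keeps read access enforced at the catalog/storage layer rather than only in a query engine; the role ARN, database, table, and column names below are placeholders, not a recommended policy.

```python
# Column-scoped Lake Formation grant sketch. ARN, database, table, and
# column names are placeholders.
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/soc-analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "security_lake",
            "Name": "auth_events_silver",
            "ColumnNames": ["event_time", "user_arn", "action", "outcome"],
        }
    },
    Permissions=["SELECT"],
)
```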

Common Data Lake Architecture Patterns

Two architecture patterns dominate production deployments. They solve different problems: medallion handles progressive data refinement, while Lambda and Kappa describe how batch and streaming paths are organized.

Medallion Architecture (Bronze, Silver, Gold)

The medallion architecture organizes data into three layers that map directly to security investigative workflows (a Bronze-to-Silver sketch follows the list):

  1. Bronze (raw ingestion): Data arrives in its original format with minimal transformation, just metadata columns like _ingest_timestamp, _source_system, and _log_type.

  2. Silver (normalized and validated): Bronze data is deduplicated, validated, and mapped to a common schema such as OCSF, with standardized fields for source/destination IPs, user identity, action, and timestamps. Malformed records get written to a quarantine table rather than dropped. As Christopher Watkins of WP Engine puts it, "The good thing about coming together and having like a unified schema is it works great for normalization."

  3. Gold (detection and reporting): Silver events are enriched with threat intelligence and joined against asset and identity data before being aggregated for detection and alerting.
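Here is a minimal PySpark sketch of the Bronze-to-Silver step from the list above: deduplicate, map a handful of fields to a common schema, and quarantine malformed records rather than dropping them. Paths, source field names, and the target columns are illustrative, not a full OCSF mapping.

```python
# Bronze-to-Silver sketch: dedupe, normalize field names, quarantine bad records.
# Paths and column names are placeholders; this is not a complete OCSF mapping.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze = spark.read.format("delta").load("s3://example-security-lake/bronze/vpn_logs/")

normalized = (
    bronze.dropDuplicates(["event_id"])
          .select(
              F.col("ts").cast("timestamp").alias("event_time"),
              F.col("src").alias("src_ip"),
              F.col("dst").alias("dst_ip"),
              F.col("user").alias("user_identity"),
              F.col("action"),
          )
)

# Records whose timestamp failed to parse go to a quarantine table, not the floor.
valid = normalized.where(F.col("event_time").isNotNull())
quarantine = normalized.where(F.col("event_time").isNull())

valid.write.format("delta").mode("append").save("s3://example-security-lake/silver/vpn_logs/")
quarantine.write.format("delta").mode("append").save("s3://example-security-lake/quarantine/vpn_logs/")
```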

Lambda and Kappa Architectures

  • Lambda architecture uses two parallel paths: a batch layer for full historical recomputation and a speed layer for low-latency updates. The trade-off is maintaining two codebases doing conceptually similar work.

  • Kappa architecture eliminates the batch layer, treating all data as streams. Historical reprocessing happens by replaying events through updated code.

For security detection, Kappa is often the better fit: detection rules operate identically on live streams or replayed historical data, and Flink's exactly-once semantics eliminate duplicate alert generation.

Lambda is better suited for workloads where full batch recomputation is structurally different, such as ML model training.

How to Design a Data Lake That Doesn't Become a Data Swamp

Three design pillars prevent swamp conditions: decoupled compute and storage, open formats, and governance from day one.

1. Decouple Compute and Storage

Tightly coupled storage and compute, the Hadoop legacy model, makes cost optimization difficult. When compute is expensive to scale, teams defer data quality jobs and governance scans, and governance debt accumulates until the lake is unqueryable.

This played out at Cockroach Labs, where legacy SIEM tooling forced cost-driven compromises and limited log retention to 30 days. After migrating to Panther, they ingested 5x more logs while cutting SecOps costs by over $200K. Decoupling storage from compute removes the constraints that force security teams into coverage gaps.

2. Choose Open Formats to Avoid Vendor Lock-In

Storing data in a proprietary format prevents specialized engines from accessing the same data without expensive ETL copies. Apache Iceberg, Delta Lake, and Apache Hudi are open table formats that separate the table definition from the engine that queries it.

For new projects, Apache Iceberg is a strong option for multi-engine deployments. If you're already using Databricks, Delta Lake with UniForm enabled adds Iceberg compatibility.
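A minimal example, assuming a Spark session already configured with an Iceberg catalog named lake, is sketched below; the table name, columns, and partition transform are illustrative.

```python
# Create an Iceberg table that any Iceberg-aware engine (Spark, Trino, Flink)
# can query. Assumes a Spark session configured with an Iceberg catalog
# named "lake"; table and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create-iceberg-table").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.security.auth_events (
        event_time  TIMESTAMP,
        user_arn    STRING,
        src_ip      STRING,
        action      STRING,
        outcome     STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")
```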

3. Build Governance In From Day One

| Phase | Actions |
| --- | --- |
| Day 0–1 | Define metadata standards, naming conventions, and ownership model; deploy data catalog; enable audit logging on all storage and compute |
| Week 1–Month 1 | Implement RBAC with least-privilege defaults; configure column-level classification for PII; add data quality validation at ingestion |
| Quarter 1 | Implement lineage tracking across all pipelines; establish and enforce data lifecycle policies |

Panther takes this approach with its Security Data Lake: complete data ownership with governance built in from the start.

As Dave Herrald of Databricks puts it, "being in control of your data and direct ownership of your data is a big thing." Detection rules in Python, SQL, or YAML integrate into CI/CD pipelines with version control, so every change to your detection logic is tracked the same way your engineering team tracks code changes.
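As a rough illustration of detection-as-code, the sketch below follows the rule/title function convention Panther documents for Python detections; the CloudTrail fields shown are illustrative and the logic is not a tuned production rule.

```python
# Detection-as-code sketch in the Python rule/title style Panther documents.
# Field names follow CloudTrail's ConsoleLogin event; the logic is illustrative,
# not a tuned production rule.
def rule(event):
    # Alert on AWS console logins that succeed without MFA.
    if event.get("eventName") != "ConsoleLogin":
        return False
    mfa_used = event.get("additionalEventData", {}).get("MFAUsed") == "Yes"
    succeeded = event.get("responseElements", {}).get("ConsoleLogin") == "Success"
    return succeeded and not mfa_used

def title(event):
    user = event.get("userIdentity", {}).get("arn", "<unknown user>")
    return f"Console login without MFA by {user}"
```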

Data Lake Security and Observability

Encryption, access control, audit logging, and pipeline health need to be built into every layer of your data lake.

  • Encryption at rest is a baseline requirement, not a differentiator. Amazon S3 has enforced default encryption since January 2023, with Azure and GCP offering equivalent defaults and optional customer-managed keys.

  • Access control requires a tiered model. High-privilege accounts need the tightest controls, data engineers get read/write access to specific zones, and analysts get read-only access to approved datasets. AWS Lake Formation enforces this at the table, column, row, and cell level.

  • Audit logging must cover four planes: identity, control, data, and network. S3 Object Lock in WORM mode helps ensure audit records can't be modified during a retention period (a configuration sketch follows this list).

  • Pipeline health is itself a security observable. Throughput spikes suggest exfiltration, drops suggest disruption, and schema drift could signal data injection.
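The sketch below shows one way to set a default WORM retention on an audit-log bucket with boto3 (Object Lock itself has to be enabled when the bucket is created); the bucket name and retention period are placeholders.

```python
# Default WORM retention sketch for an audit-log bucket. Bucket name and
# retention period are placeholders; Object Lock must already be enabled
# on the bucket (set at creation time).
import boto3

s3 = boto3.client("s3")

s3.put_object_lock_configuration(
    Bucket="example-audit-logs",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}},
    },
)
```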

Your Architecture Sets the Ceiling

Your architecture sets the ceiling for every detection rule, investigation, and compliance report built on top of it. Decouple compute from storage so you can scale each independently, choose open table formats so you're not locked into a single engine, and build governance in from day one so your lake stays queryable as it grows.

Panther applies these principles directly: a Snowflake-backed Security Data Lake for complete data ownership, detection-as-code in Python, SQL, or YAML with CI/CD integration, and Panther AI for triage that shows its work and keeps your team in the loop.

If you're rethinking how your data architecture supports SecOps workflows, start with the storage and governance layer — everything else is built on top of those decisions.
