Data Analytics Architectures
Session 9.3 · ~5 min read
The Problem With Data at Scale
When a system is small, one database handles everything: transactions, queries, reports, analytics. As the system grows, these workloads conflict. A complex analytics query that scans millions of rows locks resources needed by transactional writes. Real-time dashboards demand low latency. Monthly reports demand completeness. Trying to serve both from the same store is a recipe for a system that does neither well.
Data analytics architectures solve this by separating concerns. Different layers handle different workloads. The question is how to separate them and how many layers you actually need.
Lambda Architecture
Nathan Marz introduced Lambda architecture around 2011 to solve a specific problem: how do you get both accuracy (batch processing over complete datasets) and low latency (real-time processing over recent data)?
The answer: run both, in parallel.
Lambda architecture has three layers. The batch layer processes the complete dataset periodically (hourly, daily) using tools like Apache Spark or Hadoop MapReduce. It produces accurate, comprehensive views. The speed layer processes incoming data in real time using tools like Apache Flink or Kafka Streams. It produces approximate, low-latency views. The serving layer merges results from both layers and serves queries.
```mermaid
graph TD
    DS[Data Sources] --> MS[(Master Dataset<br>Immutable Log)]
    MS --> BL[Batch Layer<br>Spark / Hadoop]
    DS --> IL[Ingestion Layer]
    IL --> SL[Speed Layer<br>Flink / Kafka Streams]
    BL --> SV[Serving Layer<br>Merged Views]
    SL --> SV
    SV --> Q[Queries & Dashboards]
    style DS fill:#c8a882,stroke:#111110,color:#111110
    style IL fill:#8a8478,stroke:#111110,color:#ede9e3
    style BL fill:#6b8f71,stroke:#111110,color:#111110
    style SL fill:#c47a5a,stroke:#111110,color:#111110
    style SV fill:#ede9e3,stroke:#111110,color:#111110
    style Q fill:#c8a882,stroke:#111110,color:#111110
    style MS fill:#8a8478,stroke:#111110,color:#ede9e3
```
The batch layer is the source of truth. It recomputes views from the complete dataset. The speed layer compensates for the batch layer's latency by processing only the data that has arrived since the last batch run. When a new batch completes, the speed layer's data for that period is discarded.
The strength of Lambda is correctness with low latency. The weakness is complexity. You maintain two separate processing codebases (batch and streaming) that must produce compatible results. When they disagree, debugging is painful.
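The serving layer's merge logic can be sketched with a toy example. This is a minimal model, not any particular framework's API: the event records, page-count metric, and function names are all hypothetical, and a real serving layer would also handle the cutover when a new batch run supersedes the speed view.

```python
from collections import Counter

def batch_view(master_dataset):
    # Batch layer: recompute the full, accurate view from the complete dataset.
    return Counter(e["page"] for e in master_dataset)

def speed_view(recent_events):
    # Speed layer: approximate view over data since the last batch run.
    return Counter(e["page"] for e in recent_events)

def serve(batch, speed):
    # Serving layer: merge both views to answer queries.
    return batch + speed

# Everything processed by the last batch run.
master = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
# Events that arrived after that run.
recent = [{"page": "/home"}]

merged = serve(batch_view(master), speed_view(recent))
print(merged["/home"])  # 3: two from the batch view, one from the speed view
```

The pain point the text describes lives in `batch_view` and `speed_view`: in a real system these are two separate codebases (say, Spark and Flink) that must stay semantically identical.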
Kappa Architecture
Jay Kreps proposed Kappa architecture in 2014 as a simplification. His argument: if your streaming layer is reliable and replayable enough, you do not need a separate batch layer. Just process everything as a stream.
```mermaid
graph TD
    DS2[Data Sources] --> LOG[(Immutable Log<br>Kafka)]
    LOG --> SP[Stream Processor<br>Flink / Kafka Streams]
    SP --> SV2[Serving Layer<br>Views & Indexes]
    SV2 --> Q2[Queries & Dashboards]
    LOG -->|Replay for<br>reprocessing| SP
    style DS2 fill:#c8a882,stroke:#111110,color:#111110
    style LOG fill:#6b8f71,stroke:#111110,color:#111110
    style SP fill:#c47a5a,stroke:#111110,color:#111110
    style SV2 fill:#ede9e3,stroke:#111110,color:#111110
    style Q2 fill:#c8a882,stroke:#111110,color:#111110
```
In Kappa, all data flows through an immutable, replayable log (typically Apache Kafka). A single stream processing engine reads from the log and builds serving views. When you need to reprocess historical data (because your logic changed or you found a bug), you replay the log through an updated processor and swap in the new views.
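The replay idea can be illustrated with a toy model where the log is an append-only list and a processor is a function folded over it. The event shape and the refund bug are invented for illustration; with a real Kafka log the "replay" step is a new consumer reading the topic from the earliest offset.

```python
# The immutable log: events are only ever appended, never mutated.
log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": -5},   # a refund
    {"user": "a", "amount": 20},
]

def processor_v1(events):
    # Buggy logic: silently drops refunds.
    view = {}
    for e in events:
        if e["amount"] > 0:
            view[e["user"]] = view.get(e["user"], 0) + e["amount"]
    return view

def processor_v2(events):
    # Fixed logic: refunds subtract from the running total.
    view = {}
    for e in events:
        view[e["user"]] = view.get(e["user"], 0) + e["amount"]
    return view

serving_view = processor_v1(log)  # the old view, built as events streamed in
serving_view = processor_v2(log)  # replay the full log, swap in the new view
print(serving_view)               # {'a': 30, 'b': -5}
```

Because the log is complete and replayable, fixing the bug never requires a separate batch codepath: the same streaming logic reprocesses history.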
Lambda architecture is the answer when you cannot choose between batch and real-time. Kappa is the answer when you realize you should not have to.
Data Lakehouse
Data lakes store raw data cheaply at massive scale (S3, ADLS, GCS) but lack the transactional guarantees and query performance of data warehouses. Data warehouses provide structured, fast querying but are expensive and rigid. The data lakehouse combines both.
Technologies like Delta Lake (Databricks), Apache Iceberg, and Apache Hudi add transactional capabilities (ACID transactions, schema enforcement, time travel) directly on top of object storage. You get the cost and flexibility of a data lake with the structure and reliability of a warehouse. No need to copy data between systems.
The lakehouse pattern has gained significant traction since 2022 because it eliminates one of the most painful aspects of data architecture: the ETL pipeline from lake to warehouse. When your lake is your warehouse, that pipeline disappears.
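The "time travel" capability those table formats provide can be modeled in a few lines. This is a deliberately simplified sketch, not how Delta Lake or Iceberg are implemented: real formats track versions through transaction logs and manifest files over object storage, but the reader-facing contract is the same, since every commit creates a new immutable snapshot and old versions stay queryable.

```python
class VersionedTable:
    """Toy model of a lakehouse table: commits append immutable snapshots."""

    def __init__(self):
        self._snapshots = []  # one immutable tuple of rows per version

    def commit(self, rows):
        # A write never mutates earlier versions; it adds a new one.
        self._snapshots.append(tuple(rows))
        return len(self._snapshots) - 1  # the new version number

    def read(self, version=-1):
        # Read the latest version by default, or any historical one.
        return self._snapshots[version]

table = VersionedTable()
v0 = table.commit([("2024-01-01", 100)])
v1 = table.commit([("2024-01-01", 100), ("2024-01-02", 250)])

print(len(table.read()))    # 2 rows at the latest version
print(len(table.read(v0)))  # 1 row when reading "as of" version v0
```

Reprocessing in a lakehouse leans on exactly this property: a bad write is repaired by reading an earlier version and committing a corrected one, not by restoring backups.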
Architecture Comparison
| Dimension | Lambda | Kappa | Lakehouse |
|---|---|---|---|
| Processing model | Batch + streaming (dual) | Streaming only (unified) | Batch + streaming on unified storage |
| Codebase | Two (batch and streaming) | One (streaming) | One or two (depends on tooling) |
| Reprocessing | Batch layer re-runs | Replay log through new processor | Time travel and replay |
| Latency | Low (speed layer) + high (batch layer) | Low (streaming only) | Variable (depends on query engine) |
| Complexity | High (maintaining two systems) | Medium (single path, replay logic) | Medium (table format management) |
| Storage cost | High (duplicate data across layers) | Medium (log retention) | Low (object storage pricing) |
| Best for | Mixed workloads requiring guaranteed accuracy | Streaming-first use cases | Analytics + ML on large datasets |
| Example tools | Hadoop + Flink + Cassandra | Kafka + Flink + Elasticsearch | Iceberg + Spark + Trino |
Column-Oriented Storage
Analytics queries typically scan a few columns across millions of rows. Row-oriented storage (PostgreSQL, MySQL) stores each row contiguously on disk. Reading three columns out of fifty means reading all fifty and discarding forty-seven. Column-oriented storage (Apache Parquet, ORC) stores each column contiguously. Reading three columns means reading only three columns.
The performance difference is dramatic. A query that scans a 1 TB table but only needs two columns might read 40 GB in Parquet versus the full 1 TB in row-oriented format. Column storage also compresses better because values in a single column tend to be similar (all timestamps, all country codes, all prices).
Parquet has become the default analytics format. It supports nested data, predicate pushdown (skip reading data that does not match the filter), and integrates with virtually every analytics tool: Spark, Presto, Trino, Athena, BigQuery.
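The row-versus-column read cost is easy to see in a toy model. This sketch counts values touched rather than bytes read from disk, and the table dimensions are invented; real engines like Parquet readers add compression and predicate pushdown on top of the same layout advantage.

```python
import random

N_ROWS, N_COLS = 1000, 50
rows = [[random.random() for _ in range(N_COLS)] for _ in range(N_ROWS)]
# The same data in columnar layout: one list per column.
columns = [list(col) for col in zip(*rows)]

def scan_row_store(rows, wanted):
    # Row store: every row comes off disk in full; unwanted values
    # are read and then discarded.
    return sum(len(row) for row in rows)

def scan_column_store(columns, wanted):
    # Column store: only the requested columns are touched at all.
    return sum(len(columns[c]) for c in wanted)

wanted = [0, 1, 2]  # query needs 3 of the 50 columns
print(scan_row_store(rows, wanted))     # 50000 values touched
print(scan_column_store(columns, wanted))  # 3000 values touched
```

The ratio here (3/50 of the data touched) is the same arithmetic behind the 40 GB versus 1 TB example above, and it compounds with compression, since each column holds values of one type.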
Choosing an Architecture
The choice depends on your workload profile. If you need both guaranteed accuracy for reports and sub-second dashboards, Lambda handles both but at the cost of maintaining two codebases. If your primary workload is real-time and you can tolerate replay-based reprocessing, Kappa is simpler. If your primary workload is analytics and machine learning over large historical datasets with some real-time ingestion, the lakehouse pattern offers the best cost-to-capability ratio.
Many production systems use hybrids. A Kappa pipeline feeds a lakehouse for long-term storage and ad-hoc analytics. The streaming layer serves real-time dashboards. The lakehouse serves monthly reports and ML training. The boundaries between these patterns are not walls. They are guidelines.
Further Reading
- From Kappa Architecture to Streamhouse (Ververica)
- Kappa Architecture is Mainstream Replacing Lambda (Kai Waehner)
- Lambda vs. Kappa Architecture in System Design (TheLinuxCode)
- Delta Lake (Databricks / Linux Foundation)
- Apache Iceberg (Apache Software Foundation)
Assignment
A fintech company needs two capabilities:
- Real-time fraud detection: Every transaction must be scored within 200ms. The model uses the last 30 days of transaction history for each user.
- Monthly compliance reports: Regulators require complete, accurate reports of all transactions, aggregated by category, with no data loss.
Design the data architecture. Answer these questions:
- Do you choose Lambda or Kappa? Why?
- What specific tools do you select for each layer?
- Where does the transaction data live? How long do you retain it?
- How do you ensure the monthly report is accurate even if the streaming layer drops events?
Draw the architecture diagram showing data flow from transaction ingestion to both the fraud detection output and the monthly report output.