Course → Module 3: Storage, Databases & Caching

When "Big" Stopped Being an Adjective

For most of this course, we have discussed systems that handle thousands or millions of requests. Big data is what happens when the data itself becomes the bottleneck, not because the queries are complex, but because there is simply too much of it for any single machine to store, process, or analyze in a reasonable time.

The global datasphere hit roughly 64 zettabytes in 2020. Estimates for 2025 put it at approximately 181 zettabytes, and projections for 2028 exceed 390 zettabytes. A single zettabyte is one trillion gigabytes. This growth is not linear; it is exponential, driven by IoT sensors, video streaming, mobile devices, and machine-generated logs.

This session introduces the fundamental processing paradigms and storage architectures that handle data at this scale.

OLTP vs. OLAP

Before diving into big data architectures, you need to understand the two fundamental database workload types. Every system leans toward one or the other.

| Dimension | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Purpose | Process individual transactions | Analyze aggregated data |
| Queries | Simple: INSERT, UPDATE, SELECT by primary key | Complex: JOIN across millions of rows, GROUP BY, aggregations |
| Data volume per query | A few rows | Millions to billions of rows |
| Latency requirement | Milliseconds | Seconds to minutes |
| Storage format | Row-oriented | Column-oriented |
| Examples | MySQL, PostgreSQL, DynamoDB | BigQuery, Redshift, Snowflake, ClickHouse |
| Optimization | Indexes on primary/foreign keys | Columnar compression, partitioning, materialized views |

OLTP answers "what happened in this transaction." OLAP answers "what happened across all transactions." They optimize for opposite access patterns, which is why forcing one to do the other's job always ends badly.
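The storage-format row in the table above is the crux of the split. A toy in-memory sketch (not a real database engine) shows why each layout favors its workload:

```python
# Toy illustration: the same records stored row-oriented vs.
# column-oriented, and the access pattern each layout favors.

rows = [  # row-oriented: each record is contiguous (OLTP-friendly)
    {"id": 1, "user": "ana", "amount": 30.0},
    {"id": 2, "user": "ben", "amount": 12.5},
    {"id": 3, "user": "ana", "amount": 7.5},
]

columns = {  # column-oriented: each attribute is contiguous (OLAP-friendly)
    "id": [1, 2, 3],
    "user": ["ana", "ben", "ana"],
    "amount": [30.0, 12.5, 7.5],
}

# OLTP-style query: fetch one record by primary key -- one lookup in the
# row store returns every field of the transaction together.
record = next(r for r in rows if r["id"] == 2)

# OLAP-style query: aggregate one attribute across all records -- the
# column store scans a single contiguous list and never touches "id"
# or "user", which is also why columnar data compresses so well.
total = sum(columns["amount"])

print(record["user"], total)  # ben 50.0
```

The record values here are invented for illustration; real engines add indexes, compression, and vectorized execution on top of exactly this layout difference.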

Batch vs. Stream Processing

Data processing at scale splits into two paradigms based on when the data is processed relative to when it arrives.

Batch processing collects data over a period (hours, a day), then processes it all at once. MapReduce and Apache Spark are the canonical tools. You run a job that reads a full dataset, transforms it, and writes results. The job might take minutes or hours. The output is complete and consistent, but it is always behind real time by at least the batch interval.

Stream processing processes data as it arrives, record by record or in micro-batches of milliseconds. Apache Flink, Kafka Streams, and Apache Spark Structured Streaming are the primary tools. Latency is seconds or less. But handling out-of-order events, late arrivals, and exactly-once semantics is significantly more complex.

| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (processes entire dataset) | High (but per-record overhead) |
| Data completeness | Full dataset available at processing time | Partial, must handle late arrivals |
| Complexity | Simpler (no state management) | Higher (windowing, watermarks, state) |
| Fault tolerance | Rerun the job | Checkpointing, exactly-once semantics |
| Tools | Spark, Hadoop MapReduce, dbt | Flink, Kafka Streams, Spark Streaming |
| Best for | Monthly reports, ETL, model training | Real-time dashboards, fraud detection, alerting |

Most production systems use both. The Lambda architecture runs a batch layer for accurate historical data and a speed layer for approximate real-time results. The Kappa architecture simplifies this by using a single stream processing layer for everything, replaying the event log when historical reprocessing is needed.
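The core difference can be sketched in a few lines of plain Python: a batch aggregation sees the whole dataset at once, while a stream aggregation assigns each event to a tumbling window as it arrives. The clickstream events below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical clickstream events: (epoch_seconds, page).
events = [(0, "/home"), (2, "/pricing"), (6, "/home"), (7, "/docs"), (11, "/home")]

# Batch view: the full dataset is available, so one complete and
# consistent aggregation is computed over all of it.
batch_counts = defaultdict(int)
for _, page in events:
    batch_counts[page] += 1

# Stream view: each event is assigned to a 5-second tumbling window as
# it arrives; each window is emitted independently and is always behind
# real time by at most the window length.
windows = defaultdict(lambda: defaultdict(int))
for ts, page in events:
    window_start = (ts // 5) * 5  # windows start at 0, 5, 10, ...
    windows[window_start][page] += 1

print(dict(batch_counts))
print({w: dict(c) for w, c in sorted(windows.items())})
```

What this sketch omits is exactly what makes real stream processors hard: out-of-order events, watermarks to decide when a window is "done", and state that survives restarts.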

Data Lake vs. Data Warehouse vs. Data Lakehouse

Where you store data at scale depends on its structure, who needs it, and how it will be queried.

A data lake is cheap storage for everything. Raw files in any format (JSON, CSV, Parquet, images, logs) dumped into object storage like S3 or Azure Data Lake Storage. Schema is applied when reading ("schema-on-read"), not when writing. This makes ingestion fast and flexible, but querying is slow without additional tooling. Data lakes tend to become "data swamps" without governance.

A data warehouse is structured storage optimized for analytical queries. Data is cleaned, transformed, and loaded into a predefined schema ("schema-on-write"). Columnar storage formats enable fast aggregations. Snowflake, BigQuery, and Redshift are the dominant players. Warehouses are fast for BI queries but expensive for raw storage and inflexible for unstructured data.

A data lakehouse combines both. It adds a transaction layer (Delta Lake, Apache Iceberg, Apache Hudi) on top of data lake storage. This gives you ACID transactions, schema enforcement, and query optimization on cheap object storage. You get warehouse-like performance without copying data into a separate warehouse system.
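The schema-on-read vs. schema-on-write distinction is easiest to see in code. A minimal sketch, with made-up records and field names, assuming newline-delimited JSON landing in the lake:

```python
import json

# Two raw records as they landed in the lake -- the second is missing a field.
raw = ['{"user": "ana", "amount": "30"}', '{"user": "ben"}']

# Schema-on-write (warehouse): validate and coerce at load time;
# a bad record is rejected before it ever reaches storage.
def load_into_warehouse(line):
    rec = json.loads(line)
    if "amount" not in rec:
        raise ValueError("rejected at write time: missing 'amount'")
    return {"user": rec["user"], "amount": float(rec["amount"])}

# Schema-on-read (lake): store the raw line as-is; apply a schema only
# when a query runs, deciding at that point how to handle missing fields.
def query_lake(lines):
    for line in lines:
        rec = json.loads(line)
        yield {"user": rec["user"], "amount": float(rec.get("amount", 0))}

print(list(query_lake(raw)))
```

The trade-off in miniature: the warehouse guarantees every stored row is clean, at the cost of rejecting data at ingestion; the lake accepts everything, at the cost of every query having to decide what malformed records mean.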

| Dimension | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage cost | Low (object storage) | High (proprietary format) | Low (object storage) |
| Data types | Structured, semi-structured, unstructured | Structured only | All types |
| Schema | Schema-on-read | Schema-on-write | Both (flexible) |
| ACID transactions | No | Yes | Yes |
| Query performance | Slow without optimization | Fast (columnar, indexed) | Fast (with table formats) |
| Governance | Weak ("data swamp" risk) | Strong | Strong |
| ML/AI support | Good (direct file access) | Poor (data must be exported) | Good (direct access + structure) |
| Examples | S3 + Athena, ADLS + Synapse | Snowflake, BigQuery, Redshift | Databricks, Delta Lake, Apache Iceberg |

Kafka as the Integration Backbone

Apache Kafka sits at the center of most modern data pipelines. It is a distributed event streaming platform that acts as a durable, high-throughput buffer between data producers and consumers. Kafka does not process data. It transports it reliably and lets multiple downstream systems consume the same events independently.

A Kafka topic is an ordered, append-only log of events. Producers write to topics. Consumers read from topics at their own pace. Because the log is persistent (stored on disk with configurable retention), a new consumer can start reading from the beginning. A real-time dashboard can read events as they arrive. A batch pipeline can read the same events hours later. Neither interferes with the other.

This decoupling is why Kafka is the backbone. It separates "when data is produced" from "when and how data is consumed," which lets teams build and evolve their processing pipelines independently.
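The abstraction underneath all of this is small enough to sketch in memory. The following is not a real Kafka client, just a toy model of its core idea: an append-only log that multiple consumer groups read independently, each tracking its own offset.

```python
# Toy model of a Kafka topic: an append-only log plus per-group offsets.
# (Real Kafka adds partitions, replication, and disk-backed retention.)

class TopicLog:
    def __init__(self):
        self._events = []   # the ordered, append-only log
        self._offsets = {}  # consumer group -> next offset to read

    def produce(self, event):
        self._events.append(event)  # append-only; past events never change

    def consume(self, group, max_events=10):
        start = self._offsets.get(group, 0)
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

topic = TopicLog()
for e in ["signup", "click", "purchase"]:
    topic.produce(e)

# A real-time consumer reads the events as they become available...
print(topic.consume("dashboard"))       # ['signup', 'click', 'purchase']

# ...while a consumer group added later replays from the beginning,
# without disturbing the dashboard's position in the log.
print(topic.consume("batch-pipeline"))  # ['signup', 'click', 'purchase']
```

Because each group owns its own offset, adding the second consumer required no change to the producer or to the first consumer, which is the decoupling property the rest of this section builds on.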

A Modern Data Pipeline

Putting these pieces together, here is what a modern data architecture looks like for a company that needs both real-time insights and historical analytics.

```mermaid
flowchart LR
    subgraph Sources
        App["Application<br/>Events"]
        DB["Database<br/>CDC"]
        IoT["IoT<br/>Sensors"]
        Logs["Server<br/>Logs"]
    end
    subgraph Ingestion
        Kafka["Apache Kafka<br/>(Event Backbone)"]
    end
    subgraph Processing
        Flink["Stream Processor<br/>(Flink / Kafka Streams)"]
        Spark["Batch Processor<br/>(Spark)"]
    end
    subgraph Storage
        Lake["Data Lakehouse<br/>(Delta Lake / Iceberg)"]
        WH["Data Warehouse<br/>(Snowflake / BigQuery)"]
    end
    subgraph Serving
        Dash["Real-Time<br/>Dashboards"]
        BI["BI Reports<br/>(Monthly)"]
        ML["ML Training<br/>Pipelines"]
    end
    App --> Kafka
    DB --> Kafka
    IoT --> Kafka
    Logs --> Kafka
    Kafka --> Flink
    Kafka --> Spark
    Flink --> Lake
    Flink --> Dash
    Spark --> Lake
    Lake --> WH
    Lake --> ML
    WH --> BI
    style App fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style DB fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style IoT fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Logs fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Kafka fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style Flink fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Spark fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Lake fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style WH fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Dash fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style BI fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style ML fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
```

Data flows left to right. All sources publish to Kafka. Stream processors consume from Kafka for real-time needs. Batch processors consume the same data (or read from the lakehouse) for historical analysis. The lakehouse serves as the central storage layer, feeding both the warehouse for BI and ML pipelines for model training.

The critical insight: Kafka decouples producers from consumers. Adding a new downstream system (a fraud detection engine, a recommendation model, a compliance audit) does not require changing any upstream source. You just add a new consumer group to the relevant Kafka topic.

```mermaid
flowchart TD
    subgraph OLTP["OLTP Layer"]
        PG["PostgreSQL<br/>(Transactions)"]
    end
    subgraph CDC["Change Data Capture"]
        Deb["Debezium"]
    end
    subgraph Stream["Stream Layer"]
        KF["Kafka"]
        FL["Flink"]
    end
    subgraph OLAP["OLAP Layer"]
        LH["Lakehouse<br/>(Iceberg)"]
        SF["Snowflake"]
    end
    PG --> Deb
    Deb --> KF
    KF --> FL
    FL --> LH
    LH --> SF
    style PG fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style Deb fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style KF fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style FL fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style LH fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style SF fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
```

This second diagram shows a common pattern: Change Data Capture (CDC) using Debezium to stream database changes into Kafka, which then feeds into the analytics layer. This lets you keep your OLTP database optimized for transactions while building your OLAP layer from the same data, without any direct coupling between the two systems.
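On the consuming side, a CDC pipeline is essentially a replay of change events against a downstream table. The sketch below follows the shape of Debezium's change-event envelope (`op`, `before`, `after`), but the payloads and the dict-based "table" are invented for illustration:

```python
# Hedged sketch: replaying Debezium-style change events into a
# downstream table, modeled here as a dict keyed by primary key.
# Debezium's op codes: "c" = create, "u" = update, "d" = delete.

change_events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "email": "ana@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "ana@example.com"},
     "after": {"id": 1, "email": "ana@new.example.com"}},
    {"op": "d", "before": {"id": 1, "email": "ana@new.example.com"},
     "after": None},
]

def apply_change(table, event):
    """Replay one CDC event into the downstream table."""
    if event["op"] in ("c", "u"):   # create / update: upsert the "after" image
        row = event["after"]
        table[row["id"]] = row
    elif event["op"] == "d":        # delete: remove by the "before" image's key
        table.pop(event["before"]["id"], None)
    return table

replica = {}
for ev in change_events:
    apply_change(replica, ev)

print(replica)  # {} -- the row was created, updated, then deleted
```

Replaying the full event stream always reconstructs the current state of the source table, which is what lets the OLAP layer stay in sync without ever querying the OLTP database directly.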

A data pipeline is not a single tool. It is a system of decoupled components, each optimized for one job: ingestion, transport, processing, storage, and serving. Kafka is the connective tissue. The lakehouse is the central nervous system.

Assignment

Your application generates 1 TB of clickstream data per day. The business needs two things: real-time dashboards showing active users and trending pages (updated every 5 seconds), and monthly reports analyzing user behavior patterns over 30-day windows.

  1. Sketch the pipeline. Identify each component: data source, ingestion layer, stream processor, batch processor, storage layers, and serving layers. Name specific technologies for each.
  2. Where does Kafka sit in your design? How many partitions would you estimate for a topic handling 1 TB/day of clickstream events? Show your math (assume average event size of 500 bytes).
  3. Would you use a data lake, a data warehouse, or a lakehouse for long-term storage? Justify your choice in terms of cost, query performance, and support for both real-time and batch workloads.
  4. Draw the OLTP/OLAP boundary. Which parts of your system are transactional? Which are analytical? Where does the handoff happen?