Course → Module 2: Scalability, Load Balancing & API Design

Three Models of Compute

When your application needs to scale, the first decision is not "how many servers?" but "what kind of compute unit?" The three dominant models today are virtual machines (VMs), containers, and serverless functions. Each makes a different tradeoff between control, speed, and operational burden.

Understanding these tradeoffs is not optional. Picking the wrong compute model for a workload can cost you 10x more than it should, or leave your system unable to respond to traffic spikes at all.

Virtual Machines

A virtual machine runs a full guest operating system on top of a hypervisor. The hypervisor (VMware ESXi, KVM, Hyper-V, or a cloud provider's custom solution) abstracts the physical hardware and allocates CPU, memory, and storage to each VM.

Because each VM boots a complete OS, provisioning is slow relative to the alternatives. A new EC2 instance on AWS typically takes 30 to 90 seconds to become reachable, and that is after the AMI is already built; building the image itself can take 10 to 20 minutes.

VM isolation model: Each VM has its own kernel, filesystem, and network stack. This provides strong security boundaries, since a compromised process inside one VM cannot access another VM's memory or filesystem. The cost is resource duplication: every VM runs its own copy of the OS, consuming RAM and disk that the application itself does not need.

VMs shine when you need full OS control, must run legacy software that assumes a dedicated machine, or require strong tenant isolation (as in multi-tenant SaaS where each customer gets their own VM). They are the slowest to scale but the most flexible in what they can run.

Containers

Containers share the host OS kernel. Instead of virtualizing hardware, they use kernel features (namespaces and cgroups on Linux) to isolate processes, filesystems, and network interfaces. A container image contains only the application and its dependencies, not an entire OS.

This makes containers dramatically lighter. A typical container image is 50 to 500 MB, compared to 2 to 20 GB for a VM image. Startup time drops from minutes to seconds, often under 5 seconds for a well-built image.
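The density difference follows directly from that per-unit overhead. As a rough sketch of how many workloads fit on one host (all numbers here are illustrative assumptions, not benchmarks):

```python
# Rough density comparison: how many 1 GB workloads fit on a 64 GB host?
# Overhead figures are illustrative assumptions, not measurements.

HOST_RAM_GB = 64
APP_RAM_GB = 1.0               # RAM the application itself needs
VM_OS_OVERHEAD_GB = 1.5        # guest OS per VM (assumed)
CONTAINER_OVERHEAD_GB = 0.05   # per-container runtime overhead (assumed)

vms_per_host = int(HOST_RAM_GB // (APP_RAM_GB + VM_OS_OVERHEAD_GB))
containers_per_host = int(HOST_RAM_GB // (APP_RAM_GB + CONTAINER_OVERHEAD_GB))

print(vms_per_host)         # 25
print(containers_per_host)  # 60
```

Under these assumptions the same host runs more than twice as many containers as VMs, purely because the guest OS overhead disappears.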

The tradeoff is weaker isolation. All containers on a host share the same kernel. A kernel vulnerability can potentially affect every container on that host. This matters less in environments where you control all the workloads, and more in multi-tenant platforms where untrusted code runs side by side.

Containers introduce orchestration complexity. Running one container is simple. Running hundreds across a cluster requires a scheduler like Kubernetes, which manages placement, networking, health checks, scaling, and rolling deployments. Kubernetes itself is a significant operational commitment. Most teams underestimate the learning curve and the ongoing maintenance cost.

Serverless Functions

Serverless (AWS Lambda, Google Cloud Functions, Azure Functions) takes the abstraction one step further. You deploy a function. The provider handles everything else: provisioning, scaling, patching, and deprovisioning. You pay only for the time your function executes, measured in milliseconds.
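Concretely, "deploying a function" often means shipping little more than a handler. A minimal Lambda-style handler in Python might look like the sketch below; the function name and the API Gateway proxy event shape are deployment-time conventions assumed here, not fixed requirements:

```python
import json

def handler(event, context):
    """Minimal AWS Lambda-style handler (API Gateway proxy format assumed).

    The handler name and event shape are configured at deploy time;
    this sketch assumes an HTTP trigger with query-string parameters.
    """
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

Everything below this function, from the HTTP listener to the process lifecycle, belongs to the provider.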

The zero-idle-cost model is the key advantage. If your function receives zero requests, you pay zero. This makes serverless ideal for workloads with unpredictable or spiky traffic patterns.
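A back-of-envelope comparison makes the break-even visible. The prices below are illustrative assumptions modeled on typical per-GB-second and per-request rates; check your provider's current pricing page before relying on them:

```python
# Back-of-envelope: monthly cost of a function vs a small always-on instance.
# All prices are illustrative assumptions.

GB_SECOND_PRICE = 0.0000166667    # per GB-second of execution (assumed)
REQUEST_PRICE = 0.20 / 1_000_000  # per invocation (assumed)
INSTANCE_HOURLY = 0.04            # small always-on instance (assumed)

def serverless_monthly(invocations, avg_ms, memory_gb=0.5):
    """Cost = GB-seconds consumed + per-request fee."""
    gb_seconds = invocations * (avg_ms / 1000) * memory_gb
    return gb_seconds * GB_SECOND_PRICE + invocations * REQUEST_PRICE

def instance_monthly(hours=730):
    """An always-on instance bills every hour, busy or idle."""
    return INSTANCE_HOURLY * hours

# 100k invocations/month at 200 ms each: cents vs tens of dollars.
print(round(serverless_monthly(100_000, 200), 2))
print(round(instance_monthly(), 2))
```

At low or bursty volume the function costs cents while the instance bills for every idle hour; at sustained high volume the comparison flips, which is why steady-state APIs often stay on containers or VMs.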

The primary penalty is cold starts. When a function has not been invoked recently, the provider must allocate a fresh execution environment (typically a lightweight micro-VM), load the runtime, load your code, and initialize dependencies before handling the request. This can add anywhere from about 100ms to several seconds of latency, depending on the runtime and package size. Java and .NET functions suffer worse cold starts than Python or Node.js because their runtimes have heavier initialization overhead.
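How much cold starts hurt your average latency depends on what fraction of requests land on a cold environment. A simple model, with assumed numbers:

```python
def expected_latency_ms(warm_ms, cold_penalty_ms, cold_fraction):
    """Average request latency when a fraction of requests pay a cold start.

    warm_ms:         latency of a warm invocation (assumed)
    cold_penalty_ms: extra latency added by a cold start (assumed)
    cold_fraction:   share of requests that hit a cold environment (assumed)
    """
    return warm_ms + cold_fraction * cold_penalty_ms

# 20 ms warm path, 800 ms cold-start penalty, 1% of requests cold:
print(expected_latency_ms(20, 800, 0.01))  # 28.0
```

Note that even when the average barely moves, a 1% cold fraction means cold starts dominate your p99, which is exactly where latency SLAs are usually written.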

Vendor lock-in: Serverless functions are tightly coupled to the provider's event system, IAM model, and runtime environment. Moving a Lambda function to Google Cloud Functions is not a redeployment. It is a rewrite. This coupling extends to surrounding services: API Gateway, Step Functions, EventBridge, and other provider-specific glue that your function depends on.

The Compute Spectrum

These three models form a spectrum from maximum control to maximum abstraction. As you move from VMs to serverless, you give up control and gain operational simplicity and scaling speed.

```mermaid
graph LR
    VM["Virtual Machines<br/>Full OS, minutes to start<br/>Maximum control"] --> Container["Containers<br/>Shared kernel, seconds to start<br/>Balanced control"]
    Container --> Serverless["Serverless Functions<br/>Managed runtime, ms to scale<br/>Minimum control"]
    style VM fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Container fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Serverless fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
```

The spectrum is not a ranking. "More abstract" does not mean "better." It means different tradeoffs. The right choice depends on the workload.

Comparison Table

| Dimension | Virtual Machines | Containers | Serverless |
|---|---|---|---|
| Startup time | 30 s to minutes | 1 to 10 seconds | 100 ms to seconds (cold start) |
| Cost model | Pay per hour/reserved, always running | Pay for cluster capacity, always running | Pay per invocation and duration |
| Scaling speed | Slow (minutes via autoscaling groups) | Medium (seconds via HPA/KEDA) | Fast (automatic, per-request) |
| Isolation | Strong (separate kernels) | Moderate (shared kernel, namespaces) | Strong (provider-managed micro-VMs) |
| Ops complexity | High (OS patching, config management) | High (Kubernetes, networking, observability) | Low (provider manages infrastructure) |
| Vendor lock-in | Low (standard OS images are portable) | Low to medium (OCI images are portable, orchestration may not be) | High (runtime, triggers, IAM all provider-specific) |
| Best for | Legacy apps, strong isolation needs, long-running processes | Microservices, CI/CD pipelines, steady-state APIs | Event-driven tasks, variable traffic, glue code |

Scaling Behavior in Practice

The difference in scaling behavior becomes clear under sudden load. Suppose your application receives 10x normal traffic for 15 minutes.

With VMs, your autoscaling group detects high CPU, launches new instances, waits for health checks, and registers them with the load balancer. By the time the new capacity is online (3 to 5 minutes), the spike may already be subsiding. You overshoot on the way up and undershoot on the way down.
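That lag can be sketched with a toy minute-by-minute simulation. Every number here is an assumption chosen for illustration, not a provider benchmark:

```python
# Toy simulation: a 10x traffic spike hitting slow VM autoscaling.
# All numbers are illustrative assumptions.

BASELINE_RPS = 100
SPIKE_RPS = 1000          # 10x traffic for the whole 15-minute window
VM_CAPACITY_RPS = 50      # requests/second one VM can serve (assumed)
PROVISION_DELAY_MIN = 4   # minutes until a new VM passes health checks (assumed)

vms = BASELINE_RPS // VM_CAPACITY_RPS  # steady-state fleet: 2 VMs
pending = []                           # (ready_at_minute, vm_count)
unmet = 0                              # unserved demand, in rps summed per minute

for minute in range(15):
    # VMs requested earlier come online once the provisioning delay elapses.
    vms += sum(n for ready, n in pending if ready == minute)
    pending = [(r, n) for r, n in pending if r > minute]
    capacity = vms * VM_CAPACITY_RPS
    if capacity < SPIKE_RPS:
        unmet += SPIKE_RPS - capacity
        if not pending:  # request just enough VMs, once
            needed = -(-(SPIKE_RPS - capacity) // VM_CAPACITY_RPS)  # ceil div
            pending.append((minute + PROVISION_DELAY_MIN, needed))

print(vms, unmet)  # fleet size after scaling, demand missed while waiting
```

Under these assumptions the fleet eventually reaches the right size, but roughly the first four minutes of the spike go unserved, which is the gap a faster compute model closes.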

With containers on Kubernetes, the Horizontal Pod Autoscaler detects high utilization and schedules new pods. If nodes have spare capacity, pods start in seconds. If the cluster needs new nodes, you are back to VM-speed scaling for those nodes. This is why teams pre-provision extra node capacity or use tools like Karpenter to speed up node provisioning.
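The HPA's scaling rule itself is simple: desired replicas are the current replicas scaled by the ratio of observed metric to target metric, rounded up. A sketch of that documented formula:

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    """Kubernetes HPA rule: desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 pods averaging 90% CPU against a 60% target -> scale out to 6 pods.
print(hpa_desired_replicas(4, 90, 60))  # 6
```

The formula is fast to evaluate; the slow part, as noted above, is finding a node for the new pods when the cluster is full.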

With serverless, each incoming request can spawn a new execution environment if needed. Scaling is nearly instant and precisely matched to demand. But at scale, you may hit concurrency limits (AWS Lambda defaults to 1,000 concurrent executions per region), and you must explicitly request increases.
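Little's law gives a quick estimate of how much concurrency a sustained load consumes: concurrent executions ≈ arrival rate × average duration. For example:

```python
import math

def required_concurrency(rps, avg_duration_s):
    """Little's law: concurrent executions ~ arrival rate x service time."""
    return math.ceil(rps * avg_duration_s)

# 2,000 req/s at 600 ms each needs ~1,200 concurrent executions --
# above a default 1,000 per-region limit, so a quota increase is required.
print(required_concurrency(2000, 0.6))  # 1200
```

Running this estimate against your peak traffic before launch tells you whether the default limit is a non-issue or a production incident waiting to happen.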

The Hybrid Reality

Most production systems use more than one model. A common pattern: containers for the core API (steady traffic, predictable load), serverless for event-driven workers (image processing on upload, webhook handling), and VMs for stateful workloads (databases, legacy systems).

The goal is not to pick one model for everything. It is to match each workload to the compute model whose tradeoffs best fit that workload's characteristics.

Assignment

You are designing infrastructure for three workloads. For each, choose the compute model (VM, container, or serverless) and justify your choice in 2 to 3 sentences. Consider startup time, cost efficiency, scaling needs, and operational constraints.

  1. Always-on REST API with steady traffic. The API serves 500 requests per second consistently, 24/7. Latency must stay under 50ms at p99.
  2. Image thumbnail generation triggered on upload. Users upload between 10 and 10,000 images per hour depending on time of day. Each thumbnail takes 2 to 5 seconds to generate.
  3. ML model serving for product recommendations. The model is 2 GB in memory, requires GPU access, and serves predictions with a 200ms SLA.

For each choice, also identify the main risk of your chosen model and describe one mitigation strategy.