Autoscaling
Session 2.5 · ~5 min read
Why Autoscaling Exists
In Session 2.1, we established that horizontal scaling adds more machines to handle more load. But who decides when to add machines, and when to remove them? Manual scaling requires an engineer to watch dashboards and click buttons. That does not work at 3 AM on a Saturday.
Autoscaling automates this decision. It monitors metrics, compares them to thresholds, and adjusts capacity without human intervention. Done well, it keeps costs low during quiet hours and keeps users happy during spikes. Done poorly, it oscillates wildly, scales too late, or never scales down.
Autoscaling is a control system that adjusts compute capacity in response to observed or predicted demand. It is a feedback loop: measure load, compare to target, adjust capacity, re-measure.
Reactive Autoscaling
Reactive autoscaling responds to what has already happened. A metric crosses a threshold, and the system adds or removes instances. There are two common approaches.
Target Tracking
You specify a target value for a metric. The autoscaler continuously adjusts capacity to keep that metric near the target. If you set "average CPU utilization = 50%," the system adds instances when CPU rises above 50% and removes them when it drops below.
This works like a thermostat. You set the temperature you want. The system figures out when to turn the heater on and off. You do not specify the exact conditions for action. You specify the desired outcome.
AWS Auto Scaling supports target tracking on predefined metrics like ASGAverageCPUUtilization, ALBRequestCountPerTarget, and custom CloudWatch metrics.
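The thermostat idea can be sketched as a tiny control rule. This is a minimal, hypothetical sketch, not the actual AWS algorithm: it assumes the metric (average CPU) scales inversely with instance count, so the desired capacity is whatever makes the per-instance metric land near the target.

```python
import math

def target_tracking_capacity(current_capacity: int, metric: float, target: float,
                             min_cap: int = 1, max_cap: int = 100) -> int:
    """Desired capacity so the per-instance metric lands near the target.

    Simplifying assumption: halving the metric requires doubling capacity.
    Real autoscalers add damping, warm-up handling, and scale-in caution.
    """
    desired = math.ceil(current_capacity * metric / target)
    return max(min_cap, min(max_cap, desired))

# 10 instances at 75% average CPU, targeting 50%: scale out to 15
print(target_tracking_capacity(10, 75.0, 50.0))  # 15
# 10 instances at 30%: scale in to 6
print(target_tracking_capacity(10, 30.0, 50.0))  # 6
```

Note that you never specify "add 5 instances at 75% CPU" anywhere; the rule derives the action from the gap between measurement and target, which is exactly the thermostat behavior described above.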
Step Scaling
Step scaling gives you more control. You define CloudWatch alarms with thresholds, and for each alarm, you specify how many instances to add or remove. The "steps" refer to graduated responses based on how far the metric has breached the threshold.
For example:
- CPU 50-70%: add 1 instance
- CPU 70-90%: add 3 instances
- CPU above 90%: add 5 instances
Step scaling is more predictable than target tracking when you understand your traffic patterns well enough to define appropriate thresholds. But it requires more configuration and more tuning.
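The graduated responses above amount to a threshold table lookup. A minimal sketch (in a real deployment these steps would be CloudWatch alarms plus step adjustments, not application code):

```python
def step_scaling_adjustment(cpu: float) -> int:
    """Instances to add for a given CPU reading, mirroring the example steps.

    Steps are checked from the most severe breach downward; the first
    matching lower bound wins.
    """
    steps = [(90.0, 5), (70.0, 3), (50.0, 1)]  # (lower bound %, instances to add)
    for bound, add in steps:
        if cpu >= bound:
            return add
    return 0  # below 50%: no scale-out needed

print(step_scaling_adjustment(85.0))  # 3: in the 70-90% band
print(step_scaling_adjustment(95.0))  # 5: severe breach
```

The contrast with target tracking is visible in the shape of the code: every threshold and response size is an explicit decision you made, which is what makes step scaling both more predictable and more work to tune.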
Predictive Autoscaling
Reactive autoscaling has a fundamental problem: it responds after load arrives. If your application takes 3 minutes to boot a new instance, that is 3 minutes of degraded service every time a spike hits.
Predictive autoscaling analyzes historical traffic patterns to forecast future load. It pre-provisions capacity before demand arrives. AWS Predictive Scaling, for example, analyzes up to 14 days of historical CloudWatch data to identify recurring patterns. It generates a 48-hour capacity forecast, updated every 6 hours.
The key requirement: your traffic must have recurring, predictable patterns. A news site with unpredictable viral spikes will not benefit much. An enterprise SaaS product where usage peaks at 9 AM on weekdays and drops at 6 PM is a perfect candidate.
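To make "analyzes historical patterns" concrete, here is a deliberately crude forecaster: average the observed load for each hour of day across past days. It is nowhere near the ML model AWS uses, but it captures the core idea that recurring patterns in past data predict tomorrow's demand.

```python
from collections import defaultdict
from statistics import mean

def hourly_forecast(history: list[tuple[int, float]]) -> dict[int, float]:
    """Forecast load per hour-of-day by averaging past observations.

    `history` holds (hour_of_day, observed_load) pairs. With recurring
    traffic, the per-hour average is a usable forecast; with viral,
    one-off spikes, it is not -- the key requirement noted above.
    """
    by_hour: dict[int, list[float]] = defaultdict(list)
    for hour, load in history:
        by_hour[hour].append(load)
    return {hour: mean(loads) for hour, loads in by_hour.items()}

# Two days of a 9 AM spike
history = [(8, 100), (9, 500), (10, 450), (8, 120), (9, 520), (10, 430)]
forecast = hourly_forecast(history)
print(forecast[9])  # 510.0 -- capacity can be provisioned before 9 AM
```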
Reactive vs. Predictive Autoscaling
| Dimension | Reactive | Predictive |
|---|---|---|
| Trigger | Current metric breaches threshold | Forecasted metric will breach threshold |
| Response time | Minutes (metric detection + instance boot) | Proactive (capacity ready before spike) |
| Best for | Unpredictable, bursty traffic | Recurring, periodic traffic patterns |
| Configuration | Thresholds and step sizes | Minimal (ML model trains itself) |
| Risk | Slow response to sudden spikes | Wasted capacity if patterns shift |
| Cost efficiency | Pay only when needed (with lag) | May over-provision slightly |
| Can combine? | Yes, as the surge layer for unexpected spikes | Yes, as the baseline layer for known patterns |
In practice, the best approach is to layer both. Predictive scaling handles the known daily pattern. Reactive scaling catches anything the forecast missed.
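The layering rule is simple: whichever layer asks for more capacity wins. (AWS documents this max-of-both behavior when predictive and dynamic scaling policies are attached to the same group.) A sketch with a hypothetical pre-provisioned baseline:

```python
def hybrid_capacity(forecast: dict[int, int], hour: int, reactive_demand: int) -> int:
    """Layered scaling: take the larger of the predictive baseline and
    the capacity the reactive policy currently wants."""
    return max(forecast.get(hour, 0), reactive_demand)

baseline = {8: 4, 9: 20, 10: 18}          # pre-provisioned for the known 9 AM spike
print(hybrid_capacity(baseline, 9, 6))    # 20: the forecast covers the known spike
print(hybrid_capacity(baseline, 14, 12))  # 12: reactive catches a surprise surge
```

Each layer degrades gracefully: if the forecast is wrong, reactive scaling still responds (with its usual lag); if reactive scaling is slow, the forecast has already provisioned for the daily pattern.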
Warm-Up Delays: Feedback Loop Delays in Disguise
In Module 0, we learned that delays in feedback loops cause oscillation. Autoscaling is a textbook example.
When the autoscaler launches a new instance, that instance is not immediately useful. It needs to boot the OS, start the application, load configuration, populate caches, and establish database connections. Google's SRE team notes that some of their services are dramatically less efficient when caches are cold, because the majority of requests are normally served from cache.
During this warm-up period, the new instance either handles no traffic or handles it poorly. The autoscaler's metric (say, average CPU) may still look high because the new instance is not yet contributing. So it launches another instance. And another.
This is classic overshoot caused by delay in a feedback loop. The action (launching instances) has been taken, but the effect (reduced load) has not appeared yet. The system keeps acting.
AWS addresses this with an estimated instance warm-up time setting. During this period, the autoscaler does not count the new instance's metrics toward the aggregate. This prevents the overshoot.
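The exclusion mechanism can be shown in a few lines. This is a sketch in the spirit of the warm-up setting, not AWS's actual aggregation code: instances launched less than `warmup_s` ago are simply left out of the average.

```python
def aggregate_cpu(instances: list[dict], now: float, warmup_s: float = 180.0) -> float:
    """Average CPU across the group, excluding instances still warming up.

    Each instance dict has 'cpu' (percent) and 'launched_at' (seconds).
    Ignoring cold instances keeps the aggregate honest, so the
    autoscaler does not keep launching because new capacity reads idle
    -- or keep reading the group as overloaded while help is booting.
    """
    warm = [i["cpu"] for i in instances if now - i["launched_at"] >= warmup_s]
    return sum(warm) / len(warm) if warm else 0.0

fleet = [
    {"cpu": 90.0, "launched_at": 0.0},    # long-running, hot
    {"cpu": 95.0, "launched_at": 0.0},
    {"cpu": 5.0,  "launched_at": 900.0},  # just launched, caches cold
]
print(aggregate_cpu(fleet, now=1000.0))   # 92.5 -- the cold instance is excluded
```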
Cooldown Periods
Cooldown is the mirror of warm-up. After a scaling action completes, the autoscaler pauses before evaluating metrics again. This prevents rapid oscillation: scale up, metric drops, scale down, metric rises, scale up again.
Without cooldown, you get flapping. The system adds and removes instances in rapid cycles, never reaching a stable state. Each cycle wastes time and money on instance launches that get terminated before they are useful.
The default cooldown in AWS is 300 seconds (5 minutes). Target tracking policies manage cooldown automatically, which is one reason they are often preferred over step scaling for most workloads.
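A cooldown is just a time-based gate in front of the scaling decision. A minimal sketch, assuming the 300-second default mentioned above:

```python
class CooldownGate:
    """Blocks scaling actions for `cooldown_s` after the last one,
    preventing the scale-up / scale-down flapping described above."""

    def __init__(self, cooldown_s: float = 300.0):
        self.cooldown_s = cooldown_s
        self.last_action: float | None = None

    def try_scale(self, now: float) -> bool:
        """Return True and record the action, or False if still cooling down."""
        if self.last_action is not None and now - self.last_action < self.cooldown_s:
            return False  # skip this evaluation cycle
        self.last_action = now
        return True

gate = CooldownGate(cooldown_s=300.0)
print(gate.try_scale(0.0))    # True: first action allowed
print(gate.try_scale(120.0))  # False: inside the 5-minute cooldown
print(gate.try_scale(360.0))  # True: cooldown has elapsed
```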
Warm-up prevents overshoot by excluding new instances from metric calculations until they are ready. Cooldown prevents oscillation by pausing scaling decisions after each action. Both are mechanisms to handle delay in the autoscaling feedback loop.
Autoscaling Is a Balancing Loop
Recall from Session 0.5 that balancing feedback loops seek equilibrium. Autoscaling is precisely this: a balancing loop that tries to keep a metric (CPU, request count, latency) near a target value. The structure is: gap between actual and desired state triggers corrective action, which closes the gap.
Understanding autoscaling as a feedback loop, rather than just a cloud feature, lets you reason about its behavior. If your system oscillates, look for delays. If it overshoots, increase warm-up time. If it never stabilizes, check whether your metric actually reflects the bottleneck you care about.
Further Reading
- AWS, Target Tracking Scaling Policies for Amazon EC2 Auto Scaling. Official documentation with configuration details and best practices.
- AWS, How Predictive Scaling Works. Explains the ML model, forecast generation, and when predictive scaling is a good fit.
- Google, Addressing Cascading Failures, Site Reliability Engineering. Discusses warm-up, cache cold starts, and gradual load ramping after scaling events.
- AWS, Step and Simple Scaling Policies. Detailed guide on step adjustments, breach sizes, and cooldown configuration.
Assignment
Your application experiences a 5x traffic spike every day at 9:00 AM. Your reactive autoscaler takes 3 minutes to detect the spike, launch new instances, and complete warm-up. During those 3 minutes, users see errors and degraded performance. This happens every single day.
- Identify the type of feedback loop delay causing the problem. Is it detection delay, provisioning delay, warm-up delay, or a combination?
- Propose a solution using predictive autoscaling. What data would the predictive model need? How far in advance should it pre-provision?
- Propose a hybrid solution that combines predictive and reactive scaling. What does each layer handle?
- Could you solve this without predictive scaling at all? Consider scheduled scaling actions. What are the tradeoffs?