Performance Optimization
Session 9.4 · ~5 min read
Measure Before You Cut
The most common performance optimization mistake is optimizing the wrong thing. A developer suspects the database is slow, rewrites queries for three days, and discovers the actual bottleneck was a synchronous HTTP call to a third-party API. Another team adds caching everywhere, increasing memory costs by 40%, when the real problem was missing database indexes.
Performance optimization is a discipline. It has an order of operations. Skip a step and you waste time. Follow the order and the problem usually reveals itself quickly.
The first rule of optimization: measure. The second rule: measure again. The third rule: are you sure you measured the right thing?
The Debugging Methodology
Performance debugging follows a systematic flow. Start broad, narrow down, validate, then fix. The following diagram shows the process.
```mermaid
flowchart TD
    A[Performance problem reported] --> B[Measure current behavior]
    B --> C{Bottleneck identified?}
    C -->|No| D[Add instrumentation<br/>Traces, metrics, logs]
    D --> B
    C -->|Yes| E[Classify bottleneck]
    E --> F{CPU bound?}
    E --> G{I/O bound?}
    E --> H{Memory bound?}
    E --> I{Network bound?}
    F -->|Yes| F1[Profile code<br/>Optimize algorithms]
    G -->|Yes| G1[Async I/O<br/>Connection pooling<br/>Batch operations]
    H -->|Yes| H1[Reduce allocations<br/>Fix leaks<br/>Right-size caches]
    I -->|Yes| I1[Reduce round trips<br/>Compress payloads<br/>Use CDN]
    F1 --> J[Validate: Did it improve?]
    G1 --> J
    H1 --> J
    I1 --> J
    J --> K{Target met?}
    K -->|No| B
    K -->|Yes| L[Document and monitor]
    style A fill:#c8a882,stroke:#111110,color:#111110
    style B fill:#6b8f71,stroke:#111110,color:#111110
    style E fill:#c47a5a,stroke:#111110,color:#111110
    style J fill:#8a8478,stroke:#111110,color:#ede9e3
    style L fill:#ede9e3,stroke:#111110,color:#111110
```
Profile Before Optimizing
Profiling tells you where time is actually spent. Without it, you are guessing. Application profilers (pprof for Go, cProfile for Python, async-profiler for Java) show which functions consume the most CPU time. Distributed tracing tools (Jaeger, Zipkin, Datadog APM) show where time is spent across service boundaries. Database query analyzers (EXPLAIN in PostgreSQL, Query Analyzer in MySQL) show how the database executes your queries.
A flame graph is one of the most useful profiling visualizations. The x-axis shows merged call stacks (wider means more samples, hence more time) and the y-axis shows stack depth, with the root at the bottom. A wide frame at the top of a stack is a leaf function directly consuming CPU time. A wide frame lower in the stack means its entire subtree of callees is expensive, and that lower-level entry point is the real culprit.
Profile in production or production-like conditions. Performance characteristics change under load. A query that runs in 5ms with 10 concurrent users might take 500ms with 1,000 concurrent users due to lock contention.
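As a concrete starting point, Python's built-in cProfile (mentioned above) produces the raw per-function timings behind a flame graph. The sketch below uses invented stand-in functions (`parse_row`, `load_report`) for a hot path; real code would dump the stats to a file and render them with a flame-graph tool:

```python
import cProfile
import io
import pstats


def parse_row(row):
    # Invented hot path: string splitting dominates this toy workload.
    return [field.strip() for field in row.split(",")]


def load_report(rows):
    return [parse_row(r) for r in rows]


profiler = cProfile.Profile()
profiler.enable()
load_report(["a, b, c"] * 50_000)
profiler.disable()

# Top functions by cumulative time; the widest "frames" show up first.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The `cumulative` sort answers "which call tree is expensive?"; sorting by `tottime` instead answers "which function body is expensive?", which is the leaf-frame view.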
Connection Pooling
Opening a database connection is expensive. The TCP handshake, TLS negotiation (if encrypted), and authentication exchange can take 20 to 100 milliseconds. If every request opens a new connection and closes it when done, those milliseconds add up fast.
Connection pooling maintains a set of pre-established connections that are reused across requests. A request borrows a connection from the pool, uses it, and returns it. No handshake, no negotiation. The connection is already open.
One documented case study reports average response time dropping from 150ms to 12ms after switching from on-demand connections to a properly configured pool. Database CPU usage fell from 80% to 15% in the same study because the server no longer spent most of its time establishing and tearing down connections.
Key pool configuration parameters: minimum and maximum pool size, connection timeout (how long to wait for a connection from the pool), idle timeout (how long unused connections stay alive), and max lifetime (to force recycling of stale connections). Tools like PgBouncer (PostgreSQL), ProxySQL (MySQL), and built-in pool implementations in ORMs (SQLAlchemy, HikariCP) handle this.
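The borrow-and-return pattern can be sketched in a few lines with the standard library, using `queue.Queue` for the pool and sqlite3 as a stand-in for a real database server. This is an illustration of the mechanism only; in production, use PgBouncer, HikariCP, or your ORM's built-in pool rather than hand-rolling one:

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Minimal fixed-size pool: pre-open connections, borrow and return."""

    def __init__(self, size, timeout=5.0):
        self._timeout = timeout  # max wait for a free connection
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # The expensive setup (handshake, auth on a real server)
            # happens once, up front, not per request.
            self._pool.put(sqlite3.connect(":memory:", check_same_thread=False))

    @contextmanager
    def connection(self):
        # Block up to `timeout` seconds if every connection is borrowed.
        conn = self._pool.get(timeout=self._timeout)
        try:
            yield conn
        finally:
            self._pool.put(conn)  # return to the pool instead of closing


pool = ConnectionPool(size=4)
with pool.connection() as conn:
    print(conn.execute("SELECT 1").fetchone())
```

The `timeout` here corresponds to the connection-timeout parameter above; a real pool would also track idle time and max lifetime to recycle stale connections.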
Async I/O
Synchronous I/O blocks a thread while waiting for a response. If your web server has 10 threads and each request makes a 200ms database call synchronously, you can handle at most 50 requests per second. The threads are not computing. They are waiting.
Asynchronous I/O releases the thread while waiting. The thread handles other requests until the I/O completes and the original request is resumed. The same 10 threads can now handle hundreds of concurrent requests because they never sit blocked waiting on I/O.
In Python, this means using asyncio and async database drivers (asyncpg, aiomysql). In Java, it means reactive frameworks (Spring WebFlux, Project Reactor) or virtual threads (Java 21+). In Node.js, the event loop is async by default, but blocking the loop with synchronous operations (synchronous file reads, CPU-heavy computations) defeats the model.
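The effect is easy to demonstrate with asyncio alone, using `asyncio.sleep` as a stand-in for a 200ms database or API call. Ten "requests" waiting concurrently finish in roughly one I/O latency, not ten stacked end to end:

```python
import asyncio
import time


async def handle_request(i):
    # Stand-in for a 200 ms database or external API call.
    await asyncio.sleep(0.2)
    return i


async def main():
    # All ten waits overlap instead of running sequentially.
    return await asyncio.gather(*(handle_request(i) for i in range(10)))


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(f"{len(results)} requests in {elapsed:.2f}s")
```

Run synchronously, the same ten calls would take about 2 seconds; overlapped, the wall time stays close to 0.2 seconds.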
Async I/O matters most when your service is I/O-bound: waiting on databases, external APIs, file systems, or message queues. If your service is CPU-bound (image processing, cryptographic operations), async I/O does not help. You need more CPU or more efficient algorithms.
Database Query Plans and Index Coverage
Every SQL database has a query planner. When you submit a query, the planner decides how to execute it: which indexes to use, in what order to join tables, whether to scan sequentially or seek by index. The EXPLAIN command shows the plan.
The most important thing to look for in a query plan is sequential scans on large tables. A sequential scan reads every row. On a 100-million-row table, that can take minutes. An index seek reads only the rows that match, which might be microseconds.
Index coverage means the index contains all columns needed to answer the query without touching the table itself (a "covering index" or "index-only scan"). This eliminates the random I/O of going from the index back to the table for each matching row.
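SQLite, which ships with Python's standard library, supports `EXPLAIN QUERY PLAN` and makes the scan-versus-seek difference easy to see. The `orders` table and index name below are invented for illustration; the same check applies to `EXPLAIN` in PostgreSQL or MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

query = "SELECT total FROM orders WHERE customer_id = 42"


def plan(sql):
    # The fourth column of each plan row is the human-readable detail.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


print(plan(query))  # contains "SCAN": every row is read

# Composite index on (customer_id, total) covers the query entirely.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id, total)")
print(plan(query))  # contains "USING COVERING INDEX": index-only scan
```

Because the index includes both the filter column and the selected column, the second plan never touches the table at all, which is exactly the covering-index behavior described above.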
The N+1 Query Problem
The N+1 problem is one of the most common performance killers in applications that use ORMs. It works like this: you load a list of 100 orders (1 query). For each order, you load the customer (100 queries). Total: 101 queries. With 1,000 orders: 1,001 queries. The database round-trip overhead multiplied by N dominates response time.
The fix is eager loading or batch loading. Instead of loading each customer individually, load all customers for all orders in a single query using a JOIN or an IN clause. One query replaces N queries.
Most ORMs support eager loading explicitly. In SQLAlchemy, use joinedload() or subqueryload(). In Django, use select_related() or prefetch_related(). In ActiveRecord, use includes(). The ORM generates the efficient query. But you must ask for it. The default behavior in most ORMs is lazy loading, which produces N+1 queries silently.
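The raw SQL underneath those ORM calls looks like this sketch, which uses sqlite3 and an invented customers/orders schema to count round trips both ways:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer-{i}") for i in range(100)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i % 100) for i in range(100)])

# N+1: one query for the orders, then one query per order for its customer.
orders = conn.execute("SELECT id, customer_id FROM orders").fetchall()
n_plus_1_queries = 1
for _, cust_id in orders:
    conn.execute("SELECT name FROM customers WHERE id = ?", (cust_id,)).fetchone()
    n_plus_1_queries += 1
print(n_plus_1_queries, "round trips the lazy way")

# Batch fix: fetch every needed customer in a single IN query (2 queries total).
ids = {cust_id for _, cust_id in orders}
placeholders = ",".join("?" * len(ids))
names = dict(conn.execute(
    f"SELECT id, name FROM customers WHERE id IN ({placeholders})", tuple(ids)))
print(len(names), "customers loaded in one extra query")
```

`joinedload()`, `select_related()`, and `includes()` generate the equivalent of that second form (as a JOIN or an IN clause) so the round-trip count stays constant as N grows.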
Common Bottlenecks Reference
| Bottleneck | Detection Method | Fix |
|---|---|---|
| Missing database index | EXPLAIN shows sequential scan on large table | Add targeted index; consider composite index |
| N+1 queries | Query count per request is unexpectedly high | Eager loading or batch queries |
| No connection pooling | High connection setup time in traces; DB max_connections pressure | Implement connection pool (PgBouncer, HikariCP) |
| Synchronous external API calls | Trace shows long wait time on HTTP calls | Async HTTP client; circuit breaker; timeout |
| Unbounded result sets | Memory spikes; large payload sizes in logs | Pagination; LIMIT clauses; streaming responses |
| Lock contention | Thread dumps show threads waiting on same lock | Reduce critical section; use read/write locks; optimistic concurrency |
| Uncompressed payloads | Network transfer time dominates in traces | Enable gzip/brotli compression; reduce payload size |
| Cold cache | First requests after deploy are slow; cache hit rate near 0% | Cache warming; graceful rollout; pre-populate on deploy |
| Memory leaks | RSS grows over time; GC pauses increase | Heap dump analysis; fix unclosed resources; review object lifecycles |
The Optimization Loop
Performance optimization is iterative. You measure, identify the top bottleneck, fix it, and measure again. The new top bottleneck is often something that was invisible before because the first bottleneck dominated. This continues until you meet your performance target.
A critical mistake is fixing multiple things at once. If you change three things and performance improves, you do not know which change helped. If performance degrades, you do not know which change hurt. Change one thing, measure, validate, then move to the next.
Document every optimization with the before and after measurements. This creates an institutional record of what worked and prevents future developers from reverting optimizations they do not understand.
Further Reading
- Improve Database Performance with Connection Pooling (Stack Overflow Blog)
- Connection Pooling (SQLAlchemy Documentation)
- Use The Index, Luke (Markus Winand)
- Flame Graphs (Brendan Gregg)
Assignment
Your API has a P99 latency of 3 seconds. The target is 500ms. You have access to application logs, database query logs, and distributed tracing. Describe your step-by-step debugging approach:
- What do you measure first? What tools do you use? What are you looking for?
- You discover that 60% of response time is spent in database queries. What do you check next?
- The database queries are fast individually (5ms each), but there are 200 of them per request. What is the problem? How do you fix it?
- After fixing the query problem, P99 drops to 800ms. Still above target. The remaining time is split between a synchronous call to a payment API (300ms) and response serialization (200ms). How do you address each?
- What monitoring do you put in place to prevent regression?
For each step, state the specific metric you would check, the tool you would use, and the expected outcome of your fix.