Generative AI System Design
Session 8.10 · ~5 min read
Why This System Is Different
Most system design problems covered in this course involve storing and retrieving data. A generative AI system does something fundamentally different: it produces new content. The user sends a question, and the system generates an answer word by word, drawing on a language model with billions of parameters. The infrastructure requirements are unlike anything in traditional web services.
GPU compute is the bottleneck, not network or disk I/O. Serving a 70-billion-parameter model requires on the order of 140 GB of GPU memory for the weights alone at 16-bit precision (roughly 40 GB with 4-bit quantization), and generating a full response takes 2-10 seconds. You cannot scale this the way you scale a REST API. Adding more CPU cores does nothing. You need more GPUs, and GPUs are expensive, scarce, and power-hungry.
The second challenge is cost. An H100 GPU costs roughly $25,000-30,000. A cluster of 64 GPUs for serving a 70B model costs over $1.5 million in hardware alone, not counting electricity, cooling, and networking. Every wasted GPU-second directly impacts the business. This makes batching, queuing, and rate limiting first-class architectural concerns.
Key insight: A generative AI system is a search system with a language model as the last mile. Most of the architecture (retrieval, caching, rate limiting, load balancing) is familiar. The unfamiliar part is the GPU serving layer and its unique constraints.
RAG Architecture
Retrieval-Augmented Generation (RAG) is the dominant pattern for building AI systems that answer questions about specific documents. Instead of fine-tuning a model on your data (expensive, slow, and hard to update), you retrieve relevant documents at query time and include them as context in the prompt. The model generates an answer grounded in those documents.
```mermaid
graph LR
    Q[User Query] --> EMB[Embedding Model<br/>text-embedding-3-small]
    EMB --> VS[(Vector Database<br/>Pinecone / pgvector)]
    VS --> CTX[Top-K Documents<br/>k=5]
    CTX --> PM[Prompt Assembly<br/>System + Context + Query]
    PM --> LLM[LLM Inference<br/>GPT-4 / Claude / Llama]
    LLM --> R[Streamed Response]
    DOC[Document Corpus] --> CH[Chunker<br/>Split into ~500 token chunks]
    CH --> EMB2[Embedding Model]
    EMB2 --> VS
```
The pipeline has two phases. The indexing phase (offline) splits documents into chunks, computes an embedding vector for each chunk, and stores those vectors in a vector database. The query phase (online) embeds the user's question, searches for the most similar document chunks, assembles a prompt with those chunks as context, and sends it to the language model for generation.
Chunking strategy matters. Chunks that are too small lose context. Chunks that are too large waste the model's limited context window and dilute relevance. A common approach is 300-500 tokens per chunk with 50-100 token overlap between consecutive chunks. The overlap prevents information from being split across a chunk boundary.
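A sliding-window chunker can be sketched in a few lines. This is a minimal illustration using whitespace-split words as a stand-in for real model tokens (a production chunker would count tokens with the embedding model's own tokenizer); the 400/75 numbers are one point inside the ranges above.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=75):
    """Split a token list into fixed-size chunks with overlap.

    'tokens' here are whitespace-split words, a stand-in for
    real tokenizer output. Consecutive chunks share `overlap`
    tokens so no fact is lost at a chunk boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the document
    return chunks

words = ("lorem " * 1000).split()          # toy 1,000-token document
chunks = chunk_tokens(words, chunk_size=400, overlap=75)
print(len(chunks), len(chunks[0]))         # 3 chunks; first is 400 tokens
```

Note that each chunk after the first repeats the last 75 tokens of its predecessor, which is exactly the overlap guarantee described above.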
Embedding quality matters. The embedding model converts text into a fixed-length vector (typically 768 or 1536 dimensions). Similar texts produce similar vectors. The quality of retrieval depends entirely on the embedding model's ability to capture semantic meaning. State-of-the-art models like OpenAI's text-embedding-3-small or Cohere's embed-v3 produce embeddings in roughly 5-10ms per text chunk.
Vector Database Comparison
The vector database stores document embeddings and performs approximate nearest-neighbor (ANN) search. Several options exist with different tradeoffs.
| Database | Type | Max Scale | Index Algorithm | Best For |
|---|---|---|---|---|
| Pinecone | Managed SaaS | Billions of vectors | Proprietary (HNSW-based) | Fastest time-to-production, fully managed |
| Weaviate | Open source / Cloud | Hundreds of millions | HNSW + hybrid BM25 | Hybrid search (vector + keyword), GraphQL API |
| pgvector | PostgreSQL extension | Tens of millions | IVFFlat / HNSW | Teams already on PostgreSQL, simpler ops |
| Qdrant | Open source / Cloud | Hundreds of millions | HNSW with filtering | High recall with complex filters |
| Milvus | Open source / Cloud | Billions of vectors | Multiple (IVF, HNSW, DiskANN) | Very large scale, cost efficiency |
| FAISS | Library (not a DB) | Depends on RAM | Multiple (IVF, PQ, HNSW) | Research, prototyping, in-process search |
For most production RAG systems, the choice comes down to operational complexity vs. performance. pgvector is the simplest: add an extension to your existing PostgreSQL, no new infrastructure. It works well up to about 5-10 million vectors. Beyond that, Pinecone or Weaviate offer better performance with managed infrastructure. At billion-vector scale, Milvus or Pinecone are the primary options.
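Under the hood, every option in the table approximates the same operation: find the k stored vectors most similar to the query. A minimal exact (brute-force) version with NumPy, on toy random embeddings, shows the baseline that HNSW and IVF indexes trade a little recall against for much better speed at scale:

```python
import numpy as np

def top_k_cosine(query, vectors, k=5):
    """Exact top-k retrieval by cosine similarity: the brute-force
    baseline that ANN indexes (HNSW, IVF) approximate at scale."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                   # cosine similarity of query vs. each row
    idx = np.argsort(-sims)[:k]    # indices of the k most similar rows
    return idx, sims[idx]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 64))            # toy "embeddings"
query = corpus[42] + 0.01 * rng.normal(size=64)   # near-duplicate of row 42
idx, sims = top_k_cosine(query, corpus, k=5)
print(idx[0])   # row 42 ranks first
```

Brute force is O(n) per query, which is fine for prototypes and small corpora (this is essentially what FAISS's flat index does); the databases above exist so the same query stays fast at hundreds of millions of vectors.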
High-Level System Architecture
```mermaid
graph LR
    subgraph "Client Layer"
        WEB[Web Client]
        API_C[API Client]
        GW[API Gateway<br/>Auth + Rate Limit]
        QU[Request Queue<br/>Redis / SQS]
    end
    subgraph "Retrieval Layer"
        EMB_S[Embedding Service]
        VDB[(Vector Database)]
        CACHE[(Response Cache<br/>Redis)]
    end
    subgraph "Inference Layer"
        LB_GPU[GPU Load Balancer]
        GPU1[GPU Worker 1<br/>vLLM / TGI]
        GPU2[GPU Worker 2<br/>vLLM / TGI]
        GPU3[GPU Worker 3<br/>vLLM / TGI]
    end
    subgraph "Data Layer"
        DOC_S[(Document Store<br/>S3 / PostgreSQL)]
        IDX[Indexing Pipeline<br/>Chunk + Embed + Store]
    end
    WEB --> GW
    API_C --> GW
    GW --> CACHE
    CACHE -->|miss| QU
    QU --> EMB_S
    EMB_S --> VDB
    VDB --> LB_GPU
    LB_GPU --> GPU1
    LB_GPU --> GPU2
    LB_GPU --> GPU3
    GPU1 --> WEB
    DOC_S --> IDX
    IDX --> VDB
```
The request flow: the API gateway authenticates the request and checks rate limits. It first checks the response cache (identical or near-identical questions often recur in customer support). On a cache miss, the request enters a queue. The embedding service converts the query to a vector, retrieves relevant documents from the vector database, assembles the prompt, and sends it to the GPU inference cluster. The GPU workers stream tokens back to the client as they are generated.
GPU Serving: Batching and Streaming
The GPU inference layer is the most expensive and constrained part of the system. Two techniques make it efficient.
Continuous batching. Naive inference processes one request at a time, and a single sequence's decode step uses only a small fraction of the GPU's compute, so most of the hardware sits underutilized. Continuous batching (implemented by vLLM and TGI) groups multiple requests into a single batch. When request A finishes, a new request B takes its slot immediately, without waiting for the entire batch to complete. This increases GPU utilization from 30-40% (naive) to 80-90% (continuous batching).
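The scheduling difference can be shown with a toy discrete-time simulation, counting decode steps instead of real GPU time. Request lengths and the 4-slot batch size are illustrative; the point is that static batching pays for the longest request in each batch while continuous batching refills freed slots immediately.

```python
from collections import deque

def static_batching(lengths, batch_size):
    """Static batching: the whole batch runs until its longest
    request finishes, so each batch costs max(lengths) steps."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching(lengths, batch_size):
    """Continuous batching: a finished request's slot is refilled
    from the queue before the next decode step."""
    queue = deque(lengths)
    active, steps = [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())     # fill free slots
        active = [n - 1 for n in active]       # one decode step for all
        active = [n for n in active if n > 0]  # drop finished requests
        steps += 1
    return steps

# two long requests mixed with short ones (token counts to generate)
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching(lengths, 4), continuous_batching(lengths, 4))
```

With this workload, static batching needs 200 decode steps while continuous batching needs 110: the short requests no longer wait for the 100-token stragglers in their batch.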
Token streaming. A 500-token response takes 5-10 seconds to generate fully. Making the user wait 10 seconds for the complete response feels slow. Streaming sends each token to the client as it is generated (via Server-Sent Events or WebSocket). The user sees the response appear word by word, which feels much faster even though total generation time is unchanged.
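A streaming endpoint is usually a thin wrapper around a token generator. This sketch fakes the model with a hard-coded token list and formats each token as a Server-Sent Events frame; a real service would wrap vLLM or an OpenAI-style streaming API the same way.

```python
import time

def generate_tokens(prompt):
    """Stand-in for model inference: yields tokens one at a time,
    the way a streaming LLM API delivers a completion."""
    for token in ["Our", " return", " policy", " is", " 30", " days", "."]:
        time.sleep(0.01)   # simulate per-token decode latency
        yield token

def sse_stream(prompt):
    """Wrap each generated token as a Server-Sent Events frame,
    with a [DONE] sentinel so the client knows the stream ended."""
    for token in generate_tokens(prompt):
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

frames = list(sse_stream("What is your return policy?"))
print(frames[0], frames[-1])
```

Because the function is a generator, a web framework can flush each frame to the client as it is produced; the user starts reading after the first token instead of after the last.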
Batch size involves a fundamental tradeoff. Larger batch sizes reduce per-request cost dramatically (a batch of 8 costs roughly 20% of a single request per response). But latency increases because requests in a batch share GPU time. For real-time chatbots, batch sizes of 8-16 offer the best balance. For offline batch processing (summarizing 10,000 documents), batch sizes of 32 or higher are appropriate because latency does not matter.
Rate Limiting for Inference Costs
In a traditional web API, rate limiting protects server capacity. In an LLM-serving system, rate limiting also protects your budget. Each request costs real money: a 1,000-token response from GPT-4 costs roughly $0.03-0.06. At 100,000 requests per day, that is $3,000-6,000 per day in inference costs alone.
Rate limiting strategies for AI systems include: per-user token limits (each user gets 50,000 tokens per day), per-request token caps (maximum 4,000 output tokens per response), priority queuing (paid users get dedicated GPU capacity, free users share a pool), and response caching (identical questions served from cache, zero GPU cost).
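The first of those strategies, a per-user daily token budget, fits in a few lines. This is a minimal in-process sketch with illustrative numbers; in production the counter would live in Redis (e.g. `INCRBY` with a daily-expiring key) so all gateway instances see the same budget.

```python
import time

class DailyTokenBudget:
    """Per-user daily token allowance (e.g. 50,000 tokens/day).
    The counter resets when the UTC day rolls over."""
    def __init__(self, limit=50_000):
        self.limit = limit
        self.used = {}   # user_id -> (day_number, tokens_used)

    def try_consume(self, user_id, tokens, now=None):
        day = int((now if now is not None else time.time()) // 86_400)
        used_day, used = self.used.get(user_id, (day, 0))
        if used_day != day:
            used = 0                    # new day: reset the counter
        if used + tokens > self.limit:
            return False                # reject: budget exhausted
        self.used[user_id] = (day, used + tokens)
        return True

budget = DailyTokenBudget(limit=50_000)
print(budget.try_consume("u1", 40_000, now=0))       # True
print(budget.try_consume("u1", 15_000, now=0))       # False: would exceed
print(budget.try_consume("u1", 15_000, now=90_000))  # True: next day
```

Charging the budget by tokens rather than by requests matters because one long response can cost as much as dozens of short ones.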
The response cache is particularly effective for customer support chatbots. If 100 customers ask "what is your return policy?" in one day, the first query hits the GPU. The remaining 99 are served from cache in under 10ms. Even with fuzzy matching (semantic similarity between cached questions above 0.95), cache hit rates of 30-50% are common for support use cases.
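A semantic cache differs from a plain key-value cache only in the lookup: instead of an exact string match, a hit is any cached query embedding within the similarity threshold. This sketch uses tiny hand-made vectors and a linear scan purely for illustration; a real system would embed the query text and keep cache keys in the vector database or a small ANN index.

```python
import numpy as np

class SemanticCache:
    """Response cache keyed by query embedding: a hit is any stored
    entry whose cosine similarity to the new query >= threshold."""
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.keys, self.values = [], []

    def get(self, emb):
        for key, value in zip(self.keys, self.values):
            sim = float(key @ emb /
                        (np.linalg.norm(key) * np.linalg.norm(emb)))
            if sim >= self.threshold:
                return value           # near-duplicate question: cache hit
        return None                    # miss: must go to the GPU

    def put(self, emb, response):
        self.keys.append(emb)
        self.values.append(response)

cache = SemanticCache(threshold=0.95)
q1 = np.array([1.0, 0.0, 0.0])         # toy embedding of the first question
cache.put(q1, "30-day return policy")
q2 = np.array([0.99, 0.05, 0.0])       # slightly rephrased question
print(cache.get(q2))                   # hit: served without GPU work
print(cache.get(np.array([0.0, 1.0, 0.0])))  # unrelated question: None
```

The threshold is the key tuning knob: too low and users get answers to someone else's question; too high and paraphrases miss the cache.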
Production Considerations
Hallucination detection. The model can generate plausible-sounding answers that are factually wrong. RAG reduces this by grounding responses in retrieved documents, but does not eliminate it. Production systems add a verification step: check whether the generated answer is actually supported by the retrieved context. If the model claims "our return policy is 60 days" but the retrieved document says "30 days," flag it for human review.
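A crude but cheap version of that verification step is to flag numeric claims in the answer that never appear in the retrieved context. This sketch only catches number mismatches like the 60-vs-30-day example; production systems layer an NLI model or an LLM-as-judge entailment check on top.

```python
import re

def unsupported_numbers(answer, context):
    """Return numeric claims in the answer that do not occur in the
    retrieved context. A non-empty result is a hallucination signal
    worth routing to human review."""
    answer_nums = set(re.findall(r"\d+(?:\.\d+)?", answer))
    context_nums = set(re.findall(r"\d+(?:\.\d+)?", context))
    return answer_nums - context_nums

context = "Electronics may be returned within 30 days of purchase."
answer = "Our return policy is 60 days for electronics."
flags = unsupported_numbers(answer, context)
print(flags)   # the unsupported '60' claim gets flagged
```

The same pattern generalizes: extract any checkable claim type (numbers, dates, product names) and verify each against the context before the answer is shown.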
Document freshness. When documents change, their embeddings must be re-computed and updated in the vector database. A stale index means the system retrieves outdated information. The indexing pipeline should run incrementally: detect changed documents, re-chunk, re-embed, and upsert into the vector store. For fast-moving content (product catalogs, pricing pages), this pipeline runs hourly or on every document change.
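The change-detection step of that incremental pipeline can be as simple as comparing content hashes against the last indexed state. This sketch keeps the state in a dict for illustration; a real pipeline would persist it (e.g. a PostgreSQL table) and feed the changed IDs into the chunk/embed/upsert stages.

```python
import hashlib

def detect_changed(docs, index_state):
    """Compare each document's content hash against the last indexed
    state; only changed or new documents need re-chunking,
    re-embedding, and upserting into the vector store."""
    changed = []
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if index_state.get(doc_id) != digest:
            changed.append(doc_id)
            index_state[doc_id] = digest   # record as (re)indexed
    return changed

index_state = {}
docs = {"faq-1": "Returns accepted within 30 days.",
        "faq-2": "Free shipping over $50."}
print(detect_changed(docs, index_state))   # first run: everything is new
docs["faq-2"] = "Free shipping over $35."
print(detect_changed(docs, index_state))   # only faq-2 changed
```

Hashing makes the pipeline idempotent: re-running it against an unchanged corpus re-embeds nothing, which matters when embedding calls cost money.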
Observability. Track retrieval quality (are the top-K documents actually relevant?), generation quality (are users satisfied with answers?), latency (time to first token, total generation time), and cost (tokens consumed per request, GPU utilization). Log the full prompt (query + retrieved context) for debugging bad answers.
Further Reading
- Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference (arXiv, 2024)
- Best Vector Databases: A Complete Comparison Guide (Firecrawl, 2026)
- Best Practices in RAG Evaluation (Qdrant Blog)
- vLLM: Easy, Fast, and Cheap LLM Serving (GitHub)
- Vector Database Comparison: Pinecone vs Weaviate vs Qdrant vs FAISS vs Milvus vs Chroma (LiquidMetal AI)
Assignment
Design a RAG-powered customer support chatbot for an e-commerce company. The company has 5,000 support articles, 200 FAQ pages, and 50 product manuals (totaling roughly 2 million words).
- Draw the high-level architecture. Show where documents are stored, how they are indexed (chunked and embedded), and how a user query flows from input to streamed response.
- Choose a vector database. Justify your choice based on the document count, expected query volume (1,000 queries per day initially, scaling to 50,000), and team size (3 engineers).
- Calculate the embedding storage requirements. If each chunk is ~400 tokens and you use 1536-dimensional embeddings, how many chunks do you expect? How much vector storage is needed?
- A user asks: "Can I return a laptop after 45 days?" The retrieved context says the return policy is 30 days for electronics. The model generates: "Yes, you can return the laptop within 45 days." How do you detect and prevent this hallucination?
- Design the caching layer. What do you use as the cache key (exact query match, or semantic similarity)? What is the expected cache hit rate for a support chatbot?