Online Inter College
Technology

The Complete Guide to System Design: From Basics to Production

Girish Sharma
January 20, 2024 · 36 min read

Most Systems Are Not Designed. They Are Grown. And the Ones That Were Grown Without Intention Are the Ones That Wake Engineers Up at 3am.

A team at a logistics company inherited a system that had been in production for six years. Nobody on the current team had been there when it was built. The original architects had long since moved on. The system processed 40 million shipment events per day and had a documented uptime of 99.94 percent.

It also had a database schema with 847 columns in a single table. A caching layer that nobody fully understood, implemented by someone who had left three years ago. Six undocumented microservices that communicated through a shared database instead of APIs. A message queue that had been added as an emergency fix during a Black Friday incident four years prior and never properly integrated into the architecture.

The system worked. It worked in the way that a house built without blueprints sometimes works — because enough individual decisions were made correctly in isolation to produce something that stands. But every change was dangerous. Every new engineer took months to become productive. Every incident took twice as long to diagnose as it should have.

This is what systems that are grown rather than designed look like from the inside. They function. They do not scale, adapt, or survive well.

System design is the discipline of making intentional architectural decisions before building — and continuing to make intentional decisions as systems evolve. It is the difference between a codebase that gets easier to work in as it matures and one that gets harder.

This guide covers the complete arc from foundational concepts to production-grade decisions — the vocabulary, the patterns, the trade-offs, and the frameworks that make the difference between systems that are built and systems that are designed.

💡 How to use this guide: This is a reference and a mental model builder, not a tutorial with exercises. Read it sequentially the first time to build the complete mental model. Return to specific sections when you are facing a real design decision and need a framework for thinking it through.


Part 1: The Vocabulary of System Design

Before patterns and trade-offs, the vocabulary. System design discussions fail most often not because the participants lack knowledge but because they are using the same words to mean different things. Precise vocabulary is the foundation.


Scalability

Scalability is a system's ability to handle increased load without degrading the quality of its behavior. Load can mean requests per second, data volume, concurrent users, message throughput, or any other relevant measure of demand.

Vertical scaling means adding more resources to existing machines — more CPU, more RAM, more storage. It is simple to implement, has no architectural implications, and has a hard ceiling defined by the largest machine available. It also creates a single point of failure — when that machine fails, everything on it fails.

Horizontal scaling means adding more machines and distributing load across them. It has no theoretical ceiling and eliminates single points of failure at the compute layer. It introduces complexity: load balancing, data consistency across instances, session management, and distributed coordination all become problems that do not exist with a single machine.

Vertical Scaling:
  Before: 1 server with 8 CPU cores and 32GB RAM
  After:  1 server with 32 CPU cores and 128GB RAM
  Cost:   Simple, but expensive per unit of capacity
  Limit:  Maximum available machine size
  Risk:   Still a single point of failure

Horizontal Scaling:
  Before: 1 server with 8 CPU cores and 32GB RAM
  After:  4 servers each with 8 CPU cores and 32GB RAM
  Cost:   Commodity hardware, cheaper per unit of capacity at scale
  Limit:  Theoretically unlimited
  Risk:   Distributed system complexity introduced

Availability

Availability is the percentage of time a system is operational and accessible. It is expressed as a percentage and translated into acceptable downtime per time period.

Availability   Downtime per Year   Downtime per Month   Downtime per Week
99%            3.65 days           7.31 hours           1.68 hours
99.9%          8.77 hours          43.83 minutes        10.08 minutes
99.99%         52.60 minutes       4.38 minutes         1.01 minutes
99.999%        5.26 minutes        26.30 seconds        6.05 seconds
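As a quick check on these figures, a few lines of Python convert an availability percentage into allowed downtime. A minimal sketch, assuming a 365-day year (which is why the hours come out slightly below the table's 365.25-day figures):

```python
# Convert an availability percentage into allowed downtime per period.

def downtime_seconds(availability_pct: float, period_seconds: int) -> float:
    """Seconds of allowed downtime for a given availability over a period."""
    return (1 - availability_pct / 100) * period_seconds

YEAR = 365 * 24 * 3600  # 365-day year, in seconds

for pct in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_seconds(pct, YEAR) / 3600
    print(f"{pct}% -> {hours:.2f} hours of downtime per year")
```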

The difference between 99.9 percent and 99.99 percent availability is roughly 7.9 hours of downtime per year. For a payment processing system handling $10 million in transactions per day, those 7.9 hours represent roughly $3.3 million in transaction volume. The cost of the additional engineering investment required to close that gap is almost always less than the cost of the downtime it prevents.


Reliability

Reliability and availability are related but distinct. Availability measures whether a system is accessible. Reliability measures whether a system produces correct results when it is accessible.

A system can be highly available but unreliable — serving responses quickly but with incorrect data. A caching bug that returns stale data to 100 percent of requests is an availability success and a reliability failure. A distributed data store that partitions and continues serving reads from a lagged replica is available but may be temporarily unreliable.


Consistency

In distributed systems, consistency has a specific meaning: whether all nodes in the system see the same data at the same time. It is one of the three properties in the CAP theorem, which states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance. In the presence of a network partition, the system must choose between consistency and availability.

Strong consistency means every read returns the most recent write. All nodes see the same data simultaneously. This requires coordination between nodes on every write, which adds latency and creates availability risk — if the coordination mechanism fails, writes fail.

Eventual consistency means all nodes will eventually converge to the same state, but may be temporarily out of sync. Reads may return stale data for a brief period after a write. This allows higher availability and lower write latency at the cost of temporary data divergence.

Strong Consistency Example:
  User updates their profile photo at 2:00:00pm
  User refreshes the page at 2:00:01pm
  Guaranteed: The new photo appears — all nodes have the update

Eventual Consistency Example:
  User updates their profile photo at 2:00:00pm
  User refreshes the page at 2:00:01pm
  Possible: The old photo appears for a brief window
  Guaranteed: Eventually all nodes will show the new photo
  Typical convergence time: Milliseconds to seconds

Latency and Throughput

Latency is the time it takes to complete a single operation — from the moment a request is sent to the moment a response is received. It is measured in milliseconds for most web applications and in microseconds for systems like financial trading platforms where latency is a primary competitive metric.

Throughput is the number of operations a system can complete per unit of time — requests per second, messages per second, transactions per second. Throughput and latency are related but not equivalent. A system can have low latency and low throughput (handles individual requests quickly but cannot handle many simultaneously) or high throughput and high latency (handles many requests simultaneously but each takes a long time).

The numbers every system designer should have memorized:

L1 cache reference:          0.5 nanoseconds
L2 cache reference:          7 nanoseconds
Main memory (RAM) access:    100 nanoseconds
SSD random read:             150 microseconds
HDD random read:             10 milliseconds
Network round trip same DC:  500 microseconds
Network round trip US to EU: 150 milliseconds

These numbers explain why caching works so dramatically — the difference between memory access and disk access is a factor of 100,000. They explain why keeping working data sets in RAM is one of the highest-leverage performance optimizations available. And they explain why geographic distribution of data matters for latency-sensitive applications.


Part 2: The Core Components

Every large-scale system is an assembly of a relatively small set of core components. Understanding what each component does, what problems it solves, and what trade-offs it introduces is the foundational knowledge that makes system design tractable.


Load Balancers

A load balancer distributes incoming traffic across multiple backend servers. It is the entry point for horizontal scaling — without a load balancer, every client would need its own logic for discovering servers and choosing among them.

Load balancing algorithms:

Round robin distributes requests sequentially across servers. Simple, predictable, and effective when servers are identical and requests have similar processing cost. Fails when servers have different capacities or when requests have wildly different processing costs.

Least connections routes each new request to the server with the fewest active connections. Better than round robin for workloads with variable request processing time. Requires the load balancer to track active connections on each server.

IP hash routes requests from the same client IP to the same server consistently. Creates sticky sessions without application-layer session management. Breaks load distribution when many clients share an IP (corporate proxies, NAT networks).

Weighted assigns higher traffic percentages to servers with more capacity. Effective for heterogeneous server pools or for gradual traffic shifting during deployments.
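The first two algorithms above can be sketched in a few lines of Python. Server names and the in-process bookkeeping are illustrative; a real load balancer tracks connections out of band:

```python
from itertools import cycle

servers = ["app-1", "app-2", "app-3"]

# Round robin: hand out servers in a fixed rotation.
rr = cycle(servers)

def round_robin() -> str:
    return next(rr)

# Least connections: track active connections, pick the least loaded.
active = {s: 0 for s in servers}

def least_connections() -> str:
    server = min(active, key=active.get)
    active[server] += 1  # caller must release when the request finishes
    return server

def finish(server: str) -> None:
    active[server] -= 1
```

Least connections degenerates to round robin when all requests cost the same; its value appears when some requests hold a server far longer than others.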

Layer 4 vs Layer 7 load balancing:

Layer 4 load balancers operate at the transport layer (TCP/UDP). They are fast and have minimal overhead but cannot make routing decisions based on request content — only on IP addresses and ports.

Layer 7 load balancers operate at the application layer (HTTP/HTTPS). They can route based on URL paths, request headers, cookies, and content type. They enable path-based routing, host-based routing, SSL termination, and request manipulation. They have higher overhead than Layer 4 but provide the intelligence required for most modern routing requirements.


Caches

A cache stores the results of expensive operations so they can be served from fast storage rather than recomputed or refetched on every request.

Cache placement patterns:

Client-side caching stores responses in the browser or client application. Zero server load for cached responses. The client controls freshness through HTTP cache headers. Effective for static content, user-specific data that rarely changes, and responses that are expensive to generate but identical across short time windows.

CDN caching stores responses at geographically distributed edge nodes. Serves content from a location close to the user. Reduces latency for geographically distributed users and eliminates origin server load for cacheable content. Effective for static assets, public API responses, and content that is identical across many users.

Application-level caching stores data in a fast in-memory store (Redis, Memcached) between the application layer and the database. Dramatically reduces database load for frequently read data. Introduces cache invalidation complexity. Effective for database query results, computed aggregations, and frequently accessed entities.

Database query caching stores query results at the database layer. Transparent to the application. Limited by database cache memory. Effective for identical repeated queries but does not help when queries vary by user or parameter.
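Application-level caching is most often implemented as cache-aside: check the cache, fall back to the database on a miss, and populate the cache on the way out. A minimal sketch, using a plain dict in place of Redis and a stub `load_from_db` function (both are stand-ins, not a real client API):

```python
import time

CACHE: dict[str, tuple[float, object]] = {}  # key -> (expires_at, value)
TTL_SECONDS = 60

def load_from_db(key: str) -> str:
    return f"value-for-{key}"  # placeholder for the real query

def get(key: str) -> object:
    entry = CACHE.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                       # cache hit
    value = load_from_db(key)                 # cache miss: fetch from source
    CACHE[key] = (time.time() + TTL_SECONDS, value)
    return value

def invalidate(key: str) -> None:
    CACHE.pop(key, None)                      # event-based invalidation on write
```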

The cache invalidation hierarchy:

When data changes, the following caches may contain stale copies:

User's browser cache
  CDN edge cache (one or many edge nodes)
    Application server cache (one or many application instances)
      Database query cache

Each layer needs an invalidation strategy.
The harder the invalidation, the more expensive it is to maintain consistency.
The easier the invalidation, the less value the cache provides.

Databases

The database choice is one of the most consequential and most irreversible decisions in system design. It must be made with a clear understanding of the access patterns, scale requirements, and consistency requirements of the specific system.

Relational databases (PostgreSQL, MySQL, SQLite)

Store data in tables with defined schemas and relationships. Support ACID transactions — Atomicity, Consistency, Isolation, Durability. Use SQL for querying. Support complex joins, aggregations, and ad-hoc queries. Scale reads well with read replicas. Scale writes less well — primary database is a write bottleneck.

Best for: Systems with complex relationships between entities, systems requiring ACID transactions (financial systems, inventory management), systems with evolving and ad-hoc query patterns, most general-purpose applications.

Document databases (MongoDB, CouchDB, Firestore)

Store data as flexible JSON-like documents. No fixed schema — documents in the same collection can have different structures. Horizontal scaling through sharding is built-in. Support document-level ACID transactions in modern versions.

Best for: Systems with evolving schemas, systems where data is naturally document-shaped, content management systems, user profiles, catalog data.

Key-value stores (Redis, DynamoDB, Memcached)

Store data as key-value pairs. Extremely fast for simple access patterns — get by key, set by key. Limited query capability beyond key lookup. Redis adds data structures (lists, sets, sorted sets, hashes) that enable sophisticated use cases at low latency.

Best for: Caching, session storage, leaderboards, rate limiting, real-time data, anything requiring sub-millisecond access by known key.

Column-family stores (Cassandra, HBase, Bigtable)

Store data in columns rather than rows, optimized for write-heavy workloads at massive scale. Excellent horizontal scaling characteristics. Limited transaction support. Requires upfront knowledge of access patterns to design column families effectively.

Best for: Time-series data, event logs, analytics data, IoT sensor data, any system requiring very high write throughput at scale.

Graph databases (Neo4j, Amazon Neptune)

Store data as nodes and edges, optimized for traversing relationships. SQL joins that would require many hops become single traversal queries. Limited performance and scaling for non-graph access patterns.

Best for: Social networks, recommendation engines, fraud detection, knowledge graphs, any system where relationships between entities are the primary data.

Search engines (Elasticsearch, Solr, Typesense)

Store data in inverted indexes optimized for full-text search and complex filtering. Support relevance ranking, faceting, aggregations, and typo tolerance. Not designed as primary databases — typically receive data from a primary database via sync pipeline.

Best for: Product search, document search, log analysis, any full-text search requirement.


Message Queues

A message queue decouples producers of work from consumers of work. Producers place messages into the queue without knowing or caring which consumer will process them. Consumers pull messages from the queue at their own pace without knowing or caring which producer created them.

What message queues enable:

Asynchronous processing — expensive or slow operations run in the background without blocking the request that initiated them. Email sending, image processing, report generation, and third-party API calls are all candidates for async processing through a queue.

Load leveling — a sudden spike in requests creates a burst of messages in the queue. Consumers process the queue at a steady rate rather than being overwhelmed by the spike. The queue absorbs the spike and smooths the processing load.

Decoupling — producers and consumers can be developed, deployed, and scaled independently. Adding a new consumer of an event requires no changes to the producer.

Durability — messages persisted in a queue survive consumer failures. When a consumer recovers, it processes the messages it missed.

The major queue patterns:

Point-to-point queuing — each message is consumed by exactly one consumer. Used for work distribution where each job should be processed once. Examples: job queues, task processing, order fulfillment events.

Publish-subscribe — each message is delivered to all subscribers. Used for event broadcasting where multiple systems need to react to the same event. Examples: user signup events triggering welcome email, analytics recording, and CRM update simultaneously.

Point-to-Point:
  Producer → [Queue] → Consumer A (processes the message)
             Consumer B (does not receive this message)
             Consumer C (does not receive this message)

Publish-Subscribe:
  Producer → [Topic] → Consumer A (receives and processes)
                     → Consumer B (receives and processes)
                     → Consumer C (receives and processes)
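The two delivery models can be contrasted in a few lines of in-memory Python. The deque and handler list are stand-ins for a real broker; the event shape is illustrative:

```python
from collections import deque

# Point-to-point: one queue, each message consumed exactly once.
queue: deque = deque()
queue.append("job-1")
job = queue.popleft()      # whichever consumer polls first gets it

# Publish-subscribe: a topic fans each message out to every subscriber.
subscribers = []

def subscribe(handler) -> None:
    subscribers.append(handler)

def publish(event: dict) -> None:
    for handler in subscribers:
        handler(event)     # every subscriber sees every event

received = []
subscribe(lambda e: received.append(("email", e)))
subscribe(lambda e: received.append(("analytics", e)))
publish({"type": "user_signed_up", "user_id": 42})
# received now holds one entry per subscriber
```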

Content Delivery Networks

A CDN is a geographically distributed network of servers that cache content close to users. When a user requests content, they receive it from the nearest CDN edge node rather than from the origin server.

What CDNs accelerate:

Static assets — JavaScript, CSS, images, fonts, and video — are identical for all users and change infrequently. They are ideal CDN candidates. A JavaScript bundle served from a CDN node 20 milliseconds from the user is delivered dramatically faster than the same bundle served from an origin server 200 milliseconds away.

Dynamic content that is identical across users or large user segments can also be CDN-cached with appropriate cache headers and cache key configuration. A product catalog page that is the same for all anonymous users, updated hourly, is a valid CDN cache candidate.

CDN as a security layer:

Modern CDNs provide DDoS mitigation, bot detection, rate limiting, and Web Application Firewall capabilities at the edge. Attack traffic is absorbed at CDN edge nodes and never reaches the origin infrastructure. This makes CDNs a critical component of production security architecture, not just a performance optimization.


Part 3: The Foundational Patterns

With vocabulary and components established, the patterns — recurring architectural solutions to common system design problems.


The Client-Server Pattern

The foundational pattern of modern distributed systems. Clients request services from servers. Servers respond. The pattern separates concerns between those who consume functionality and those who provide it.

Variants:

Three-tier architecture adds a separation between presentation (client), application logic (server), and data storage (database). Each tier can be scaled, developed, and deployed independently. Most web applications follow this pattern.

N-tier architecture extends to additional layers — typically adding an API gateway, service layer, or caching layer between the client and the database.

Thin client vs thick client determines where application logic lives. Thin clients push most logic to the server and receive rendered output. Thick clients (single-page applications) pull data from the server and render it locally. Modern web applications are typically thick clients that communicate with API servers.


The Microservices Pattern

Microservices architecture decomposes a system into small, independently deployable services, each owning a specific bounded context of the business domain.

The core principles:

Each service owns its own data — no shared databases between services. Services communicate over well-defined APIs. Each service can be deployed, scaled, and failed independently. Services are organized around business capabilities, not technical layers.

Service communication patterns:

Synchronous (REST, gRPC) — the caller waits for a response before continuing. Simple to reason about, but it creates temporal coupling: if the downstream service is unavailable, the call fails.

Asynchronous (message queues, event streams) — the caller sends a message and continues without waiting for processing to complete. Decouples producer and consumer, enables higher availability, makes distributed transactions harder.

The microservices trade-off table:

Benefit                     Cost
Independent deployability   Distributed system complexity
Independent scalability     Network latency on every cross-service call
Technology heterogeneity    Distributed transaction management
Fault isolation             Service discovery and load balancing overhead
Team autonomy               Operational complexity multiplied by service count

⚠️ The microservices prerequisite list: Before adopting microservices, a team needs robust automated testing, CI/CD pipelines for each service, distributed tracing, centralized log aggregation, service discovery, and mature operational capabilities. Teams that adopt microservices without these foundations trade a monolith's simplicity for a distributed system's complexity without gaining the distributed system's benefits.


The Event-Driven Pattern

In event-driven architecture, components communicate through events — messages that describe something that happened — rather than through direct calls. Producers emit events without knowing which consumers will react. Consumers subscribe to the events they care about without knowing which producers created them.

The event-driven vocabulary:

Event — an immutable record of something that happened. "Order placed with ID 12345 at 14:32:00." Events are facts about the past.

Event stream — an ordered, durable log of events. New events are appended. Old events are retained. Consumers can process the stream from any point, enabling replay, backfill, and late-joining consumers.

Event sourcing — storing the event stream as the primary source of truth, deriving current state by replaying events. Provides a complete audit log, enables temporal queries, allows rebuilding read models. Increases write complexity and requires careful event schema management.

CQRS (Command Query Responsibility Segregation) — separating the write model (commands that change state) from the read model (queries that return data). Commands update the event store. Events trigger updates to denormalized read models optimized for specific query patterns.

Traditional CRUD:
  Client → API → Database (read and write same model)

CQRS with Event Sourcing:
  Client → Command Handler → Event Store (write side)
  Event Store → Event Handlers → Read Models (multiple optimized views)
  Client → Query Handler → Read Models (read side, optimized for each use case)
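The write side of this diagram rests on deriving current state by replaying events. A minimal event-sourcing sketch; the event names and shapes here are illustrative, not a standard:

```python
events = [
    {"type": "AccountOpened", "balance": 0},
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 50},
]

def apply(state: dict, event: dict) -> dict:
    """Fold one event into the current state."""
    if event["type"] == "AccountOpened":
        return {"balance": event["balance"]}
    if event["type"] == "Deposited":
        return {"balance": state["balance"] + event["amount"]}
    if event["type"] == "Withdrawn":
        return {"balance": state["balance"] - event["amount"]}
    return state

state = {}
for event in events:
    state = apply(state, event)
# Replaying from any prefix of the stream reconstructs the state
# as it was at that point in time.
```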

The API Gateway Pattern

An API gateway is a single entry point for all client requests. It routes requests to appropriate backend services, handles cross-cutting concerns — authentication, rate limiting, logging, SSL termination — and presents a unified API surface to clients regardless of backend service structure.

What API gateways handle:

  • Request routing to backend services

  • Authentication and authorization enforcement

  • Rate limiting and quota management

  • Request and response transformation

  • SSL termination

  • Request logging and distributed trace initiation

  • Response caching

  • Circuit breaking for downstream services
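Of the concerns above, rate limiting is commonly implemented as a token bucket: tokens refill at a steady rate, each request spends one, and an empty bucket means rejection. A minimal sketch (the rate and capacity numbers are illustrative):

```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec       # tokens refilled per second
        self.capacity = capacity       # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=5, capacity=10)
# A burst of 10 requests succeeds immediately; after that,
# requests are admitted at roughly 5 per second.
```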

The API gateway as a client compatibility layer:

When backend services evolve, the API gateway can translate between old and new API contracts without requiring clients to update simultaneously. This is particularly valuable for mobile clients where forcing an upgrade is not possible.


The Saga Pattern

Distributed transactions — operations that must span multiple services atomically — are one of the hardest problems in microservices architecture. The saga pattern provides a solution through a sequence of local transactions coordinated by either a central orchestrator or distributed choreography.

Orchestration saga:

A central orchestrator tells each service what to do and handles compensation (rollback) when a step fails.

Order Service creates order (status: pending)
  → Orchestrator tells Inventory Service to reserve items
    → Success: Orchestrator tells Payment Service to charge
      → Success: Orchestrator tells Order Service to confirm order
      → Failure: Orchestrator tells Inventory Service to release reservation
                 Orchestrator tells Order Service to cancel order
    → Failure: Orchestrator tells Order Service to cancel order
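The orchestration flow above reduces to straight-line code in the orchestrator, with compensation on each failure branch. A sketch with stubbed service calls standing in for real RPCs; all function names and order fields here are illustrative:

```python
def reserve_items(order):  return order["items_in_stock"]   # Inventory Service stub
def release_items(order):  order["reserved"] = False        # compensation
def charge_payment(order): return order["card_valid"]       # Payment Service stub
def confirm(order):        order["status"] = "confirmed"
def cancel(order):         order["status"] = "cancelled"

def run_order_saga(order: dict) -> dict:
    order["status"] = "pending"
    if not reserve_items(order):
        cancel(order)                 # nothing to compensate yet
        return order
    order["reserved"] = True
    if not charge_payment(order):
        release_items(order)          # compensate the reservation
        cancel(order)
        return order
    confirm(order)
    return order
```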

Choreography saga:

Each service reacts to events and emits events. No central coordinator. Each service knows what to do when it receives a specific event and what compensation event to emit if it needs to roll back.

Order Service creates order → emits OrderCreated event
  Inventory Service receives OrderCreated → reserves items → emits ItemsReserved
    Payment Service receives ItemsReserved → charges payment → emits PaymentCharged
      Order Service receives PaymentCharged → confirms order → emits OrderConfirmed

If payment fails:
  Payment Service emits PaymentFailed
    Inventory Service receives PaymentFailed → releases reservation → emits ItemsReleased
      Order Service receives ItemsReleased → cancels order → emits OrderCancelled

Part 4: Designing for Scale — The Decisions That Matter

The vocabulary is established. The components are understood. The patterns are familiar. Now the decisions — the choices that determine whether a system handles 10 users or 10 million.


Database Scaling: The Sequence That Works

Database scaling has a natural progression. Each step builds on the previous one and should be taken only when the previous step is insufficient. Skipping ahead introduces complexity that is not yet justified by the actual bottleneck.

Step 1: Optimize before scaling.

Queries that are slow because of missing indexes or inefficient access patterns will still be slow at any scale; more hardware does not fix them. Before adding infrastructure, instrument every slow query, add missing indexes, eliminate N-plus-1 patterns, and optimize the queries that consume the most total database time.

-- Find the queries worth optimizing first
SELECT
  query,
  calls,
  total_exec_time,
  total_exec_time / calls AS avg_ms,
  rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Step 2: Add read replicas for read-heavy workloads.

Most applications have read-to-write ratios of 10:1 or higher. Adding read replicas and routing read traffic to them multiplies the system's read capacity with relatively low architectural complexity. The primary handles all writes. Replicas handle reads and lag behind the primary by milliseconds to seconds.

Step 3: Add a caching layer.

For data that is read frequently and changes infrequently, a Redis cache in front of the database can reduce database load by 80 to 95 percent for those data patterns. The cache makes the effective read capacity of the system a function of Redis performance, which is orders of magnitude higher than database performance.

Step 4: Implement connection pooling.

Application servers create new database connections on startup. At scale, hundreds of application instances each trying to maintain tens of connections creates a connection count that exceeds what the database can efficiently handle. A connection pooler like PgBouncer multiplexes hundreds of application connections onto a smaller set of real database connections.

Step 5: Partition large tables.

Tables that have grown beyond 100 million rows can be partitioned by a natural dimension — date, geography, tenant, or user ID range. Queries that include the partition key in their WHERE clause only touch the relevant partition, dramatically reducing the data scanned.

Step 6: Shard when partitioning is insufficient.

Sharding distributes rows across multiple physically separate databases based on a shard key. Queries that include the shard key route to a single shard. Queries that do not include the shard key must fan out to all shards, making shard key selection critical.

Sharding by user ID:
  Users 1 to 10,000,000:     Shard 1
  Users 10,000,001 to 20,000,000: Shard 2
  Users 20,000,001 to 30,000,000: Shard 3

Implication: Any query filtering by user_id routes to one shard.
Any query NOT filtering by user_id hits all shards simultaneously.
Cross-shard joins are impossible — data that must be joined must live on the same shard.
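The range-based routing above reduces to one arithmetic function. A sketch, assuming the 10-million-user ranges shown:

```python
SHARD_SIZE = 10_000_000

def shard_for(user_id: int) -> int:
    """Shard 1 holds users 1..10M, shard 2 holds 10,000,001..20M, and so on."""
    return (user_id - 1) // SHARD_SIZE + 1
```

Hash-based sharding (`hash(key) % num_shards`) distributes load more evenly than ranges but makes adding shards harder, since most keys remap when the shard count changes.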

Designing for High Availability

High availability requires eliminating every single point of failure — every component whose failure would make the system inaccessible. The framework is systematic: list every component, identify which have no redundancy, and add redundancy in order of criticality.

The single point of failure audit:

Component          Redundant?  If fails...         Solution
Load balancer      No          All traffic stops   Active-passive LB pair
Application server Yes (4)     25% capacity loss   Auto-scaling group
Database primary   No          Writes impossible   Primary with hot standby
Redis cache        No          Cache misses surge  Redis Sentinel or Cluster
Message queue      Yes (3)     33% capacity loss   Already redundant
CDN                Yes (many)  Edge node fails     CDN has built-in redundancy
DNS                Yes         Traffic stops       Multiple DNS providers

The availability math — serial system components:

When system components are arranged in series (each must be available for the system to be available), the overall availability is the product of each component's availability.

Component availabilities:
  Load balancer:    99.99%
  Application tier: 99.95%
  Database:         99.90%
  Cache:            99.99%

System availability: 0.9999 × 0.9995 × 0.9990 × 0.9999
                   = 0.9983 = 99.83%

Even though every component exceeds 99.9%,
the system only achieves 99.83% because failures compound.
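The product above is trivial to compute, which makes it easy to keep a running availability budget as components are added. A minimal sketch:

```python
import math

def serial_availability(components: list[float]) -> float:
    """Availability of components in series: the product of each one's availability."""
    return math.prod(components)

system = serial_availability([0.9999, 0.9995, 0.9990, 0.9999])
# Every added serial component can only lower the result.
```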

Active-active vs active-passive redundancy:

Active-passive maintains one active instance and one standby that takes over when the active fails. Simple, but the standby capacity is wasted during normal operation and failover introduces a brief interruption.

Active-active maintains multiple active instances, all serving traffic simultaneously. Full utilization of redundant capacity. Failover is seamless — remaining instances absorb the failed instance's traffic. Requires that each instance can handle the full load or that load balancing redistributes appropriately.


The Caching Architecture Decision

Every caching architecture decision involves three questions that must be answered explicitly before implementation.

What to cache:

Cache data that is expensive to compute or retrieve and that is requested frequently enough that the cache hit rate justifies the added complexity. Data that changes with every request or that is unique to every user provides minimal caching benefit.

How long to cache it:

The TTL (time to live) determines the maximum staleness users will observe. Shorter TTLs mean fresher data and more origin requests. Longer TTLs mean more stale data and fewer origin requests. The right TTL is the longest duration during which stale data is acceptable for the specific use case.

How to invalidate it:

TTL-based expiration is simple but delivers stale data for the full TTL duration after an update. Event-based invalidation — deleting or updating cache entries when the underlying data changes — provides fresher data at the cost of more complex invalidation logic. Write-through caching updates the cache on every write, guaranteeing freshness at the cost of slower writes.
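The two invalidation styles can be sketched with a minimal in-memory cache (class and method names are illustrative, not from any specific library): TTL expiration checked lazily on read, plus event-based deletion when the underlying data changes.

```typescript
type Entry<V> = { value: V; expiresAt: number };

class TtlCache<V> {
  private entries = new Map<string, Entry<V>>();

  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key); // TTL expiration: stale entry dropped on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  // Event-based invalidation: call when the underlying data changes,
  // so readers never observe the full TTL of staleness.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

const cache = new TtlCache<string>(60_000); // 60-second TTL
cache.set("user:42", "Ada");
cache.get("user:42");        // "Ada" — served from cache within the TTL
cache.invalidate("user:42"); // underlying row updated: drop the cached copy
cache.get("user:42");        // undefined — next read repopulates from origin
```

Write-through is the third option from above: call `set` inside the write path itself, so the cache is never behind the source of truth.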


Designing the Data Model for Scale

Data model decisions made in the first week of a project can constrain system performance for the next decade. The decisions that have the most impact are not the ones that feel weighty — they are the ordinary-seeming choices that turn out to be irreversible.

The access pattern first principle:

Design data models from access patterns outward, not from entity relationships inward. The relational normalization instinct — model entities and their relationships, then derive access patterns — produces models that are correct from a data integrity standpoint and potentially catastrophic from a performance standpoint.

Wrong approach (schema-first):
  1. Model the entities: User, Order, Product, Category, Review
  2. Define relationships: User has many Orders, Order has many Products...
  3. Normalize to 3NF
  4. Add indexes
  5. Discover at load test that the primary user flow requires a 7-table join

Right approach (access-pattern-first):
  1. List every read and write operation the system will perform
  2. Identify the frequency and performance requirements of each
  3. Design the schema to serve the highest-frequency, highest-stakes operations efficiently
  4. Accept denormalization where it serves access pattern requirements
  5. Use relational integrity for the operations where it matters most

The N-plus-1 pattern — design it out before it appears:

The N-plus-1 problem occurs when an application fetches a list of N records and then makes N additional queries to fetch related data for each record. It is almost always introduced by ORMs used without attention to the SQL they generate.

N+1 pattern that kills performance at scale:

// Fetches 100 orders (1 query)
const orders = await Order.findAll({ limit: 100 });

// Fetches the customer for each order (100 more queries)
const customers = await Promise.all(
  orders.map(order => Customer.findById(order.customerId))
);

// Total: 101 database queries for what should be 1 query with a JOIN

// Correct approach: eager loading
const orders = await Order.findAll({
  limit: 100,
  include: [{ model: Customer, as: 'customer' }]
});
// Total: 1 query with a JOIN

Part 5: Real System Designs — Applying the Patterns

The concepts become concrete through application. The following four system designs illustrate how the patterns combine into production architectures.


Design 1: A URL Shortener

A URL shortener converts long URLs to short codes and redirects users from short codes to original URLs. Despite its apparent simplicity, it is an excellent system design exercise because it touches encoding, databases, caching, and high-read workloads.

Requirements:

  • Shorten a URL and return a short code

  • Redirect from short code to original URL within 10 milliseconds

  • Handle 100 million redirects per day

  • Short codes should be unique and collision-resistant

The encoding decision:

Base62 encoding (a-z, A-Z, 0-9) allows 62 characters per position. A 7-character code gives 62^7 ≈ 3.5 trillion unique combinations. Even at 100 million new URLs per day, this provides roughly 95 years of unique codes.

Short code generation:

Option A: Auto-increment ID encoded to base62
  Simple, no collision risk
  ID 12345 → base62 encode → "dnh"
  Problem: Sequential codes are predictable and guessable

Option B: Random base62 string with collision check
  7 random characters from base62 alphabet
  Check if code exists in database before inserting
  Retry on collision (extremely rare with 3.5T space)
  Better: Codes are not sequential or guessable

Option C: MD5 or SHA-256 hash of URL, take first 7 characters
  Deterministic — same URL always produces same code
  No database check needed for deduplication
  Problem: At 100 million URLs per day, 7-character prefix collisions become likely (birthday bound), so a collision check is still required
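Option A's encoding step can be sketched directly. The alphabet below uses the a-z, A-Z, 0-9 ordering from above, so ID 12345 encodes to "dnh" as in the worked example:

```typescript
const ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

// Encode a non-negative integer ID as a base62 short code by repeated
// division: each remainder selects one character, most significant last.
function toBase62(id: number): string {
  if (id === 0) return ALPHABET[0];
  let code = "";
  while (id > 0) {
    code = ALPHABET[id % 62] + code;
    id = Math.floor(id / 62);
  }
  return code;
}

toBase62(12345); // "dnh"
```

Decoding is the inverse (multiply-and-add over the character indices), which is what makes this option cheap: no database round-trip is needed to produce a code.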

The architecture:

Client → CDN (cache popular redirects at edge)
       → Load Balancer
       → Application Servers (stateless, horizontally scalable)
       → Redis (cache URL mappings for hot short codes — 99% cache hit rate expected)
       → PostgreSQL (source of truth for all URL mappings)
         - Table: urls (id, short_code, original_url, created_at, click_count)
         - Index: unique index on short_code (primary lookup)
         - Index: index on original_url (deduplication check)

The bottleneck analysis:

Redirects are the dominant operation at 100 million per day — roughly 1,160 per second. With 80 percent of traffic going to 20 percent of URLs (Pareto distribution), a Redis cache with LRU eviction will serve the vast majority of redirects from memory with sub-millisecond lookup. Database is only hit for cache misses.


Design 2: A Rate Limiter

Rate limiters prevent API abuse by limiting how frequently a client can make requests. They must be fast (adding minimal latency to every request), accurate (not allowing significantly more than the configured rate), and distributed (working consistently across multiple application servers).

The algorithms:

Token bucket — each client has a bucket with a maximum token count. A token is added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected. Allows bursting up to the bucket size. Smooths out request rates over time.

Fixed window counter — count requests per client per time window (e.g., per minute). Reset the counter at the start of each window. Simple but has an edge case: a client can make double the allowed rate by making requests at the end of one window and the start of the next.

Sliding window log — maintain a log of request timestamps per client. On each request, remove old entries outside the window and check if the count of recent entries exceeds the limit. Accurate but memory-intensive at scale.

Sliding window counter — a hybrid that combines the space efficiency of fixed window with the accuracy of sliding window by interpolating between the current and previous window counts.
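The sliding window counter's interpolation can be sketched in-process (a single-node illustration with hypothetical names; a distributed version would keep the two counters in Redis, as the token bucket design below does):

```typescript
// Sliding window counter: one counter per fixed window, with the previous
// window weighted by how much of it still overlaps the sliding window.
class SlidingWindowCounter {
  private current = 0;   // requests counted in the current fixed window
  private previous = 0;  // requests counted in the prior fixed window
  private windowStart: number;

  constructor(private limit: number, private windowMs: number, now = Date.now()) {
    this.windowStart = now;
  }

  allow(now = Date.now()): boolean {
    // Roll fixed windows forward if time has advanced past the current one.
    while (now - this.windowStart >= this.windowMs) {
      this.previous = this.current;
      this.current = 0;
      this.windowStart += this.windowMs;
    }
    // Interpolate: the previous window contributes in proportion to its
    // remaining overlap with the sliding window ending at `now`.
    const elapsed = (now - this.windowStart) / this.windowMs;
    const estimated = this.previous * (1 - elapsed) + this.current;
    if (estimated >= this.limit) return false;
    this.current += 1;
    return true;
  }
}

const limiter = new SlidingWindowCounter(2, 1000, 0); // 2 requests per second
limiter.allow(0);    // true  — estimated count 0
limiter.allow(1);    // true  — estimated count 1
limiter.allow(2);    // false — estimated count 2 hits the limit
limiter.allow(1500); // true  — previous window now weighted by only 0.5
```

Two numbers per client instead of a full timestamp log is the whole appeal: near-sliding-window accuracy at fixed-window cost.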

The distributed rate limiter design:

Every request passes through the rate limiter before reaching the application:

Client → Load Balancer → Rate Limiter Middleware → Application

Rate Limiter Middleware:
  1. Extract client identifier (API key, user ID, or IP address)
  2. Build Redis key: "rate_limit:{client_id}" (window-based algorithms append the window: "rate_limit:{client_id}:{current_window}")
  3. Execute atomic Redis operations:

     -- Token bucket in Redis
     local key = KEYS[1]
     local max_tokens = tonumber(ARGV[1])
     local refill_rate = tonumber(ARGV[2])
     local now = tonumber(ARGV[3])

     local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
     local tokens = tonumber(bucket[1]) or max_tokens
     local last_refill = tonumber(bucket[2]) or now

     local elapsed = now - last_refill
     tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

     if tokens >= 1 then
       tokens = tokens - 1
       redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
       redis.call('EXPIRE', key, 3600)
       return 1  -- allowed
     else
       return 0  -- rejected
     end

  4. If allowed: pass request to application
  5. If rejected: return 429 Too Many Requests with Retry-After header

Design 3: A Notification System

A notification system dispatches notifications to users through multiple channels — push notifications, email, SMS, in-app — based on user preferences and notification type.

Requirements:

  • Send notifications through multiple channels based on user preferences

  • Handle 10 million notifications per day

  • Guarantee at-least-once delivery

  • Respect user quiet hours and channel preferences

  • Support notification templates and personalization

The architecture:

Event Sources (many services that generate notification triggers)
  → Notification Service API
    → Preference Service (check user channel preferences and quiet hours)
    → Template Service (render notification content from template + user data)
    → Message Queue (durable, at-least-once delivery guarantee)
      → Push Notification Worker → APNs (iOS) / FCM (Android)
      → Email Worker → SendGrid / SES
      → SMS Worker → Twilio
      → In-App Worker → WebSocket connections or polling endpoint
    → Notification Log (record of every notification sent, for deduplication and debugging)

The deduplication requirement:

At-least-once delivery means the same notification may be delivered multiple times if a worker fails and retries. Deduplication logic must prevent users from receiving duplicate notifications. Note the ordering trade-off: marking the key before dispatching means a crash between the two steps suppresses the retry entirely, so careful implementations use a short claim TTL and extend it only after a successful dispatch.

Deduplication using idempotency keys:

Each notification event carries an idempotency_key generated by the producer.
Before sending, the worker checks if this key has been processed:

async function sendNotification(notification: Notification) {
  const dedupKey = `notif:sent:${notification.idempotencyKey}`;

  const alreadySent = await redis.set(
    dedupKey,
    "1",
    "NX",      // Only set if Not eXists
    "EX",      // Expire after
    86400      // 24 hours
  );

  if (!alreadySent) {
    logger.info({ idempotencyKey: notification.idempotencyKey },
      "Duplicate notification skipped");
    return;
  }

  await dispatch(notification);
}

Design 4: A Social Media Feed

A feed system aggregates and ranks content from accounts a user follows. It is one of the most technically interesting system design problems because the naive implementation — query all posts from followed accounts, sort by time — breaks spectacularly at scale.

The problem with naive feed generation:

A user follows 500 accounts. Each account posts an average of 3 times per day. To generate the feed, the system queries 500 account timelines, merges 1,500 posts, sorts them, and returns the top 20. At 100 million users each refreshing their feed 10 times per day, this is 1 billion expensive fan-in operations per day, each touching hundreds of timelines. The database does not survive this.

The two solutions and their trade-offs:

Fan-out on write (push model) — when a user posts, immediately write the post to the feed cache of all their followers.

User A (10K followers) posts:
  → Write post to database
  → For each of 10K followers:
    → Add post ID to follower's feed cache (Redis sorted set by timestamp)

Feed read:
  → Read from user's pre-computed feed cache (fast, single Redis call)
  → Fetch post content for the top N post IDs

Trade-off:
  Pros: Feed reads are extremely fast
  Cons: A celebrity with 50M followers triggers 50M write operations per post
        Celebrities make this model impractical without special handling

Fan-out on read (pull model) — when a user requests their feed, query the timeline of every account they follow in real time.

User requests feed:
  → Fetch list of accounts user follows
  → For each followed account, fetch recent posts
  → Merge and sort all posts
  → Return top N

Trade-off:
  Pros: No fan-out write cost, works for users with any number of followers
  Cons: Feed reads are expensive, latency is high, does not scale for active users

The hybrid model used at scale:

Regular users use fan-out on write — their posts are distributed to followers' caches at write time. Celebrities (accounts above a follower threshold) use fan-out on read — their posts are pulled and merged at read time. This eliminates the catastrophic write amplification of fan-out on write for celebrities while keeping feed reads fast for most users.

Feed generation algorithm:

async function generateFeed(userId: string, page: number) {
  // Get pre-computed feed from cache (covers non-celebrity follows)
  const cachedFeed = await redis.zrevrange(
    `feed:${userId}`,
    page * 20,
    (page + 1) * 20 - 1
  );

  // Get celebrity accounts this user follows
  const celebrities = await getCelebrityFollows(userId);

  // Pull recent celebrity posts in real time
  const celebrityPosts = await Promise.all(
    celebrities.map(c => getRecentPosts(c.id, { limit: 20 }))
  );

  // Merge and sort by timestamp (cached entries are post IDs
  // and must be hydrated to full posts first)
  const allPosts = mergeSortedByTimestamp([
    cachedFeed,
    ...celebrityPosts
  ]);

  // Return top 20
  return allPosts.slice(0, 20);
}

Part 6: The System Design Interview Framework

System design knowledge is tested in engineering interviews. The framework that separates candidates who get offers from candidates who do not is not technical depth — it is structured thinking.

The 45-minute system design interview framework:

Minutes 1 to 5: Requirements clarification

Ask every clarifying question before drawing a single box. What scale? What are the primary use cases? What are the performance requirements? What consistency model is required? What is the expected read-to-write ratio?

Essential questions for any system design problem:
  What is the expected daily active user count?
  What is the expected request volume per second at peak?
  What is the acceptable latency for the primary operations?
  What is the durability requirement — can we lose any data?
  What is the consistency requirement — eventual or strong?
  What is the geographic distribution of users?
  What are the primary read and write operations?

Minutes 5 to 10: Capacity estimation

Back-of-envelope calculations to establish the scale of the problem and inform design choices.

Example: Design a photo sharing system

Assumptions:
  Daily active users: 50 million
  Photos uploaded per day: 5 million (10% of DAU upload daily)
  Average photo size: 3 MB
  Read-to-write ratio: 100:1

Storage requirements:
  5 million photos × 3 MB = 15 TB per day
  15 TB × 365 = 5.5 PB per year
  → Object storage (S3 or equivalent) is required, not a database

Throughput requirements:
  Upload: 5M / 86,400 seconds ≈ 58 uploads per second
  Read: 58 × 100 = 5,800 reads per second
  → Read-heavy, caching and CDN are high priority

Minutes 10 to 20: High-level design

Draw the major components and their connections. Do not go deep on any single component yet. Establish the overall architecture.

Minutes 20 to 35: Deep dive on critical components

The interviewer will direct you toward the most interesting or challenging components. Go deep on those. Discuss trade-offs, failure modes, and alternatives.

Minutes 35 to 45: Identify bottlenecks and scale

Where are the single points of failure? What happens when traffic doubles? How does the design evolve as scale increases? What trade-offs did you make and would you make differently at different scales?


The System Design Mindset

The guide has covered vocabulary, components, patterns, real designs, and interview frameworks. The thread connecting all of it is a mindset — a way of approaching complexity that makes system design tractable rather than overwhelming.

The questions that anchor every system design:

What are the actual performance requirements — not aspirational ones, but the specific numbers that define success and failure?

What are the access patterns — not the data model, but the operations the system will perform and their relative frequency?

Which decisions are reversible and which are not — and am I investing deliberation effort proportionally?

What are the single points of failure and have I addressed the ones whose failure would be unacceptable?

What trade-offs am I making and am I making them explicitly rather than accidentally?

The maturity progression in system design:

Junior engineers design systems to work. They focus on correctness — does the system produce the right output for the given input?

Mid-level engineers design systems to work and perform. They add consideration for scale, latency, and throughput.

Senior engineers design systems to work, perform, and fail gracefully. They design for the failure modes that will occur, not just the happy path.

Principal engineers design systems to work, perform, fail gracefully, and evolve. They make architectural decisions that keep future options open rather than foreclosing them prematurely.

Good system design is not knowing all the right answers. It is asking the right questions in the right order, making trade-offs explicitly, and building systems that are honest about their constraints rather than systems that pretend the constraints do not exist.

Every system is eventually replaced. The systems that serve their purpose well for the longest time are not the most technically sophisticated — they are the most honestly designed. They know what they are, what they are not, what they can handle, and what they cannot. They are built by engineers who asked the right questions before drawing the first box.

Tags: #JavaScript #TypeScript #OpenSource #SoftwareArchitecture #SoftwareEngineering #BackendDevelopment #SystemDesign #DistributedSystems #ScalableArchitecture #DatabaseDesign #APIDesign #SystemDesignInterview #TechInterviews
Girish Sharma

Admin of Online Inter College.
