High-Level System Design: A Complete Guide for Engineers

System design is the process of defining the architecture, components, interfaces, and data flow of a system to satisfy specified requirements. It is where engineering judgment meets real-world constraints -- you are not finding the one correct answer, you are making defensible tradeoffs.

This guide covers how to think about system design, what constraints actually matter, the building blocks every large system uses, and how to work through any design problem from scratch.


How to Approach Any System Design Problem

Before drawing a single box, follow this framework every time. Skipping steps is how you end up designing the wrong system perfectly.

Step 1 -- Clarify Requirements (5 minutes)

Never assume. Ask questions before designing:

Functional requirements -- what must the system do?

  • What are the core features? (must-have vs nice-to-have)
  • Who are the users? (consumers, internal teams, other services)
  • What does success look like?

Non-functional requirements -- how must it behave?

  • Scale: how many users, requests per second, data volume?
  • Availability: 99.9% (8.7 hours downtime/year) or 99.99% (52 minutes)?
  • Latency: p99 under 200ms? Real-time or eventual consistency?
  • Consistency: strong or eventual?
  • Read/write ratio: read-heavy (Twitter timeline) or write-heavy (logging)?

Constraints:

  • Budget (cloud vs bare metal, managed vs self-hosted)
  • Team size and expertise
  • Regulatory (GDPR, HIPAA, PCI-DSS)
  • Existing infrastructure to integrate with

Step 2 -- Estimate Scale (5 minutes)

Back-of-envelope calculations ground your design decisions.

Daily Active Users (DAU):     10 million
Requests per user per day:    20
Total requests/day:           200 million
Requests per second (avg):    200M / 86,400 = ~2,300 RPS
Peak RPS (10x avg):           23,000 RPS

Storage estimate:
  - 1 tweet = 300 bytes text + metadata
  - 500M tweets/day = 150 GB/day
  - 5 years = 150 GB * 365 * 5 = ~274 TB

Bandwidth estimate:
  - Each request serves 1 KB average response
  - 2,300 RPS * 1 KB = 2.3 MB/s ingress
  - If read-heavy (10:1 ratio): 23 MB/s egress
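To make the arithmetic concrete, here is a throwaway Python sketch of the same estimates (all inputs are the illustrative numbers above):

```python
# Back-of-envelope calculator mirroring the estimates above.
SECONDS_PER_DAY = 86_400

def rps(requests_per_day: float) -> float:
    """Average requests per second for a given daily total."""
    return requests_per_day / SECONDS_PER_DAY

dau = 10_000_000
requests_per_user = 20
avg_rps = rps(dau * requests_per_user)            # ~2,300 RPS
peak_rps = avg_rps * 10                           # ~23,000 RPS

tweet_bytes = 300
tweets_per_day = 500_000_000
storage_per_day_gb = tweet_bytes * tweets_per_day / 1e9   # 150 GB/day
five_year_tb = storage_per_day_gb * 365 * 5 / 1e3         # ~274 TB

print(f"avg {avg_rps:,.0f} RPS, peak {peak_rps:,.0f} RPS")
print(f"{storage_per_day_gb:.0f} GB/day, ~{five_year_tb:.0f} TB over 5 years")
```

Being able to reproduce these numbers from first principles is the point; the exact constants matter far less than the orders of magnitude.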

Useful numbers to memorize:

Item                           Approximate size
Tweet / short message          300 bytes
User profile record            1 KB
Metadata row (SQL)             100-500 bytes
Photo (compressed)             200 KB -- 2 MB
Video (1 min, 720p)            50 MB
1 million entries in memory    ~1 GB (rough)
MySQL: rows per server         ~50M comfortably
Redis: entries per instance    ~100M (depends on value size)

Step 3 -- Define the API (5 minutes)

Define the interface before the internals. This forces clarity on what the system actually does.

POST   /tweets              -- create a tweet
GET    /tweets/{id}         -- get a tweet
GET    /users/{id}/timeline -- get user's home timeline
POST   /users/{id}/follow   -- follow a user
GET    /search?q=&limit=    -- search tweets

Step 4 -- High-Level Design (10 minutes)

Draw the major components and their relationships. Start simple, then evolve:

Client --> Load Balancer --> API Servers --> Database
                                 |
                            Cache Layer

Then refine:

Client --> CDN --> Load Balancer --> API Gateway
                                         |
                          ┌──────────────┼──────────────┐
                          |              |              |
                     User Service   Tweet Service   Timeline Service
                          |              |              |
                        SQL DB      SQL DB + S3      Redis Cache
                                                         |
                                                   Message Queue
                                                         |
                                                  Fan-out Workers

Step 5 -- Deep Dive into Critical Components (15 minutes)

Pick the hardest or most interesting parts and go deep:

  • The most write-heavy component
  • The component that must be highly available
  • The component with the hardest consistency requirements
  • The data model

Step 6 -- Identify and Resolve Bottlenecks

For each component ask:

  • What happens if this fails?
  • What happens at 10x current load?
  • Where is the single point of failure?
  • Where is the hotspot (one node getting all traffic)?

Constraints and Tradeoffs

Every design decision is a tradeoff. Understanding these is the difference between junior and senior design.

The CAP Theorem

A distributed system can guarantee at most two of three properties:

  • Consistency (C): Every read receives the most recent write
  • Availability (A): Every request receives a response (not guaranteed to be latest)
  • Partition Tolerance (P): System continues operating when network partitions occur

Since network partitions are inevitable in distributed systems, you are always choosing between CP or AP:

System                            Choice             Reasoning
Banking, financial transactions   CP                 Wrong balance is worse than downtime
Social media feeds                AP                 Showing slightly stale content is fine
DNS                               AP                 Availability matters more than immediate propagation
HBase                             CP                 Strong consistency for analytics
Cassandra                         AP                 Tunable consistency, defaults to AP
DynamoDB                          AP (configurable)  Eventual consistency by default

PACELC Extension

PACELC extends CAP: even when there is no partition, you choose between latency (L) and consistency (C):

  • Low latency reads often mean serving from replicas (eventual consistency)
  • Strong consistency requires coordination overhead (higher latency)

Latency Numbers Every Engineer Should Know

L1 cache reference:              0.5 ns
L2 cache reference:              7 ns
Main memory reference:           100 ns
Read 1 MB from memory:           250 microseconds
SSD random read:                 150 microseconds
SSD sequential read (1 MB):      1 ms
HDD random seek:                 10 ms
Network: same datacenter:        0.5 ms
Network: cross-region:           30-100 ms
Network: cross-continent:        100-300 ms

Practical implications:

  • A database query with a disk seek: ~10ms
  • In-memory cache hit: ~0.1ms -- 100x faster
  • Cross-region database replication lag: 30-100ms
  • This is why caching exists: to serve 100x faster from memory

The Four Scalability Dimensions

Scale Up (Vertical)      -- bigger machine, more RAM/CPU
Scale Out (Horizontal)   -- more machines, distribute load
Scale Data (Sharding)    -- split data across multiple DBs
Scale Reads (Replication)-- multiple read replicas

Core Building Blocks

Load Balancers

A load balancer distributes incoming traffic across multiple servers.

Algorithms:

  • Round Robin -- requests go to servers in turn (simple, works well when servers are equal)
  • Weighted Round Robin -- larger servers get more traffic
  • Least Connections -- route to server with fewest active connections (better for long-lived connections)
  • IP Hash -- same client always goes to same server (sticky sessions, useful for stateful apps)
  • Consistent Hashing -- minimal redistribution when servers are added/removed

Layer 4 vs Layer 7:

  • L4 (transport): routes based on IP and TCP port -- fast, no content inspection
  • L7 (application): routes based on URL, headers, cookies -- slower but smarter

Health checks: Load balancers should actively probe backends and remove unhealthy instances.

When to use:

  • Any service with multiple instances needs a load balancer
  • Place one in front of API servers, another in front of database read replicas

DNS and CDN

DNS maps domain names to IP addresses. For large systems:

  • Use multiple A records for basic round-robin load distribution
  • Use Anycast routing to direct users to geographically closest datacenter
  • GeoDNS routes users to the nearest region

CDN (Content Delivery Network) caches static content at edge servers close to users:

  • Reduces latency for images, CSS, JS, videos
  • Absorbs read traffic, protecting origin servers
  • Two modes:
    • Push CDN: you proactively push content to edge (good for predictable large files)
    • Pull CDN: edge fetches from origin on first request, caches for subsequent (good for large catalogs)

When to use CDN: any static or semi-static content (images, videos, HTML, JS), or API responses that can tolerate slight staleness.

Caching

Caching stores expensive computation or data retrieval results closer to the consumer.

Cache placement:

  • Client-side: browser cache, mobile app cache
  • CDN: geographic edge cache
  • Application-level: in-process cache (Caffeine, Guava for JVM)
  • Distributed cache: Redis, Memcached (shared across service instances)
  • Database query cache: avoid for write-heavy systems

Cache strategies:

Cache-Aside (Lazy Loading):
  1. App checks cache
  2. Cache miss: app reads from DB
  3. App writes result to cache
  4. Next request: cache hit
  -- Con: first request is slow, cache can be stale

Write-Through:
  1. App writes to cache
  2. Cache synchronously writes to DB
  -- Pro: cache always fresh
  -- Con: every write hits both, higher write latency

Write-Behind (Write-Back):
  1. App writes to cache only
  2. Cache asynchronously writes to DB (batched)
  -- Pro: very fast writes
  -- Con: data loss if cache fails before flush

Read-Through:
  Cache sits in front of DB
  App always talks to cache
  Cache handles DB fetching on miss
  -- Good for read-heavy with cache library support

Cache eviction policies:

  • LRU (Least Recently Used): evict the item not used for longest time -- most common
  • LFU (Least Frequently Used): evict the item used least often
  • TTL (Time To Live): evict after a fixed duration -- simplest, good for session data
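LRU is simple enough to sketch in a few lines. A minimal in-process version in Python (real distributed caches like Redis use approximated LRU, but the idea is the same):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is now most recently used
cache.put("c", 3)     # capacity exceeded: evicts "b", not "a"
```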

Cache invalidation strategies:

  • TTL-based: entries expire after N seconds -- simple, may serve stale data
  • Event-driven: write to DB triggers cache invalidation -- consistent, more complex
  • Cache-aside with short TTL: common compromise

Thundering herd / cache stampede: When many requests arrive simultaneously after a cache entry expires, they all hit the database. Solutions:

  • Add jitter to TTLs (randomize expiry time)
  • Lock-based cache population (only one request fetches, others wait)
  • Probabilistic early expiration
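The first two mitigations compose naturally with cache-aside. A sketch in Python, assuming an in-process cache for illustration (in production the cache and the per-key lock would live in Redis or similar):

```python
import random
import threading
import time

_cache: dict = {}        # key -> (value, expires_at)
_locks: dict = {}        # key -> lock, for single-flight population
_locks_guard = threading.Lock()

def get_with_stampede_protection(key, load_from_db, ttl=60):
    """Cache-aside read with jittered TTL and lock-based population.

    Only one caller per key fetches from the DB on a miss; the jitter
    spreads expiry times so entries don't all lapse at the same moment.
    """
    entry = _cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]                      # fresh cache hit

    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())

    with lock:                               # only one fetcher per key
        entry = _cache.get(key)              # re-check: another thread may have filled it
        if entry and entry[1] > time.monotonic():
            return entry[0]
        value = load_from_db(key)
        jittered_ttl = ttl * random.uniform(0.9, 1.1)
        _cache[key] = (value, time.monotonic() + jittered_ttl)
        return value
```

The re-check after acquiring the lock is the essential detail: without it, every waiting request would redo the database fetch the moment the lock freed up.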

Databases

Choosing the right database is one of the most consequential design decisions.

Relational Databases (SQL)

Use when:

  • Data has clear relationships with foreign keys
  • You need ACID transactions (money transfers, inventory)
  • Schema is relatively stable
  • Complex queries with joins are needed

Popular choices: PostgreSQL, MySQL, MariaDB

Scaling patterns:

Read Replicas:
  Primary (writes) --> Replica 1 (reads)
                   --> Replica 2 (reads)
                   --> Replica 3 (reads)
  -- Handles read-heavy workloads
  -- Replication lag means replicas may be slightly behind

Vertical Sharding (by function):
  Users DB | Orders DB | Products DB
  -- Different tables on different servers
  -- Simple but limited

Horizontal Sharding (by data):
  Shard by user_id % N:
  Shard 0: users 0, 3, 6, 9...
  Shard 1: users 1, 4, 7, 10...
  Shard 2: users 2, 5, 8, 11...
  -- Scales writes
  -- Cross-shard queries are hard

Sharding keys -- choosing well:

  • Good: user_id (spreads load evenly, most queries are per-user)
  • Bad: timestamp (new data all goes to one shard -- "hotspot")
  • Bad: geographic region (uneven distribution)

Consistent hashing for sharding: Maps keys to a ring, minimising redistribution when nodes are added or removed. Used by Cassandra, DynamoDB, and many cache systems.

NoSQL Databases

Type         Examples               Use When
Key-Value    Redis, DynamoDB        Session data, caching, simple lookups
Document     MongoDB, CouchDB       Semi-structured data, flexible schema
Wide-Column  Cassandra, HBase       Time-series, write-heavy, huge scale
Graph        Neo4j, Amazon Neptune  Relationships are the primary query (social graph, recommendations)

When to choose NoSQL:

  • Schema evolves frequently
  • Massive write scale (Cassandra handles 100K+ writes/sec per node)
  • Data is naturally document-shaped (user profiles, product catalogs)
  • You need geographic distribution built-in
  • Joins are rare or never needed

Cassandra specifics (common in large systems):

  • Designed for write-heavy workloads
  • Tunable consistency (ONE, QUORUM, ALL)
  • Data model: rows identified by partition key, sorted by clustering key
  • Anti-patterns: large partitions, allowing tombstones to accumulate

NewSQL

Attempts to combine relational semantics with NoSQL scalability:

  • CockroachDB, Google Spanner, TiDB, YugabyteDB
  • Distributed ACID transactions
  • Good for: globally distributed applications needing strong consistency

Message Queues and Event Streaming

Asynchronous communication decouples services, absorbs traffic spikes, and enables fan-out.

Message Queue (point-to-point): Each message consumed by one consumer. Good for task distribution.

  • Examples: RabbitMQ, ActiveMQ, AWS SQS

Event Streaming (pub/sub): Messages retained and can be consumed by multiple consumers independently.

  • Examples: Apache Kafka, AWS Kinesis, Google Pub/Sub

When to use:

Use a queue when:
  - Email sending after user registration
  - Image processing after upload
  - Order fulfillment after purchase
  - Any "fire and forget" task

Use event streaming when:
  - Multiple services react to the same event
  - You need event replay (audit log, rebuilding state)
  - Real-time analytics pipeline
  - Change data capture (CDC) from database

Kafka key concepts:

  • Topic: named stream of records
  • Partition: topics split into partitions for parallelism
  • Offset: position of a consumer in a partition
  • Consumer Group: set of consumers sharing partition assignments
  • Retention: Kafka stores messages for configurable period (default 7 days)

Guarantees:

  • At most once: messages may be lost, never duplicated (fast, acceptable for metrics)
  • At least once: messages never lost, may be duplicated (most common -- handle idempotency)
  • Exactly once: never lost, never duplicated (slowest, available in Kafka with transactions)
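"Handle idempotency" for at-least-once delivery usually means deduplicating on a message ID. A minimal consumer sketch in Python (in production the seen-ID set would live in Redis or the database, not process memory):

```python
# With at-least-once delivery the same message can arrive twice,
# so the consumer records processed message IDs and skips duplicates.
processed_ids: set = set()

def handle_message(message: dict, side_effect) -> bool:
    """Run side_effect at most once per message ID; return True if work ran."""
    msg_id = message["id"]
    if msg_id in processed_ids:
        return False                  # duplicate delivery: safely ignored
    side_effect(message["payload"])
    processed_ids.add(msg_id)
    return True
```

Note the ordering caveat: if the process crashes between the side effect and recording the ID, the message is reprocessed on redelivery, so the side effect itself should also be safe to repeat.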

API Design

REST vs GraphQL vs gRPC

             REST                      GraphQL                           gRPC
Protocol     HTTP/1.1                  HTTP/1.1                          HTTP/2
Data format  JSON/XML                  JSON                              Protobuf (binary)
Typing       Loose                     Strong                            Strong
Fetching     Fixed endpoints           Client specifies fields           Generated stubs
Best for     Public APIs, simple CRUD  Mobile apps, variable data needs  Internal microservices
Performance  Good                      Good                              Excellent

REST design principles:

  • Use nouns for resources, not verbs (/users, not /getUser)
  • Use HTTP methods correctly (GET read, POST create, PUT replace, PATCH update, DELETE remove)
  • Return appropriate status codes (200, 201, 400, 401, 403, 404, 429, 500)
  • Version your API (/v1/users)
  • Use pagination for collections (?page=2&limit=50 or cursor-based)
  • Be consistent with naming conventions

API Gateway pattern: Single entry point for all clients. Handles:

  • Authentication and authorization
  • Rate limiting and throttling
  • Request routing to microservices
  • SSL termination
  • Request/response transformation
  • Analytics and monitoring

Rate Limiting

Protects services from overload and abuse.

Algorithms:

Token Bucket:
  - Bucket holds N tokens
  - Each request consumes one token
  - Tokens refill at fixed rate
  - Allows bursts up to bucket size
  - Used by: AWS, Stripe

Leaky Bucket:
  - Requests enter a queue
  - Queue drains at fixed rate
  - Smooths out bursts
  - Used by: traffic shaping

Fixed Window Counter:
  - Count requests in time window
  - Reset at window boundary
  - Problem: burst at window boundary (2x limit possible)

Sliding Window Log:
  - Store timestamp of each request
  - Count requests in sliding window
  - Accurate but memory-intensive

Sliding Window Counter:
  - Hybrid of fixed window and sliding log
  - Good accuracy, low memory
  - Used by: Cloudflare
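The token bucket is the easiest of these to implement from scratch. A single-process sketch in Python (a distributed version would keep the token count and timestamp in Redis):

```python
import time

class TokenBucket:
    """Token-bucket limiter: allows bursts up to `capacity`, refills steadily."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)        # bucket starts full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
burst = [bucket.allow() for _ in range(6)]   # first 5 pass, 6th is rejected
```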

Where to enforce rate limits:

  • API Gateway (per user, per IP, per API key)
  • Service level (per tenant)
  • Redis for distributed rate limiting state

Consistent Hashing

Standard hash: server = hash(key) % N -- adding/removing a server remaps ~(N-1)/N of keys.

Consistent hashing: server = first server clockwise from hash(key) on ring -- adding/removing a server remaps only 1/N of keys on average.

Used in: CDN routing, distributed caches (Redis Cluster), Cassandra, DynamoDB, load balancers.

Virtual nodes: each physical server gets multiple positions on the ring, improving load distribution.
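A minimal ring with virtual nodes can be sketched in Python (MD5 and 100 vnodes are arbitrary illustrative choices; real systems vary):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; a key maps to the first node clockwise."""

    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []          # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        for i in range(self.vnodes):         # each server gets many ring positions
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        h = self._hash(key)
        i = bisect.bisect_left(self._ring, (h,))   # first point at or past h
        return self._ring[i % len(self._ring)][1]  # wrap around the ring
```

Removing a node only remaps the keys that were on it; everything else keeps its assignment, which is exactly the property `hash(key) % N` lacks.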


Common System Design Patterns

Fan-Out

When one write must propagate to many places.

Fan-out on write (push model): When a user tweets, immediately update all followers' timelines.

  • Pro: fast reads (timeline pre-computed)
  • Con: slow writes, massive writes for celebrities (Justin Bieber problem)
  • Good for: users with small follower counts

Fan-out on read (pull model): When a user opens their timeline, fetch tweets from all followed accounts.

  • Pro: writes are fast
  • Con: slow reads (must aggregate at query time)
  • Good for: celebrity accounts

Hybrid approach (used by Twitter):

  • Regular users: fan-out on write
  • Celebrity accounts (>threshold followers): fan-out on read
  • Timeline assembly merges both
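The merge step can be sketched in a few lines. This assumes Snowflake-style time-sortable IDs, so merging by ID descending is merging by recency (the function and input shapes here are illustrative, not Twitter's actual implementation):

```python
import heapq
from itertools import islice

def assemble_timeline(precomputed, celebrity_feeds, limit=20):
    """Merge the pushed timeline with pulled celebrity feeds, newest first.

    All inputs are lists of time-sortable tweet IDs in descending order.
    heapq.merge streams the k-way merge lazily, so islice stops early
    instead of materializing every followed feed.
    """
    merged = heapq.merge(precomputed, *celebrity_feeds, reverse=True)
    return list(islice(merged, limit))

# Pushed timeline [9, 5, 1] merged with two pulled celebrity feeds:
assemble_timeline([9, 5, 1], [[8, 4], [7, 2]], limit=5)   # [9, 8, 7, 5, 4]
```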

CQRS (Command Query Responsibility Segregation)

Separate the write model (commands) from the read model (queries):

Write side:  Client --> Command Handler --> Write DB
                                               |
                                         Event Bus
                                               |
Read side:   Client <-- Query Handler  <-- Read DB (denormalized)

When to use:

  • Read and write workloads have very different scale requirements
  • Read models need to be optimized differently (denormalized, different indexes)
  • Event sourcing architectures

Event Sourcing

Instead of storing current state, store a log of all events:

Traditional: Users table row: {id: 1, balance: 150}

Event Sourcing: Event log:
  {type: "AccountOpened", amount: 200}
  {type: "Withdrew", amount: 50}
  -- current state derived by replaying events

Benefits: complete audit trail, ability to replay history, temporal queries. Costs: more complex, eventual consistency, storage grows unbounded.

Saga Pattern

For distributed transactions across multiple services without two-phase commit:

Order Service:
  1. Create order (pending)      -- if fails, rollback
  2. Reserve inventory           -- if fails, cancel order
  3. Charge payment              -- if fails, release inventory + cancel order
  4. Update order to confirmed   -- if fails, refund + release inventory + cancel order

Each step has a compensating transaction for rollback.

Choreography saga: services react to events (no central coordinator, harder to debug)
Orchestration saga: a central orchestrator commands each step (easier to understand)

Strangler Fig Pattern

For migrating a monolith to microservices incrementally:

1. New requests --> Router --> Monolith (all traffic)
2. Build new service for one feature
3. New requests --> Router --> New Service (feature X)
                           --> Monolith (everything else)
4. Repeat until monolith is gone

Bulkhead Pattern

Isolate failures so one failing component does not take down others:

Without bulkheads:
  API Server (one thread pool for all downstream calls)
  If Recommendation Service slows down → thread pool exhausted → everything fails

With bulkheads:
  API Server with separate thread pools:
  Pool A (20 threads) → User Service
  Pool B (20 threads) → Product Service
  Pool C (5 threads)  → Recommendation Service (can fail safely)

Circuit Breaker

Prevents cascading failures when a downstream service is degraded:

States:
  CLOSED  → requests pass through, failures counted
  OPEN    → requests fail fast (no downstream call), reset timer starts
  HALF-OPEN → one test request, if success → CLOSED, if fail → OPEN

Thresholds:
  Open when: failure rate > 50% in last 10 requests
  Half-open after: 30 seconds

Libraries: Resilience4j (Java), Polly (.NET), Hystrix (deprecated), Istio (service mesh).
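The state machine above is small enough to sketch directly. A simplified Python version using the example thresholds (production libraries add sliding time windows, per-exception policies, and metrics):

```python
import time

class CircuitBreaker:
    """Sketch of the state machine above: CLOSED -> OPEN -> HALF-OPEN."""

    def __init__(self, failure_threshold=0.5, window=10, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.window = window                 # judge the last N calls
        self.reset_after = reset_after       # seconds before trying HALF-OPEN
        self.results = []                    # recent outcomes, True = success
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF-OPEN"         # timer elapsed: allow one probe
        try:
            result = fn(*args)
        except Exception:
            self._record(False)
            raise
        self._record(True)
        return result

    def _record(self, ok: bool):
        if self.state == "HALF-OPEN":        # probe decides the next state
            self.state = "CLOSED" if ok else "OPEN"
            if not ok:
                self.opened_at = time.monotonic()
            self.results = []
            return
        self.results = (self.results + [ok])[-self.window:]
        failures = self.results.count(False)
        if (len(self.results) >= self.window
                and failures / len(self.results) > self.failure_threshold):
            self.state = "OPEN"
            self.opened_at = time.monotonic()
```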


Data Design Patterns

Denormalization for Read Performance

Normalized (good for writes, slow for reads):

SELECT u.name, t.text, COUNT(l.id) as likes
FROM tweets t
JOIN users u ON t.user_id = u.id
LEFT JOIN likes l ON t.id = l.tweet_id
WHERE u.id = 123
GROUP BY t.id, u.name, t.text;

Denormalized (fast reads, complex writes):

tweets_with_author table:
{ tweet_id, text, author_name, author_avatar, like_count }
-- No join needed, just query by user_id index

When to denormalize: read-to-write ratio > 10:1, query latency is critical.

Time-Series Data

Characteristics: append-only, queries are time-range based, high write rate.

Patterns:

  • Partition by time window (daily/hourly tables)
  • Store raw events + pre-aggregated rollups
  • Use columnar storage for compression

Dedicated databases: InfluxDB, TimescaleDB (PostgreSQL extension), VictoriaMetrics, Amazon Timestream.

Retention tiers:

Raw data:     keep 7 days (high precision, high volume)
1-min rollup: keep 30 days
1-hour rollup: keep 1 year
1-day rollup:  keep forever

Distributed Unique ID Generation

Auto-increment IDs do not work in distributed systems (ID conflicts across shards).

Options:

UUID v4:
  - Truly random, no coordination needed
  - 128 bits, not sortable, large index size

Twitter Snowflake:
  - 64-bit integer
  - 41 bits: timestamp (ms precision, 69 year span)
  - 10 bits: machine ID
  - 12 bits: sequence within millisecond
  - Roughly time-sortable

Instagram-style:
  - 64-bit
  - 41 bits: timestamp
  - 13 bits: shard ID
  - 10 bits: sequence
  - Encodes the shard, simplifies routing

Database-assigned (with segments):
  - Central ID server issues segments (e.g., IDs 1001-2000)
  - Each service burns through its segment locally
  - Requests new segment when exhausted
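The Snowflake layout translates directly into bit arithmetic. A single-node sketch in Python using Twitter's published field widths (the epoch constant is Twitter's custom epoch; any fixed epoch works):

```python
import threading
import time

class SnowflakeGenerator:
    """Snowflake-style IDs: 41-bit ms timestamp | 10-bit machine | 12-bit sequence."""

    EPOCH_MS = 1_288_834_974_657   # Twitter's custom epoch (Nov 2010)

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024        # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit wrap
                if self.sequence == 0:       # 4096 IDs this ms: wait for the next
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return ((now_ms - self.EPOCH_MS) << 22) \
                   | (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, sorting IDs numerically sorts them by creation time, which is the "roughly time-sortable" property above.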

Availability and Reliability

SLA, SLO, SLI

SLI (Service Level Indicator): a metric
  -- "request success rate" = successful requests / total requests

SLO (Service Level Objective): target value for an SLI
  -- "99.9% of requests succeed in a rolling 30-day window"

SLA (Service Level Agreement): contract with consequences
  -- "if SLO is missed, customer gets 10% credit"

Error budget: how much downtime your SLO allows.

99.9% availability  → 8.7 hours downtime/year   (43 minutes/month)
99.95%              → 4.4 hours/year             (22 minutes/month)
99.99%              → 52 minutes/year            (4 minutes/month)
99.999%             → 5 minutes/year             (26 seconds/month)
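The table is just one formula applied four times; a small Python sketch makes it reproducible for any target:

```python
# Downtime budget implied by an availability target, as in the table above.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year allowed by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.95, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):,.1f} min/year")
```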

Replication Strategies

Single-leader (master-replica):
  - One primary accepts writes
  - Replicas receive changes asynchronously
  - Read from replicas (may be slightly stale)
  - Used by: MySQL, PostgreSQL, MongoDB

Multi-leader:
  - Multiple nodes accept writes
  - Changes sync to each other
  - Conflict resolution needed
  - Used by: Google Docs, CouchDB
  - Good for: multi-datacenter with low cross-region write latency

Leaderless (quorum-based):
  - Any node accepts writes
  - Write to W nodes, read from R nodes
  - W + R > N = guaranteed overlap (strong consistency)
  - W + R <= N = eventual consistency (higher availability)
  - Used by: Cassandra (QUORUM), DynamoDB

Replication factor: 3 replicas is the common default across most systems.

Disaster Recovery

RTO (Recovery Time Objective):    maximum acceptable downtime
RPO (Recovery Point Objective):   maximum acceptable data loss

Strategy options:

Backup and Restore:   RTO: hours    RPO: hours    Cost: low
Pilot Light:          RTO: 10 min   RPO: minutes  Cost: medium
Warm Standby:         RTO: minutes  RPO: seconds  Cost: medium-high
Multi-site Active:    RTO: seconds  RPO: ~zero    Cost: high

Microservices vs Monolith

When to Use a Monolith

  • Team is small (< 10 engineers)
  • System is new -- requirements are still changing
  • Deployment simplicity matters
  • Low traffic (< 10K RPS)

When to Migrate to Microservices

  • Independent scaling needed (checkout service needs 10x more capacity than user service)
  • Independent deployment (different teams, different release cadences)
  • Technology heterogeneity needed
  • Clear domain boundaries

Microservices Communication

Synchronous (request-response):
  - REST, gRPC
  - Simple, immediate response
  - Tight coupling: caller waits, failure propagates

Asynchronous (event-driven):
  - Message queues, Kafka
  - Loose coupling, resilience
  - More complex, eventual consistency

Service discovery:

  • Client-side: client queries service registry (Eureka, Consul), then calls service
  • Server-side: client calls load balancer, which queries registry

API Gateway vs Service Mesh

API Gateway:
  - North-south traffic (external clients to services)
  - Authentication, rate limiting, routing, SSL
  - Single ingress point

Service Mesh (Istio, Linkerd, Consul Connect):
  - East-west traffic (service to service)
  - Mutual TLS, circuit breaking, retries, observability
  - Sidecar proxy per service pod

Observability

You cannot debug what you cannot see.

The Three Pillars

Metrics:  aggregated numeric data over time
  -- Request rate, error rate, latency, CPU, memory
  -- Tools: Prometheus, Datadog, CloudWatch

Logs:     discrete event records
  -- Application logs, access logs, audit logs
  -- Tools: ELK stack, Loki + Grafana, Splunk

Traces:   end-to-end request path across services
  -- Shows where time is spent across microservices
  -- Tools: Jaeger, Zipkin, AWS X-Ray, Tempo

The RED Method (for services)

  • Rate: requests per second
  • Errors: error rate
  • Duration: request latency distribution (p50, p95, p99)

The USE Method (for resources)

  • Utilization: percentage time busy
  • Saturation: queue length, how much extra work waiting
  • Errors: error count

SLO Alerting

Alert on symptoms, not causes:

Bad:  "CPU > 80%" -- CPU high doesn't always mean users are impacted
Good: "p99 latency > 500ms" -- users are definitely experiencing this

Bad:  "Disk usage > 70%"
Good: "Error rate > 1%" or "time to fill disk < 4 hours"

Security Considerations

Authentication and Authorization

Authentication: who are you?
  -- Username/password, OAuth2, SAML, magic link

Authorization: what can you do?
  -- RBAC (role-based), ABAC (attribute-based), ACL

Token types:
  Session token: server-side state, opaque string
    + Easy to revoke, + Less data exposure
    - Does not scale without shared storage (Redis)

  JWT (JSON Web Token): self-contained, stateless
    + No server-side storage needed, + Scales easily
    - Cannot be revoked (use short TTL + refresh tokens)
    - Larger than session token

Recommended JWT approach:
  Access token: short TTL (15 minutes), used for API calls
  Refresh token: longer TTL (30 days), stored httpOnly cookie
  On expiry: use refresh token to get new access token

Common Security Patterns

Defense in depth: multiple security layers (network, application, data).

Zero trust: never trust, always verify. Even internal service-to-service calls are authenticated.

Secrets management: never hardcode secrets. Use Vault, AWS Secrets Manager, or Kubernetes Secrets (with encryption at rest).

SQL injection prevention: use parameterized queries / prepared statements, never string concatenation.

Rate limiting: prevent brute force, enumeration, and DoS.

Input validation: validate at the API layer, never trust client input.


Designing Specific Systems -- Worked Examples

Design a URL Shortener (bit.ly)

Requirements:

  • Shorten a long URL to a short code
  • Redirect short URL to original
  • 100M URLs created/day, 10B redirects/day

Estimates:

Writes: 100M/day = 1,160 writes/sec
Reads:  10B/day  = 115,740 reads/sec
Read:Write ratio = 100:1 (read-heavy)
Storage: 500 bytes per URL * 100M/day * 5 years = ~90 TB

API:

POST /api/shorten   body: {long_url}  --> returns {short_url}
GET  /{code}                          --> 301/302 redirect

Short code generation:

Option 1: hash(long_url) + take 7 chars
  - Problem: truncated hashes collide for different URLs, so collision
    handling is needed (same URL always mapping to the same code can be
    a feature or a bug, depending on requirements)

Option 2: base62 encode a unique integer ID
  - 7 chars of base62 (a-z, A-Z, 0-9) = 62^7 = 3.5 trillion codes
  - Generate unique ID from distributed counter or Snowflake
  - Encode as base62 string
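Base62 encoding is a short loop. A sketch in Python (the digits-then-letters alphabet order is an arbitrary choice; any fixed 62-character alphabet works as long as encode and decode agree):

```python
import string

# 0-9, a-z, A-Z: 62 symbols total
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer ID as a base62 short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def base62_decode(code: str) -> int:
    """Invert base62_encode back to the integer ID."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```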

Architecture:

Write path:
  Client --> API Server --> ID Generator --> DB (URL table)
                       --> Cache (short:long mapping)

Read path:
  Client --> Load Balancer --> API Server
                 |                |
                CDN            Redis Cache (hit: 99%+)
                               DB (miss: 1%)

Database schema:

CREATE TABLE urls (
  id          BIGINT PRIMARY KEY,
  short_code  VARCHAR(8) UNIQUE NOT NULL,
  long_url    TEXT NOT NULL,
  user_id     BIGINT,
  created_at  TIMESTAMP DEFAULT NOW(),
  expires_at  TIMESTAMP,
  click_count BIGINT DEFAULT 0
);

CREATE INDEX idx_short_code ON urls(short_code);

Redirect status code: HTTP 301 (permanent -- browsers cache it and skip your server on repeat clicks) vs 302 (temporary -- every click hits the server, which is better for analytics).


Design a Chat System (WhatsApp)

Requirements:

  • 1:1 and group messaging
  • Online/offline status
  • Message delivery receipts (sent, delivered, read)
  • 500M DAU, 100 billion messages/day

Estimates:

Messages: 100B/day = 1.15M messages/sec
Storage: 100B * 100 bytes avg = 10 TB/day

Key design decisions:

WebSockets for real-time:

HTTP polling:  client polls every N seconds -- high latency, wasteful
Long polling:  client holds connection open -- better but complex
WebSocket:     bidirectional persistent connection -- best for chat
Server-Sent Events: one-way server push -- good for notifications

Message storage and delivery:

Online user flow:
  Sender --> Chat Server A --> Chat Server B --> Receiver

Offline user flow:
  Sender --> Chat Server --> Message Queue --> Push Notification Service
                         --> Message DB (store for delivery when online)

Message schema:

messages:
  message_id   BIGINT (Snowflake ID -- time sortable)
  from_user_id BIGINT
  to_id        BIGINT (user_id for 1:1, group_id for group)
  content      TEXT
  type         ENUM(text, image, video, audio)
  status       ENUM(sent, delivered, read)
  created_at   TIMESTAMP

Group messaging fan-out:

Small groups (< 100 members):
  Fan-out on write: message copied to each member's inbox on send

Large groups (> 100 members):
  Fan-out on read: single message stored, members pull on connect

WhatsApp uses: single copy + delivery tracking per member

Connection management:

  • Use a presence service (Redis) to track which WebSocket server each user is connected to
  • Chat server queries presence service to route messages

Design a Distributed Cache (Redis-like)

Core operations: GET, SET, DELETE, EXPIRE

Data structures: string, list, set, hash, sorted set

Architecture:

Clients --> Hash to shard (consistent hashing) --> Cache Node

Single node per shard:
  - Simple but no HA

Primary-replica per shard:
  - Primary handles writes
  - Replicas handle reads + failover

Redis Cluster:
  - 16,384 hash slots distributed across nodes
  - Each key maps to a slot: slot = CRC16(key) % 16384
  - Gossip protocol for node health

Eviction when full:

noeviction:      reject new writes
allkeys-lru:     evict any key using LRU
volatile-lru:    evict TTL keys using LRU (keep non-TTL keys)
allkeys-random:  evict random key
volatile-ttl:    evict key with shortest TTL first

Design a Search Autocomplete

Requirements:

  • Show top 5 suggestions as user types
  • < 100ms latency
  • 10M DAU, ~10 queries per session

Key data structure: Trie (prefix tree) or inverted index

Trie approach:

Words: apple, app, application, apply

Trie nodes (apple, apply, and application share the "appl" branch):
  root
    a
      p
        p (end: "app")
          l
            e (end: "apple")
            y (end: "apply")
            i
              c
                a
                  t
                    i
                      o
                        n (end: "application")

Searching "app" traverses to the "app" node, then enumerates every word in its subtree
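A minimal working version of that traversal in Python (enough to demonstrate the structure; not tuned for scale):

```python
class TrieNode:
    def __init__(self):
        self.children: dict = {}
        self.is_word = False

class AutocompleteTrie:
    """Minimal trie: insert words, then list completions for a prefix."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def suggest(self, prefix: str, limit: int = 5):
        node = self.root
        for ch in prefix:                  # walk down to the prefix node
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        def dfs(n, path):                  # collect words under the prefix
            if len(results) >= limit:
                return
            if n.is_word:
                results.append(prefix + path)
            for ch in sorted(n.children):
                dfs(n.children[ch], path + ch)
        dfs(node, "")
        return results

trie = AutocompleteTrie()
for w in ("app", "apple", "application", "apply"):
    trie.insert(w)
trie.suggest("app")   # ["app", "apple", "application", "apply"]
```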

At scale:

  • Pre-compute top N suggestions per prefix
  • Store in Redis: prefix:app --> ["apple", "application", "apply", "apps", "append"]
  • Build from query logs -- most searched terms surface as suggestions
  • Update asynchronously (daily batch job, not real-time)
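The batch job that builds the prefix table is straightforward. This sketch assumes the query log fits in memory; at real scale it would be a MapReduce/Spark job whose output is bulk-loaded into Redis, and the function name and parameters are illustrative:

```python
from collections import Counter, defaultdict

def build_suggestions(query_log: list[str], n: int = 5,
                      max_prefix: int = 10) -> dict[str, list[str]]:
    # Count how often each query was searched.
    counts = Counter(q.lower() for q in query_log)

    # For every prefix of every query, record (frequency, query).
    by_prefix: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for query, freq in counts.items():
        for i in range(1, min(len(query), max_prefix) + 1):
            by_prefix[query[:i]].append((freq, query))

    # Keep the n most-searched completions per prefix
    # (ties broken alphabetically).
    return {
        prefix: [q for _, q in sorted(cands, key=lambda t: (-t[0], t[1]))[:n]]
        for prefix, cands in by_prefix.items()
    }
```

Capping prefix length bounds the table size; very long prefixes are rare enough to fall back to the trie or a live index.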

Filtering offensive suggestions:

  • Blocklist applied before serving results
  • Updated independently of main index

Interview Strategy

How to Perform Well in System Design Interviews

First 2 minutes -- clarify, do not design:

"Before I start designing, let me ask a few questions to make sure
I understand the scope..."

Think out loud. Interviewers want to see your reasoning process, not just the answer.

State your assumptions explicitly:

"I'll assume this is a global system with users in multiple regions.
I'll assume we need 99.99% availability.
I'll assume reads vastly outnumber writes."

Start simple, then evolve: Start with a single-server design that works. Then identify bottlenecks and add complexity only where needed.

Show tradeoff awareness:

"We could use Redis for the session store. The tradeoff is that
Redis adds operational complexity and is a potential single point
of failure -- we'd need Redis Cluster or Sentinel. Alternatively
we could use sticky sessions at the load balancer, which is simpler
but makes rolling deployments harder."

Numbers matter: estimating scale and sizing components demonstrates engineering judgment.

Common things interviewers look for:

  • Do you ask clarifying questions or just start drawing?
  • Can you estimate scale and let it drive design decisions?
  • Do you understand tradeoffs (not just which tech to use)?
  • Can you identify bottlenecks and single points of failure?
  • Do you know the building blocks (caching, queues, sharding)?
  • Can you go deep on at least one component?

Common Mistakes

  • Jumping to solutions before understanding requirements
  • Over-engineering from the start (microservices for 100 users)
  • Not considering failure modes ("what if the database goes down?")
  • Ignoring consistency requirements
  • Treating CAP/PACELC as theoretical rather than practical
  • Using technology buzzwords without explaining why
  • Not being able to estimate (orders of magnitude matter)

Quick Reference: Design Decision Matrix

Need                               Solution
----                               --------
High read volume                   Add cache (Redis), read replicas
High write volume                  Sharding, message queue, write batching
Low latency reads                  Cache at multiple levels, CDN for static
Data consistency across services   Saga pattern, 2PC (avoid if possible)
Service resilience                 Circuit breaker, bulkhead, retry with backoff
Real-time communication            WebSockets, Server-Sent Events
Large file storage                 Object storage (S3, GCS), CDN for delivery
Full-text search                   Elasticsearch, Solr, Typesense
Time-series metrics                InfluxDB, TimescaleDB, VictoriaMetrics
Graph queries                      Neo4j, Amazon Neptune, or adjacency list in PostgreSQL
Global distribution                Multi-region deployment, Anycast DNS, CRDTs
Audit trail / event history        Event sourcing, append-only log
Schema flexibility                 Document DB (MongoDB) or JSONB in PostgreSQL
Complex event processing           Kafka Streams, Flink, Spark Streaming
Unique IDs at scale                Snowflake IDs, UUID v7
Rate limiting at scale             Token bucket in Redis

Resources

Books:

  • Designing Data-Intensive Applications by Martin Kleppmann -- essential reading
  • System Design Interview by Alex Xu (Volumes 1 and 2)
  • Building Microservices by Sam Newman
  • Release It! by Michael Nygard (patterns for resilience)

Papers:

  • Google Bigtable (2006) -- wide-column store design
  • Amazon Dynamo (2007) -- consistent hashing, vector clocks, quorum
  • Google Spanner (2012) -- globally distributed transactions
  • Facebook Memcached (2013) -- caching at scale
  • LinkedIn Kafka (2011) -- distributed commit log

Practice:

  • Design systems you actually use: how does Twitter, YouTube, or Uber work?
  • Read engineering blogs: Netflix Tech Blog, Uber Engineering, Airbnb Tech
  • Do mock interviews with a peer -- articulating design out loud is a skill