High-Level System Design: A Complete Guide for Engineers
Table of Contents
- How to Approach Any System Design Problem
- Constraints and Tradeoffs
- Core Building Blocks
- Common System Design Patterns
- Data Design Patterns
- Availability and Reliability
- Microservices vs Monolith
- Observability
- Security Considerations
- Designing Specific Systems -- Worked Examples
- Interview Strategy
- Quick Reference: Design Decision Matrix
- Resources
A comprehensive guide to High-Level System Design covering design principles, constraints, scalability patterns, databases, caching, messaging, load balancing, and how to systematically approach and solve any system design problem.

System design is the process of defining the architecture, components, interfaces, and data flow of a system to satisfy specified requirements. It is where engineering judgment meets real-world constraints: you are not finding the one correct answer, you are making defensible tradeoffs. This guide covers how to think about system design, what constraints actually matter, the building blocks every large system uses, and how to work through any design problem from scratch.

Before drawing a single box, follow this framework every time. Skipping steps is how you end up designing the wrong system perfectly. Never assume; ask questions before designing: functional requirements (what must the system do?), non-functional requirements (how must it behave?), and any hard constraints.

Back-of-envelope calculations ground your design decisions, so memorize a handful of useful numbers (see the latency and sizing tables below).

Define the interface before the internals; this forces clarity on what the system actually does. Then draw the major components and their relationships: start simple, evolve the design, and refine it. Finally, pick the hardest or most interesting parts and go deep, asking for each component how it scales and how it fails.

Every design decision is a tradeoff. Understanding these is the difference between junior and senior design.

The CAP theorem: a distributed system can guarantee at most two of three properties -- consistency, availability, and partition tolerance. Since network partitions are inevitable in distributed systems, you are in practice always choosing between CP and AP. PACELC extends CAP: even when there is no partition, you choose between latency (L) and consistency (C). The practical implications for common systems are summarized in the table below.

A load balancer distributes incoming traffic across multiple servers. It can balance at Layer 4 (transport) or Layer 7 (application), using algorithms such as round robin, least connections, or hashing. Load balancers should also actively health-check backends and remove unhealthy instances.
DNS maps domain names to IP addresses; large systems also use it as a routing tool (geo-based or latency-based resolution). A CDN (Content Delivery Network) caches static content at edge servers close to users. Use a CDN for any static or semi-static content (images, videos, HTML, JS), or for API responses that can tolerate slight staleness.

Caching stores the results of expensive computation or data retrieval closer to the consumer. Four decisions matter: where the cache sits (client, CDN, application tier, or in front of the database), the read/write strategy (compared below), the eviction policy (LRU, LFU, FIFO, random), and the invalidation strategy (TTL expiry, explicit purge on write, or event-driven invalidation).

Thundering herd / cache stampede: when many requests arrive simultaneously after a cache entry expires, they all hit the database. Mitigations include letting only one request recompute the value while others wait, probabilistic early refresh, and serving slightly stale data while refreshing in the background.

Choosing the right database is one of the most consequential design decisions. Use a relational (SQL) database when you need ACID transactions, ad-hoc queries, and a well-defined schema. Popular choices: PostgreSQL, MySQL, MariaDB. The main scaling patterns are read replicas and sharding (shown below). Choose a sharding key that distributes load evenly and keeps the most common queries on a single shard. Consistent hashing for sharding maps keys to a ring, minimising redistribution when nodes are added or removed; it is used by Cassandra, DynamoDB, and many cache systems.

Choose NoSQL when scale, write throughput, or schema flexibility outgrow a single relational node. Cassandra, common in large systems, is a wide-column store with leaderless replication and tunable consistency, optimized for heavy write loads. NewSQL attempts to combine relational semantics with NoSQL scalability.

Asynchronous communication decouples services, absorbs traffic spikes, and enables fan-out. Message queue (point-to-point): each message is consumed by one consumer; good for task distribution. Event streaming (pub/sub): messages are retained and can be consumed by multiple consumers independently. Kafka's key concepts are topics, partitions, consumer groups, and offsets; ordering is guaranteed within a partition.

REST design principles: model resources as nouns, version the API, and paginate collections. The API gateway pattern gives all clients a single entry point that handles authentication, rate limiting, routing, and SSL termination. Rate limiting protects services from overload and abuse; it is typically enforced at the gateway, sometimes additionally per service. The standard algorithms are compared below.

Hashing matters for request routing as well. A standard hash remaps almost every key when the server count changes; consistent hashing remaps only a small fraction. It is used in CDN routing, distributed caches (Redis Cluster), Cassandra, DynamoDB, and load balancers. Virtual nodes give each physical server multiple positions on the ring, improving load distribution.

Fan-out is the problem of propagating one write to many places. Fan-out on write (push model): when a user tweets, immediately update all followers' timelines.
Fan-out on read (pull model): when a user opens their timeline, fetch tweets from all followed accounts. Twitter uses a hybrid: push for most users, pull for accounts with huge follower counts.

CQRS separates the write model (commands) from the read model (queries); use it when reads and writes have very different shapes or scaling requirements. Event sourcing goes further: instead of storing current state, store a log of all events. Benefits: complete audit trail, ability to replay history, temporal queries. Costs: more complex, eventual consistency, storage grows without bound.

The saga pattern handles distributed transactions across multiple services without two-phase commit. Choreography saga: services react to events (no central coordinator, harder to debug). Orchestration saga: a central orchestrator commands each step (easier to understand).

The strangler fig pattern migrates a monolith to microservices incrementally. The bulkhead pattern isolates failures so one failing component does not take down others. The circuit breaker prevents cascading failures when a downstream service is degraded. Libraries: Resilience4j (Java), Polly (.NET), Hystrix (deprecated), Istio (service mesh).

Normalized schemas are good for writes but slow for reads; denormalized schemas are fast to read but complex to write. Denormalize when the read-to-write ratio exceeds 10:1 and query latency is critical.

Time-series data is append-only, queried by time range, and written at high rates. Dedicated databases: InfluxDB, TimescaleDB (a PostgreSQL extension), VictoriaMetrics, Amazon Timestream. Use retention tiers to age data out.

Auto-increment IDs do not work in distributed systems (ID conflicts across shards); the options are compared below.

An error budget is how much downtime your SLO allows. Most systems use 3 replicas as the standard replication factor. Service discovery lets services find each other's addresses dynamically instead of hardcoding them.

You cannot debug what you cannot see. Alert on symptoms, not causes.

Defense in depth: multiple security layers (network, application, data). Zero trust: never trust, always verify; even internal service-to-service calls are authenticated. Secrets management: never hardcode secrets; use Vault, AWS Secrets Manager, or Kubernetes Secrets (with encryption at rest).
SQL injection prevention: use parameterized queries / prepared statements, never string concatenation. Rate limiting: prevent brute force, enumeration, and DoS. Input validation: validate at the API layer; never trust client input.

The worked examples below apply everything above. For the URL shortener, note the redirect-code choice: HTTP 301 is permanent and cached by the browser, while 302 is temporary and always hits the server, which is better for analytics. The distributed cache supports the core operations GET, SET, DELETE, and EXPIRE over the data structures string, list, set, hash, and sorted set. For search autocomplete, the key data structure is a trie (prefix tree) or an inverted index.

For interviews: spend the first two minutes clarifying, not designing. Think out loud -- interviewers want to see your reasoning process, not just the answer -- and state your assumptions explicitly. Start simple, then evolve: begin with a single-server design that works, then identify bottlenecks and add complexity only where needed. Show tradeoff awareness. Numbers matter: estimating scale and sizing components demonstrates engineering judgment.

How to Approach Any System Design Problem
Step 1 -- Clarify Requirements (5 minutes)
Step 2 -- Estimate Scale (5 minutes)
Daily Active Users (DAU): 10 million
Requests per user per day: 20
Total requests/day: 200 million
Requests per second (avg): 200M / 86,400 = ~2,300 RPS
Peak RPS (10x avg): 23,000 RPS
Storage estimate:
- 1 tweet = 300 bytes text + metadata
- 500M tweets/day = 150 GB/day
- 5 years = 150 GB * 365 * 5 = ~274 TB
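The arithmetic above is easy to sanity-check in a few lines; all inputs are this example's assumed numbers, not measurements:

```python
# Back-of-envelope sizing for the example above.
SECONDS_PER_DAY = 86_400

dau = 10_000_000
requests_per_user = 20
total_requests = dau * requests_per_user            # 200M/day
avg_rps = total_requests / SECONDS_PER_DAY          # ~2,300
peak_rps = avg_rps * 10                             # ~23,000

tweet_bytes = 300
tweets_per_day = 500_000_000
storage_per_day_gb = tweet_bytes * tweets_per_day / 1e9   # 150 GB/day
five_year_tb = storage_per_day_gb * 365 * 5 / 1000        # ~274 TB
```

Round aggressively; the goal is orders of magnitude, not precision.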
Bandwidth estimate:
- Each request serves 1 KB average response
- 2,300 RPS * 1 KB = 2.3 MB/s ingress
- If read-heavy (10:1 ratio): 23 MB/s egress

| Item | Approximate size |
| --- | --- |
| Tweet / short message | 300 bytes |
| User profile record | 1 KB |
| Metadata row (SQL) | 100-500 bytes |
| Photo (compressed) | 200 KB - 2 MB |
| Video (1 min, 720p) | 50 MB |
| 1 million entries in memory | ~1 GB (rough) |
| MySQL rows per server | ~50M comfortably |
| Redis entries per instance | ~100M (depends on value size) |

Step 3 -- Define the API (5 minutes)
POST /tweets -- create a tweet
GET /tweets/{id} -- get a tweet
GET /users/{id}/timeline -- get user's home timeline
POST /users/{id}/follow -- follow a user
GET /search?q=&limit= -- search tweets

Step 4 -- High-Level Design (10 minutes)
Client --> Load Balancer --> API Servers --> Database
                                  |
                             Cache Layer

Client --> CDN --> Load Balancer --> API Gateway
                                          |
                        ┌─────────────────┼─────────────────┐
                        |                 |                 |
                  User Service      Tweet Service     Timeline Service
                        |                 |                 |
                     SQL DB         SQL DB + S3        Redis Cache
                                          |
                                   Message Queue
                                          |
                                  Fan-out Workers

Step 5 -- Deep Dive into Critical Components (15 minutes)
Step 6 -- Identify and Resolve Bottlenecks
Constraints and Tradeoffs
The CAP Theorem
| System | Choice | Reasoning |
| --- | --- | --- |
| Banking, financial transactions | CP | Wrong balance is worse than downtime |
| Social media feeds | AP | Showing slightly stale content is fine |
| DNS | AP | Availability matters more than immediate propagation |
| HBase | CP | Strong consistency for analytics |
| Cassandra | AP | Tunable consistency, defaults to AP |
| DynamoDB | AP (configurable) | Eventual consistency by default |

PACELC Extension
Latency Numbers Every Engineer Should Know
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Main memory reference: 100 ns
Read 1 MB from memory: 250 microseconds
SSD random read: 150 microseconds
SSD sequential read (1 MB): 1 ms
HDD random seek: 10 ms
Network: same datacenter: 0.5 ms
Network: cross-region: 30-100 ms
Network: cross-continent: 100-300 ms

The Four Scalability Dimensions
Scale Up (Vertical) -- bigger machine, more RAM/CPU
Scale Out (Horizontal) -- more machines, distribute load
Scale Data (Sharding) -- split data across multiple DBs
Scale Reads (Replication) -- multiple read replicas

Core Building Blocks
Load Balancers
DNS and CDN
Caching
Cache-Aside (Lazy Loading):
1. App checks cache
2. Cache miss: app reads from DB
3. App writes result to cache
4. Next request: cache hit
-- Con: first request is slow, cache can be stale
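The four steps above can be sketched with plain dicts standing in for Redis and the database (the key names are hypothetical):

```python
# Minimal cache-aside sketch: dicts stand in for the cache tier and the DB.
db = {"user:1": {"name": "Ada"}}   # hypothetical backing store
cache = {}                          # hypothetical cache tier

def get_user(key):
    if key in cache:                # 1. check cache
        return cache[key]           # 4. hit on subsequent requests
    value = db.get(key)             # 2. miss: read from DB
    if value is not None:
        cache[key] = value          # 3. populate cache for next time
    return value

first = get_user("user:1")    # miss: hits the DB, fills the cache
second = get_user("user:1")   # hit: served from cache
```

A production version would also set a TTL on step 3 so stale entries eventually expire.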
Write-Through:
1. App writes to cache
2. Cache synchronously writes to DB
-- Pro: cache always fresh
-- Con: every write hits both, higher write latency
Write-Behind (Write-Back):
1. App writes to cache only
2. Cache asynchronously writes to DB (batched)
-- Pro: very fast writes
-- Con: data loss if cache fails before flush
Read-Through:
Cache sits in front of DB
App always talks to cache
Cache handles DB fetching on miss
-- Good for read-heavy workloads with cache library support

Databases
Relational Databases (SQL)
Read Replicas:
Primary (writes) --> Replica 1 (reads)
--> Replica 2 (reads)
--> Replica 3 (reads)
-- Handles read-heavy workloads
-- Replication lag means replicas may be slightly behind
Vertical Sharding (by function):
Users DB | Orders DB | Products DB
-- Different tables on different servers
-- Simple but limited
Horizontal Sharding (by data):
Shard by user_id % N:
Shard 0: users 0, 3, 6, 9...
Shard 1: users 1, 4, 7, 10...
Shard 2: users 2, 5, 8, 11...
-- Scales writes
-- Cross-shard queries are hard

NoSQL Databases
| Type | Examples | Use When |
| --- | --- | --- |
| Key-Value | Redis, DynamoDB | Session data, caching, simple lookups |
| Document | MongoDB, CouchDB | Semi-structured data, flexible schema |
| Wide-Column | Cassandra, HBase | Time-series, write-heavy, huge scale |
| Graph | Neo4j, Amazon Neptune | Relationships are the primary query (social graph, recommendations) |

NewSQL
Message Queues and Event Streaming
Use a queue when:
- Email sending after user registration
- Image processing after upload
- Order fulfillment after purchase
- Any "fire and forget" task
Use event streaming when:
- Multiple services react to the same event
- You need event replay (audit log, rebuilding state)
- Real-time analytics pipeline
- Change data capture (CDC) from a database

API Design
REST vs GraphQL vs gRPC
| | REST | GraphQL | gRPC |
| --- | --- | --- | --- |
| Protocol | HTTP/1.1 | HTTP/1.1 | HTTP/2 |
| Data format | JSON/XML | JSON | Protobuf (binary) |
| Typing | Loose | Strong | Strong |
| Fetching | Fixed endpoints | Client specifies fields | Generated stubs |
| Best for | Public APIs, simple CRUD | Mobile apps, variable data needs | Internal microservices |
| Performance | Good | Good | Excellent |

REST conventions: use nouns for resources (/users, not /getUser), version the API (/v1/users), and paginate collections (?page=2&limit=50 or cursor-based).

Rate Limiting
Token Bucket:
- Bucket holds N tokens
- Each request consumes one token
- Tokens refill at fixed rate
- Allows bursts up to bucket size
- Used by: AWS, Stripe
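A minimal in-process token bucket sketch of the algorithm above (at scale the counters would live in Redis, per the decision matrix at the end; the injectable `now` clock is an illustration convenience):

```python
import time

class TokenBucket:
    """Token-bucket limiter sketch: capacity bounds bursts, refill_rate is
    tokens added per second. Not thread-safe; illustration only."""
    def __init__(self, capacity, refill_rate, now=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)   # start full, so bursts are allowed
        self.now = now
        self.last = now()

    def allow(self):
        t = self.now()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1            # each request consumes one token
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)
burst = [bucket.allow() for _ in range(4)]   # [True, True, True, False]
```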
Leaky Bucket:
- Requests enter a queue
- Queue drains at fixed rate
- Smooths out bursts
- Used by: traffic shaping
Fixed Window Counter:
- Count requests in time window
- Reset at window boundary
- Problem: burst at window boundary (2x limit possible)
Sliding Window Log:
- Store timestamp of each request
- Count requests in sliding window
- Accurate but memory-intensive
Sliding Window Counter:
- Hybrid of fixed window and sliding log
- Good accuracy, low memory
- Used by: Cloudflare

Consistent Hashing
Standard hash: server = hash(key) % N -- adding or removing a server remaps ~(N-1)/N of keys.
Consistent hashing: server = first server clockwise from hash(key) on the ring -- adding or removing a server remaps only ~1/N of keys on average.

Common System Design Patterns
Fan-Out
CQRS (Command Query Responsibility Segregation)
Write side: Client --> Command Handler --> Write DB
|
Event Bus
|
Read side:  Client <-- Query Handler <-- Read DB (denormalized)

Event Sourcing
Traditional: Users table row: {id: 1, balance: 150}
Event Sourcing: Event log:
{type: "AccountOpened", amount: 200}
{type: "Withdrew", amount: 50}
-- current state derived by replaying events

Saga Pattern
Order Service:
1. Create order (pending) -- if fails, rollback
2. Reserve inventory -- if fails, cancel order
3. Charge payment -- if fails, release inventory + cancel order
4. Update order to confirmed -- if fails, refund + release inventory + cancel order
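The orchestration above can be sketched as a list of (action, compensation) pairs; the service calls here are plain stand-in functions, not real APIs:

```python
# Orchestration-saga sketch: on failure, run the compensations for every
# completed step in reverse order.
def run_saga(steps):
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):   # roll back completed steps
                undo()
            return False
    return True

def charge_payment():                     # hypothetical failing step
    raise RuntimeError("payment declined")

log = []
ok = run_saga([
    (lambda: log.append("order created"), lambda: log.append("order cancelled")),
    (lambda: log.append("inventory reserved"), lambda: log.append("inventory released")),
    (charge_payment, lambda: log.append("payment refunded")),
])
# ok is False; inventory is released, then the order is cancelled
```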
Each step has a compensating transaction for rollback.

Strangler Fig Pattern
1. New requests --> Router --> Monolith (all traffic)
2. Build new service for one feature
3. New requests --> Router --> New Service (feature X)
--> Monolith (everything else)
4. Repeat until the monolith is gone

Bulkhead Pattern
Without bulkheads:
API Server (one thread pool for all downstream calls)
If Recommendation Service slows down → thread pool exhausted → everything fails
With bulkheads:
API Server with separate thread pools:
Pool A (20 threads) → User Service
Pool B (20 threads) → Product Service
Pool C (5 threads)  → Recommendation Service (can fail safely)

Circuit Breaker
States:
CLOSED → requests pass through, failures counted
OPEN → requests fail fast (no downstream call), reset timer starts
HALF-OPEN → one test request, if success → CLOSED, if fail → OPEN
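The three-state machine above can be sketched in a few lines. This sketch trips on a consecutive-failure count rather than a failure-rate window, and the thresholds are illustrative (see Resilience4j for a production design):

```python
import time

class CircuitBreaker:
    """Sketch of the CLOSED / OPEN / HALF-OPEN state machine."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.now = now
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if self.now() - self.opened_at >= self.reset_timeout:
                self.state = "HALF-OPEN"        # allow one probe request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"             # trip: fail fast from now on
                self.opened_at = self.now()
            raise
        self.failures = 0
        self.state = "CLOSED"                   # success closes the circuit
        return result
```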
Thresholds:
Open when: failure rate > 50% in last 10 requests
Half-open after: 30 seconds

Data Design Patterns
Denormalization for Read Performance
SELECT u.name, t.text, COUNT(l.id) as likes
FROM tweets t
JOIN users u ON t.user_id = u.id
LEFT JOIN likes l ON t.id = l.tweet_id
WHERE u.id = 123
GROUP BY t.id, u.name, t.text;

tweets_with_author table:
{ tweet_id, text, author_name, author_avatar, like_count }
-- No join needed, just query by the user_id index

Time-Series Data
Raw data: keep 7 days (high precision, high volume)
1-min rollup: keep 30 days
1-hour rollup: keep 1 year
1-day rollup: keep forever

Distributed Unique ID Generation
UUID v4:
- Truly random, no coordination needed
- 128 bits, not sortable, large index size
Twitter Snowflake:
- 64-bit integer
- 41 bits: timestamp (ms precision, 69 year span)
- 10 bits: machine ID
- 12 bits: sequence within millisecond
- Roughly time-sortable
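The bit layout above is just shifting and OR-ing. A sketch, using the constant commonly cited as Twitter's custom epoch (any fixed epoch works; machine ID and sequence would come from coordination and a per-millisecond counter in practice):

```python
# Snowflake-style 64-bit ID composition: 41-bit timestamp | 10-bit machine | 12-bit seq.
CUSTOM_EPOCH_MS = 1_288_834_974_657   # Twitter's epoch (Nov 2010); illustrative

def snowflake(timestamp_ms, machine_id, seq):
    assert 0 <= machine_id < 2**10 and 0 <= seq < 2**12
    elapsed = timestamp_ms - CUSTOM_EPOCH_MS
    return (elapsed << 22) | (machine_id << 12) | seq

a = snowflake(1_700_000_000_000, machine_id=3, seq=0)
b = snowflake(1_700_000_000_001, machine_id=3, seq=0)
# a later timestamp yields a numerically larger ID, so IDs are roughly time-sortable
```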
Instagram-style:
- 64-bit
- 41 bits: timestamp
- 13 bits: shard ID
- 10 bits: sequence
- Encodes the shard, simplifies routing
Database-assigned (with segments):
- Central ID server issues segments (e.g., IDs 1001-2000)
- Each service burns through its segment locally
- Requests a new segment when exhausted

Availability and Reliability
SLA, SLO, SLI
SLI (Service Level Indicator): a metric
-- "request success rate" = successful requests / total requests
SLO (Service Level Objective): target value for an SLI
-- "99.9% of requests succeed in a rolling 30-day window"
SLA (Service Level Agreement): contract with consequences
-- "if SLO is missed, customer gets 10% credit"99.9% availability → 8.7 hours downtime/year (43 minutes/month)
99.95% → 4.4 hours/year (22 minutes/month)
99.99% → 52 minutes/year (4 minutes/month)
99.999% → 5 minutes/year (26 seconds/month)

Replication Strategies
Single-leader (master-replica):
- One primary accepts writes
- Replicas receive changes asynchronously
- Read from replicas (may be slightly stale)
- Used by: MySQL, PostgreSQL, MongoDB
Multi-leader:
- Multiple nodes accept writes
- Changes sync to each other
- Conflict resolution needed
- Used by: Google Docs, CouchDB
- Good for: multi-datacenter with low cross-region write latency
Leaderless (quorum-based):
- Any node accepts writes
- Write to W nodes, read from R nodes
- W + R > N → reads and writes are guaranteed to overlap (strong consistency)
- W + R <= N → eventual consistency (higher availability)
- Used by: Cassandra (QUORUM), DynamoDB

Disaster Recovery
RTO (Recovery Time Objective): maximum acceptable downtime
RPO (Recovery Point Objective): maximum acceptable data loss
Strategy options:
| Strategy | RTO | RPO | Cost |
| --- | --- | --- | --- |
| Backup and Restore | hours | hours | low |
| Pilot Light | ~10 min | minutes | medium |
| Warm Standby | minutes | seconds | medium-high |
| Multi-site Active | seconds | ~zero | high |

Microservices vs Monolith
When to Use a Monolith
When to Migrate to Microservices
Microservices Communication
Synchronous (request-response):
- REST, gRPC
- Simple, immediate response
- Tight coupling: caller waits, failure propagates
Asynchronous (event-driven):
- Message queues, Kafka
- Loose coupling, resilience
- More complex, eventual consistency

API Gateway vs Service Mesh
API Gateway:
- North-south traffic (external clients to services)
- Authentication, rate limiting, routing, SSL
- Single ingress point
Service Mesh (Istio, Linkerd, Consul Connect):
- East-west traffic (service to service)
- Mutual TLS, circuit breaking, retries, observability
- Sidecar proxy per service pod

Observability
The Three Pillars
Metrics: aggregated numeric data over time
-- Request rate, error rate, latency, CPU, memory
-- Tools: Prometheus, Datadog, CloudWatch
Logs: discrete event records
-- Application logs, access logs, audit logs
-- Tools: ELK stack, Loki + Grafana, Splunk
Traces: end-to-end request path across services
-- Shows where time is spent across microservices
-- Tools: Jaeger, Zipkin, AWS X-Ray, Tempo

The RED Method (for services)
Track Rate (requests/sec), Errors (failed requests/sec), and Duration (latency distribution) for every service.

The USE Method (for resources)
Track Utilization, Saturation, and Errors for every resource (CPU, memory, disk, network).

SLO Alerting
Bad: "CPU > 80%" -- CPU high doesn't always mean users are impacted
Good: "p99 latency > 500ms" -- users are definitely experiencing this
Bad: "Disk usage > 70%"
Good: "Error rate > 1%" or "time to fill disk < 4 hours" Security Considerations
Authentication and Authorization
Authentication: who are you?
-- Username/password, OAuth2, SAML, magic link
Authorization: what can you do?
-- RBAC (role-based), ABAC (attribute-based), ACL
Token types:
Session token: server-side state, opaque string
+ Easy to revoke, + Less data exposure
- Does not scale without shared storage (Redis)
JWT (JSON Web Token): self-contained, stateless
+ No server-side storage needed, + Scales easily
- Hard to revoke before expiry (use short TTLs + refresh tokens)
- Larger than session token
Recommended JWT approach:
Access token: short TTL (15 minutes), used for API calls
Refresh token: longer TTL (30 days), stored httpOnly cookie
On expiry: use the refresh token to get a new access token

Common Security Patterns
Designing Specific Systems -- Worked Examples
Design a URL Shortener (bit.ly)
Writes: 100M/day = 1,160 writes/sec
Reads: 10B/day = 115,740 reads/sec
Read:Write ratio = 100:1 (read-heavy)
Storage: 500 bytes per URL * 100M/day * 5 years = ~90 TB

POST /api/shorten body: {long_url} --> returns {short_url}
GET /{code} --> 301/302 redirect

Option 1: hash(long_url) + take first 7 chars
- Pro: the same URL always gets the same code
- Con: truncated hashes collide; collisions need detection and retry
Option 2: base62 encode a unique integer ID
- 7 chars of base62 (a-z, A-Z, 0-9) = 62^7 = 3.5 trillion codes
- Generate unique ID from distributed counter or Snowflake
- Encode the ID as a base62 string

Write path:
Client --> API Server --> ID Generator --> DB (URL table)
--> Cache (short:long mapping)
Read path:
Client --> Load Balancer --> API Server
| |
CDN Redis Cache (hit: 99%+)
DB (miss: 1%)

CREATE TABLE urls (
id BIGINT PRIMARY KEY,
short_code VARCHAR(8) UNIQUE NOT NULL,
long_url TEXT NOT NULL,
user_id BIGINT,
created_at TIMESTAMP DEFAULT NOW(),
expires_at TIMESTAMP,
click_count BIGINT DEFAULT 0
);
CREATE INDEX idx_short_code ON urls(short_code);

Design a Chat System (WhatsApp)
Messages: 100B/day = 1.15M messages/sec
Storage: 100B/day * 100 bytes avg = 10 TB/day

HTTP polling: client polls every N seconds -- high latency, wasteful
Long polling: client holds connection open -- better but complex
WebSocket: bidirectional persistent connection -- best for chat
Server-Sent Events: one-directional push -- good for notifications

Online user flow:
Sender --> Chat Server A --> Chat Server B --> Receiver
Offline user flow:
Sender --> Chat Server --> Message Queue --> Push Notification Service
--> Message DB (store for delivery when online)

messages:
message_id BIGINT (Snowflake ID -- time sortable)
from_user_id BIGINT
to_id BIGINT (user_id for 1:1, group_id for group)
content TEXT
type ENUM(text, image, video, audio)
status ENUM(sent, delivered, read)
created_at TIMESTAMP

Small groups (< 100 members):
Fan-out on write: message copied to each member's inbox on send
Large groups (> 100 members):
Fan-out on read: single message stored, members pull on connect
WhatsApp uses: a single copy + per-member delivery tracking

Design a Distributed Cache (Redis-like)
Clients --> Hash to shard (consistent hashing) --> Cache Node
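The "hash to shard" step above can be sketched as a consistent-hash ring with virtual nodes. MD5 is used only to spread keys, not for security, and the node names are hypothetical:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring sketch with virtual nodes."""
    def __init__(self, nodes, vnodes=100):
        # each physical node gets `vnodes` positions on the ring
        self.ring = sorted(
            (self._hash(f"{n}#{v}"), n) for n in nodes for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key):
        # first ring position clockwise from hash(key), wrapping at the end
        i = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[i][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
node = ring.node_for("user:42")   # the same key always routes to the same node
```

Removing a node only remaps the keys that were on it; everything else stays put, which is the whole point of the ring.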
Single node per shard:
- Simple but no HA
Primary-replica per shard:
- Primary handles writes
- Replicas handle reads + failover
Redis Cluster:
- 16,384 hash slots distributed across nodes
- Each key maps to a slot: slot = CRC16(key) % 16384
- Gossip protocol for node health

noeviction: reject new writes
allkeys-lru: evict any key using LRU
volatile-lru: evict TTL keys using LRU (keep non-TTL keys)
allkeys-random: evict random key
volatile-ttl: evict the key with the shortest TTL first

Design a Search Autocomplete
Words: apple, app, application, apply
Trie nodes:
root
  a
    p
      p (end: "app")
        l
          e (end: "apple")
          y (end: "apply")
          i
            c
              a
                t
                  i
                    o
                      n (end: "application")
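A minimal trie matching the diagram above; `suggest` DFS-walks every completion under a prefix, whereas a production system would precompute and cache the top-k completions per node:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def suggest(root, prefix):
    node = root
    for ch in prefix:                    # walk down to the prefix node
        if ch not in node.children:
            return []
        node = node.children[ch]
    out = []
    def collect(n, path):                # DFS for every completion below it
        if n.is_word:
            out.append(prefix + path)
        for ch, child in sorted(n.children.items()):
            collect(child, path + ch)
    collect(node, "")
    return out

root = TrieNode()
for w in ["apple", "app", "application", "apply"]:
    insert(root, w)
suggestions = suggest(root, "app")   # ['app', 'apple', 'application', 'apply']
```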
Search "app" traverses to node at "app", returns all suffixesprefix:app --> ["apple", "application", "apply", "apps", "append"] Interview Strategy
How to Perform Well in System Design Interviews
"Before I start designing, let me ask a few questions to make sure
I understand the scope...""I'll assume this is a global system with users in multiple regions.
I'll assume we need 99.99% availability.
I'll assume reads vastly outnumber writes.""We could use Redis for the session store. The tradeoff is that
Redis adds operational complexity and is a potential single point
of failure -- we'd need Redis Cluster or Sentinel. Alternatively
we could use sticky sessions at the load balancer, which is simpler
but makes rolling deployments harder." Common Mistakes
Quick Reference: Design Decision Matrix
| Need | Solution |
| --- | --- |
| High read volume | Add cache (Redis), read replicas |
| High write volume | Sharding, message queue, write batching |
| Low latency reads | Cache at multiple levels, CDN for static content |
| Data consistency across services | Saga pattern, 2PC (avoid if possible) |
| Service resilience | Circuit breaker, bulkhead, retry with backoff |
| Real-time communication | WebSockets, Server-Sent Events |
| Large file storage | Object storage (S3, GCS), CDN for delivery |
| Full-text search | Elasticsearch, Solr, Typesense |
| Time-series metrics | InfluxDB, TimescaleDB, VictoriaMetrics |
| Graph queries | Neo4j, Amazon Neptune, or adjacency lists in PostgreSQL |
| Global distribution | Multi-region deployment, Anycast DNS, CRDTs |
| Audit trail / event history | Event sourcing, append-only log |
| Schema flexibility | Document DB (MongoDB) or JSONB in PostgreSQL |
| Complex event processing | Kafka Streams, Flink, Spark Streaming |
| Unique IDs at scale | Snowflake IDs, UUID v7 |
| Rate limiting at scale | Token bucket in Redis |

Resources