System Design Interview Questions: Practice for Every Level

Reviewed by Mark Dickie · Last updated

System design is the process of defining the architecture, components, data flow, and trade-offs of a software system to meet a set of functional and non-functional requirements. Interviews on this topic test whether you can think at scale — balancing consistency, availability, and performance under real constraints. You should know how to scope a problem before proposing a solution, justify your technology choices with concrete trade-offs, and talk through failure modes without being prompted. Interviewers at every level listen for structured thinking as much as for correct answers.

What the interview actually covers

System design questions span a wide surface area. The table below maps the main topic groups to the concepts most likely to come up in a live interview.

| Topic area | Key concepts to know | |---|---| | Scalability | Horizontal vs. vertical scaling, load balancing, stateless services | | Data storage | SQL vs. NoSQL trade-offs, sharding, replication, CAP theorem | | Caching | Cache-aside vs. write-through, eviction policies, CDN placement | | Messaging & async | Message queues, pub/sub, at-least-once vs. exactly-once delivery | | Networking | DNS resolution, HTTP/2, WebSockets, REST vs. gRPC | | Reliability | Rate limiting, circuit breakers, retries with exponential back-off | | Estimation | Back-of-envelope math for QPS, storage, and bandwidth |

How difficulty scales across levels

The five difficulty tiers on this page correspond to what interviewers expect at different career stages. Knowing which tier you are targeting helps you set the right depth for your answers.

  1. Level 1 — Fundamentals. Client–server model, basic CRUD architecture, a single database, and simple caching. Suitable for new-grad and junior roles.
  2. Level 2 — Core patterns. Load balancers, relational vs. document stores, CDN usage, and introductory API design. Expected at mid-level interviews.
  3. Level 3 — Distributed systems basics. Consistent hashing, database indexing strategies, message queues, and read/write replicas. Common at senior-level rounds.
  4. Level 4 — Advanced trade-offs. Consensus protocols (Raft, Paxos at a conceptual level), distributed transactions, multi-region replication, and capacity estimation under tight constraints. Staff-engineer territory.
  5. Level 5 — Principal/architect depth. End-to-end design of systems at the scale of millions of requests per second, including data-pipeline architecture, global failover strategy, and cost/reliability trade-offs across cloud regions.

How to structure any answer

Regardless of difficulty, experienced interviewers expect you to follow a consistent approach. Jumping straight to "I'd use Kafka" before clarifying requirements is one of the most common ways candidates lose points.

  1. Clarify requirements — ask about scale, read/write ratio, latency targets, and consistency needs before drawing anything.
  2. Estimate scale — do a quick back-of-envelope calculation for QPS, storage growth, and bandwidth to size the system.
  3. Sketch a high-level design — start with the fewest components that actually work; add complexity only when a bottleneck demands it.
  4. Go deep on one component — interviewers often pick one area (the database layer, the cache, the queue) and push hard; be ready to defend your choices.
  5. Identify failure modes — discuss what breaks first under load and how you would detect or recover from it.

The quiz below covers all five levels. Filter by difficulty to focus your preparation on the tier that matches your target role.

At a glance

Questions15
Difficulty2–5 of 5
FormatsMultiple choice, Multiple answer, Flashcard, Ordering

What you'll review

  1. horizontal vertical
  2. caching strategies
  3. cap consistency
  4. sharding
  5. replication
  6. rate limiting
  7. circuit breaker
  8. observability sd
  9. message queues
  10. idempotency keys
  11. sql vs nosql
  12. availability
  13. cdn edge
  14. load balancing
  15. scalability

Practice questions

System Design/sd-fundamentals/horizontal-vertical

Your API runs on a fleet behind a round-robin load balancer, but users intermittently get logged out as requests land on different nodes. The team stores session state in each node's local memory. What is the single most important change to enable safe horizontal scaling?

Options

  • Move session state out of the app process into a shared store (e.g. Redis) so any node can serve any request
  • Enable sticky sessions on the load balancer and keep state in local memory
  • Scale each node vertically with more RAM so sessions never get evicted
  • Add a CDN in front of the API to cache the session responses
Show answer

Move session state out of each node's local memory into a shared store such as Redis, so any node can serve any request. The nodes are stateful today — each holds session data only it can see — which is why round-robin routing logs users out. Externalising state makes nodes interchangeable, the precondition for safe horizontal scaling, rolling deploys, and failover.

Why:

The root problem is that the app nodes are stateful: each holds session data only it can see. Externalising state to a shared store makes the nodes interchangeable, which is the precondition for horizontal scaling and for free rolling deploys and failover. Sticky sessions are a workaround, not a fix — they pin users to a node, so a deploy or crash still drops their session and they defeat even load distribution. Vertical scaling (more RAM) raises the ceiling but keeps the state trapped on one box, so the cross-node inconsistency remains. A CDN caches public, cacheable responses; per-user session data is neither, so it does nothing here.

System Design/sd-fundamentals/caching-strategies

A read-heavy product catalog uses a cache in front of the database. The team wants fresh reads immediately after a write with the lowest steady-state read latency, accepting slightly slower writes. Which caching strategy best fits?

Options

  • Cache-aside (lazy loading) with no write-side update
  • Write-through: write to the cache and the database synchronously on every write
  • Write-behind (write-back): acknowledge the write from cache and flush to the DB asynchronously
  • TTL-only caching with a 60-second expiry and no write path
Show answer

Use a write-through cache: write to the cache and the database synchronously on every write. The just-written value is already in cache for the next read, giving fresh reads immediately at the cost of a slightly slower write — exactly the trade described. Cache-aside only populates on a read miss, and TTL or write-behind strategies both permit a window of staleness.

Why:

Write-through updates the cache and the database in the same write path, so the just-written value is already in cache for the next read — you get fresh reads immediately at the cost of a slower write, exactly the trade the team accepted. Plain cache-aside only populates the cache on a read miss, so the entry written can be stale until the next miss (and a delete-on-write is needed to avoid serving the old value). Write-behind acknowledges before the DB is durable, trading consistency and durability for write speed — the opposite of the stated priority. TTL-only caching guarantees staleness for up to the TTL window, which violates the "fresh immediately" requirement.

System Design/sd-data/cap-consistency

Per the CAP theorem, during an active network partition a distributed datastore must choose between two properties. Which pair describes the real choice it faces?

Options

  • Consistency or Availability — partition tolerance is a given for any networked system
  • Consistency or Partition tolerance — availability is always preserved
  • Availability or Partition tolerance — consistency is always preserved
  • Latency or Durability — CAP is purely about write performance
Show answer

The real choice is Consistency or Availability — partition tolerance is a given for any networked system. CAP says that when a partition occurs, a system can preserve either linearizable consistency or availability, not both. You cannot opt out of partitions in a real distributed system, so P is assumed and the live decision is C-versus-A.

Why:

CAP says that when a partition occurs, a system can preserve either linearizable Consistency or Availability, not both. Partition tolerance is not optional in a real distributed system — networks drop packets and nodes fail — so P is assumed, and the live decision is C-vs-A. The other framings misstate this: you cannot "choose" to forgo partition tolerance and still be distributed, and consistency is precisely the property at risk, not the one that is guaranteed. Latency/durability describes PACELC's else-clause and write tuning, which is a separate axis from the partition-time CAP choice.

System Design/sd-data/sharding

You are sharding a high-write events table across many partitions. Using created_at (a monotonically increasing timestamp) as the shard key causes one shard to absorb nearly all writes. What is the best remedy?

Options

  • Shard on a high-cardinality key such as a hash of the tenant/entity id so writes spread evenly
  • Add more replicas to the busy shard so it can keep up
  • Increase the shard count but keep created_at as the key
  • Switch the busy shard to a larger instance type (vertical scale)
Show answer

Shard on a high-cardinality key, such as a hash of the tenant or entity id, so writes spread evenly across partitions. A monotonic timestamp routes every current write to whichever shard owns the latest range, creating a hot partition no matter how many shards exist. Adding replicas or vertically scaling the busy shard only raises one box's ceiling; neither balances the write distribution.

Why:

A monotonic timestamp routes all current writes to whichever shard owns the latest range, creating a hot partition no matter how many shards exist. Choosing a high-cardinality key (a hash of tenant or entity id) distributes writes uniformly, which is the actual fix. Adding replicas helps read load but not the write hotspot — replicas still funnel writes through one leader. Raising the shard count without changing the key leaves the newest range concentrated on a single shard. Vertically scaling the hot shard only raises its ceiling and reintroduces a single point of contention; it does not balance the distribution.

System Design/sd-data/replication

You offload reads to asynchronous read replicas. Right after a user updates their profile and is redirected to view it, they sometimes see the old data. Which approach fixes this read-your-writes problem with the least global cost?

Options

  • Route that user's reads to the primary for a short window after their write
  • Make all reads go to the primary and remove the replicas
  • Switch replication from asynchronous to fully synchronous for every replica
  • Add a 5-second client-side delay before showing the profile page
Show answer

Route that one user's reads to the primary for a short window after their write. The symptom is replication lag — the async replica hasn't applied the write yet — so pinning just that user to the primary gives them read-your-writes consistency while everyone else keeps the replica offload. Sending all reads to the primary or forcing synchronous replication is far more expensive than the problem warrants.

Why:

The symptom is replication lag: an async replica hasn't applied the user's write yet. Pinning that specific user's reads to the primary for a brief post-write window gives them read-your-writes consistency while everyone else still benefits from replica offload — minimal global cost. Sending all reads to the primary throws away the scaling benefit the replicas existed to provide. Forcing fully synchronous replication makes every write wait on every replica, hurting write latency and availability far beyond what this problem warrants. A fixed client-side delay is a guess — lag is variable, so it is both unreliable and a poor user experience.

System Design/sd-patterns/rate-limiting

An API must allow clients a sustained rate of 10 req/s but also tolerate short bursts (e.g. a client firing 50 requests after being idle). Which rate-limiting algorithm naturally supports this?

Options

  • Token bucket
  • Leaky bucket (as a queue with a fixed drain rate)
  • Fixed window counter
  • A hard concurrency semaphore of 1
Show answer

Use a token bucket. It refills tokens at the sustained rate and lets unused tokens accumulate up to the bucket size, so an idle client banks capacity and can spend it in a burst — exactly the behaviour required. A leaky bucket deliberately flattens bursts to a fixed drain rate, and a fixed-window counter is bursty across boundaries in the wrong way.

Why:

Token bucket refills tokens at the sustained rate and lets unused tokens accumulate up to the bucket size, so an idle client banks capacity and can spend it in a burst — exactly the behaviour required. A leaky bucket smooths output to a fixed drain rate, deliberately flattening bursts, which is the opposite of what we want. A fixed window counter is bursty in the wrong way: it permits double the limit across a window boundary while still rejecting legitimate bursts inside a window. A concurrency semaphore of 1 limits parallelism, not request rate, and would serialise the client rather than rate-limit it.

System Design/sd-patterns/circuit-breaker

A downstream payment provider starts timing out. Your service keeps retrying, threads pile up waiting on it, and the whole service becomes unresponsive. Which pattern most directly prevents this cascading failure?

Options

  • A circuit breaker that trips open after a failure threshold and fails fast until the dependency recovers
  • Increasing the request timeout so slow calls have more time to succeed
  • Unbounded retries with a fixed short delay between attempts
  • Adding more application instances to absorb the blocked threads
Show answer

Add a circuit breaker that trips open after a failure threshold and fails fast until the dependency recovers. It detects a run of failures, short-circuits subsequent calls instead of letting them stack up on a dead dependency, then probes for recovery via a half-open state — directly stopping the thread exhaustion. Raising timeouts or adding retries or instances only feeds the cascade.

Why:

A circuit breaker detects a run of failures, trips to the open state, and short-circuits subsequent calls — failing fast instead of letting requests stack up on a dead dependency — then probes for recovery via a half-open state. That directly stops the thread exhaustion and cascade. Raising the timeout makes things worse: each call holds a thread longer, accelerating exhaustion. Unbounded fixed-delay retries amplify load on an already-struggling dependency (a retry storm) and prevent it from recovering. Adding instances just provides more threads to be consumed by the same stuck calls, scaling the failure rather than containing it.

System Design/sd-reliability/observability-sd

A request that fans out across six microservices is intermittently slow, and per-service dashboards each look healthy in isolation. Which observability signal is purpose-built to locate where the latency is spent across that single request's path?

Options

  • Distributed tracing with a propagated trace/span context across all services
  • Per-service aggregate latency metrics (p50/p95 counters)
  • Structured logs grepped independently on each service
  • An uptime health-check endpoint polled every 30 seconds
Show answer

Distributed tracing with a propagated trace/span context across all services. It stitches together the spans of one request as it crosses service boundaries, so you can see exactly which hop consumed the time — the precise need when each service looks healthy on its own. Aggregate metrics average away the slow request, and per-service logs can't reconstruct one request's journey without a shared trace id.

Why:

Distributed tracing stitches together the spans of one request as it crosses service boundaries via a propagated trace context, so you can see exactly which hop consumed the time — the precise need when each service looks fine on its own. Aggregate metrics tell you that p95 is high but average away the individual slow request and cannot attribute latency across the chain. Logs are per-service and unordered relative to each other; without a shared trace id you cannot reliably reconstruct one request's journey across all six. A health check only answers up/down liveness and says nothing about per-request latency distribution. Metrics, logs, and traces are complementary, but tracing is the one built for cross-service request attribution.

System Design/sd-patterns/message-queues

Your checkout flow synchronously calls an email service, an analytics pipeline, and a fraud-scoring job, and a slow dependency now blocks orders. You introduce a message queue so checkout publishes events and workers consume them. Which benefits does this asynchronous decoupling genuinely provide?

Options

  • Checkout no longer blocks on slow consumers — it returns once the event is enqueued
  • The queue absorbs traffic spikes, buffering work so consumers can drain at their own pace
  • A temporarily down consumer can recover and process backlog without losing events
  • It guarantees end-to-end latency is lower than the synchronous version for every request
  • It removes the need for consumers to handle duplicate deliveries
Show answer

The real benefits are that checkout no longer blocks on slow consumers (it returns once the event is enqueued), the queue absorbs traffic spikes and buffers work so consumers drain at their own pace, and a temporarily down consumer can recover and process the backlog without losing events. It does not guarantee lower per-request end-to-end latency, and it does not remove the need to handle duplicate deliveries.

Why:

A queue decouples producers from consumers: checkout returns as soon as the event is enqueued (a), the queue acts as a buffer that smooths spikes so consumers process at a sustainable rate (b), and durable queues retain messages so a consumer that was down can drain the backlog on recovery (c). The two wrong options reflect common misconceptions. Async processing does not lower per-request end-to-end latency (d) — the downstream work still happens, just later; you trade latency-to-completion for responsiveness and resilience. And most queues offer at-least-once delivery, so consumers must be idempotent to tolerate duplicates (e); the queue does not remove that obligation.

System Design/sd-patterns/idempotency-keys

A mobile client retries a POST /payments when the network drops, risking a double charge. You add idempotency keys. Which statements about implementing them correctly are true?

Options

  • The client generates a unique key per logical operation and resends the same key on retries
  • The server persists the key with the operation's result and replays the stored result on a repeated key
  • Idempotency keys are most valuable for non-idempotent verbs like POST, where natural retries are unsafe
  • Generating a fresh key on each retry attempt is the recommended approach
  • Idempotency keys are unnecessary because TCP already guarantees exactly-once delivery
Show answer

Three statements are correct: the client generates a unique key per logical operation and resends the same key on retries, the server persists the key with the operation's result and replays the stored result on a repeat, and idempotency keys are most valuable for non-idempotent verbs like POST. Generating a fresh key per retry defeats the mechanism, and TCP guarantees reliable bytes, not application-level exactly-once.

Why:

Idempotency keys work because the client mints one key per logical operation and reuses it across retries (a), and the server records that key alongside the result so a repeat presents the same outcome instead of executing twice (b). They matter most for non-idempotent methods like POST (c) — GET/PUT/DELETE are already idempotent by definition, so a blind retry of those is safe. The wrong options break the mechanism: minting a new key per attempt (d) defeats the whole point — the server sees each retry as a distinct operation and double-charges. And TCP guarantees reliable, ordered bytes within a connection, not application-level exactly-once semantics across reconnects and timeouts (e); that is exactly the gap idempotency keys fill.

System Design/sd-data/sql-vs-nosql

You are choosing between a relational database and a document/wide-column NoSQL store for a new service. Which considerations correctly favour reaching for NoSQL over a single relational primary?

Options

  • Access patterns are known and key-based, and you need predictable single-digit-ms reads at very high scale
  • The schema is flexible/evolving and writes are massive, with horizontal partitioning built in
  • You can model the workload to avoid cross-entity joins and tolerate eventual consistency
  • The core requirement is multi-row ACID transactions spanning several related tables
  • You rely heavily on ad-hoc analytical queries with complex joins and aggregations
Show answer

NoSQL is favoured when access patterns are known and key-based with a need for predictable single-digit-ms reads at very high scale, when the schema is flexible and writes are massive with horizontal partitioning built in, and when you can model the workload to avoid cross-entity joins and tolerate eventual consistency. Multi-row ACID transactions across related tables and ad-hoc analytical queries with complex joins are signals to stay relational.

Why:

NoSQL shines when access is key-based and predictable at scale (a), when the schema is fluid and write volume is huge with native horizontal partitioning (b), and when you can denormalise to avoid joins and accept eventual consistency (c). The remaining two are signals to stay relational. Multi-row ACID transactions across related tables (d) are exactly what a relational engine guarantees cleanly, whereas many NoSQL stores limit transactions to a single partition or item. Ad-hoc analytical queries with complex joins and aggregations (e) are the relational/SQL sweet spot; forcing them onto a key-value or document model leads to expensive scatter-gather or duplicated denormalised data. The choice is driven by access patterns and consistency needs, not by one store being universally "better."

System Design/sd-reliability/availability

You are hardening a service toward higher availability and want to eliminate single points of failure. Which design choices meaningfully improve availability through redundancy and failover?

Options

  • Run stateless app instances across multiple availability zones behind a health-checking load balancer
  • Use a replicated database with automated leader failover (promote a follower on primary loss)
  • Add health checks so the load balancer stops routing to unhealthy instances automatically
  • Deploy every component into a single availability zone to minimize cross-zone latency
  • Run a single oversized, highly reliable instance instead of several smaller ones
Show answer

Availability comes from redundancy with automatic failover: run stateless app instances across multiple availability zones behind a health-checking load balancer, use a replicated database with automated leader failover, and add health checks so the balancer stops routing to unhealthy instances. Deploying everything into a single AZ, or running one oversized instance, each reintroduces a single point of failure that takes the whole service down when it fails.

Why:

Availability comes from redundancy with automatic failover and removing single points of failure. Spreading stateless instances across multiple AZs behind a health-checking balancer (a) survives an instance or whole-zone outage; a replicated DB with automated leader promotion (b) removes the database as a single point of failure; and health checks that eject unhealthy nodes (c) keep traffic flowing only to working capacity. The two wrong options reduce availability. Putting everything in one AZ (d) trades resilience for latency — a single zone outage takes the whole service down. A single oversized instance (e) is the textbook single point of failure no matter how reliable the box: when it dies, you have zero capacity, which is why N+1 redundancy beats one big node.

System Design/sd-fundamentals/cdn-edge

Your global app serves large static assets (images, JS bundles, video) and origin egress plus latency are hurting. You put a CDN / edge cache in front. Which outcomes are correct expectations of this change?

Options

  • Cached assets are served from edge PoPs close to users, cutting round-trip latency
  • Origin load and egress drop because the CDN absorbs repeat requests for cacheable content
  • Cache-Control / TTL headers govern how long edges serve content before revalidating with origin
  • Highly personalized, per-user dynamic API responses become trivially cacheable at the edge
  • A bad deploy is harmless because edges instantly reflect every origin change with no invalidation needed
Show answer

The correct expectations are that cached assets are served from edge PoPs close to users (cutting round-trip latency), origin load and egress drop because the CDN absorbs repeat requests for cacheable content, and Cache-Control/TTL headers govern how long edges serve content before revalidating. Highly personalized per-user responses are not trivially cacheable at a shared edge, and edges do not instantly reflect origin changes — stale objects live until TTL or an explicit purge.

Why:

A CDN caches content at edge points of presence near users (a), so it lowers latency and offloads repeat requests from the origin, cutting load and egress (b), with freshness controlled by Cache-Control/TTL and revalidation (c). The wrong options are classic edge-caching traps. Personalized, per-user dynamic responses are generally not cacheable at a shared edge (d) — caching them risks leaking one user's data to another, so they need private/no-store handling or edge-compute personalization, not naive caching. And edges do not instantly reflect origin changes (e): cached objects live until their TTL or an explicit purge/invalidation, which is precisely why a bad asset deploy can keep serving stale content until you invalidate the cache or bust the URL.

System Design/sd-fundamentals/load-balancing

What is the difference between an L4 and an L7 load balancer, and when would you choose each?

Show answer

An L4 load balancer routes on transport-layer info (IP and TCP/UDP port) without inspecting payload, so it is extremely fast and protocol-agnostic. An L7 load balancer terminates the connection and routes on application-layer data — HTTP path, headers, cookies, host — enabling content-based routing, sticky sessions, TLS termination, and per-route rate limiting. Choose L4 for raw throughput and non-HTTP traffic; choose L7 when you need smart routing, observability, or to split traffic by URL/service.

Why:

The two layers are often combined: an L4 balancer fronts a fleet of L7 proxies for scale, while the L7 tier handles routing logic. The cost of L7 is the per-request parsing and connection termination overhead, which is why latency-critical or non-HTTP paths sometimes stay at L4.

System Design/sd-fundamentals/scalability

A service is degrading under a sustained traffic surge. Order the scaling response from cheapest/fastest to most invasive.

Put these in order

  • Confirm the bottleneck via metrics/dashboards (CPU, latency, queue depth)
  • Add a cache / raise cache TTLs to shed repeat reads off the backend
  • Horizontally scale out the stateless app tier behind the load balancer
  • Scale the data tier — add read replicas, then shard the database
Show answer

Scale from cheapest and fastest to most invasive, measuring before acting:

  1. Confirm the bottleneck via metrics/dashboards (CPU, latency, queue depth).
  2. Add a cache or raise cache TTLs to shed repeat reads off the backend.
  3. Horizontally scale out the stateless app tier behind the load balancer.
  4. Scale the data tier — add read replicas, then shard the database.
Why:

Always measure before acting so you scale the actual bottleneck, not a guess. Caching is the cheapest lever because it removes work entirely; adding stateless app instances is straightforward when the tier holds no session state; resharding the data tier is last because it is the most disruptive and hardest to reverse. The ordering reflects rising blast radius and engineering cost.

Related interview questions

Job market

See system-design salaries and hiring demand from live job postings.

Practice this for real

CodePrep turns your target job description into an adaptive quiz from a bank of tagged questions, scores your answers, and resurfaces the topics you miss.

New topics and job-market signal, in your inbox

Occasional updates — new question topics, launch news, and what the developer job market is hiring for. Confirm your email to join, and unsubscribe anytime.