A distributed system is a collection of independent computers that appear to users as a single coherent system. They communicate over a network to coordinate their actions and share resources, but each maintains its own local state and can fail independently.
Your web application is almost certainly a distributed system. It runs on multiple servers, talks to separate database servers, uses a caching layer, queues work to background jobs, and calls third-party APIs. All of these components working together—that's a distributed system.
Distributed systems are hard for a specific reason: you can never know the truth. You can't know if a message arrived. You can't know if a node is dead or just slow. You can't even know what time it is. This fundamental uncertainty changes everything.
Why Distribute at All?
Scalability: A single machine has limits. Distributing work across many machines lets you handle more requests, store more data, and process more computations than any single computer ever could.
Reliability: If everything runs on one machine, hardware failure brings down everything. Distributed systems can keep running when individual components fail.
Geography: Placing servers near users reduces latency. European users connect to European servers, avoiding hundreds of milliseconds of intercontinental delay.
Cost: A cluster of inexpensive commodity servers often costs less than a single, much more powerful specialized machine.
The Fundamental Challenges
These problems don't exist—or are trivial—in single-machine systems.
Network Failures
In a single machine, function calls work or crash the whole program. In distributed systems, network calls fail in countless ways:
- Complete failure: The network is down. Messages never arrive.
- Partial failure: Messages arrive at some components but not others, creating inconsistent states.
- Slow networks: Messages arrive but take much longer than expected. Has the call failed, or is it just slow? You can't tell.
- Message reordering: Messages sent in one order arrive in a different order.
- Message duplication: The same message arrives multiple times.
Applications must handle all of these gracefully.
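As a concrete illustration, here is a minimal sketch of a single remote call that has to handle several of these failure modes separately. It assumes the Python `requests` library and a hypothetical URL; the point is the shape of the error handling, not the specific endpoint.

```python
import requests

def fetch_status(url):
    """Call a remote endpoint and handle the distinct ways the call can fail."""
    try:
        # A timeout bounds how long we wait; without one, a slow network hangs us forever.
        response = requests.get(url, timeout=2.0)
        response.raise_for_status()   # the message arrived, but the server may have refused it
        return response.json()
    except requests.exceptions.Timeout:
        # Failed or just slow? We can't tell -- the server may still have processed the request.
        return None
    except requests.exceptions.ConnectionError:
        # The network or the remote host is unreachable right now.
        return None
    except requests.exceptions.HTTPError as exc:
        # The request arrived but was rejected.
        print(f"server rejected request: {exc}")
        return None
```

The timeout branch is the uncomfortable one: the caller genuinely cannot know whether the request was processed.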
Independent Failures
Components fail independently. One server runs perfectly while another crashes. The database is available but the cache is down.
This creates scenarios impossible in single-machine systems: half your application servers work while half don't. The database is up but unreachable due to network issues between you and it.
No Global Clock
In single-machine systems, you check the time and trust it. In distributed systems, each machine has its own clock, and these clocks drift apart.
Clock skew means different machines disagree about the current time. Even with NTP synchronization, clocks might differ by tens or hundreds of milliseconds.
Time can even go backward. When a clock is corrected (an NTP step adjustment, a leap second), it can jump back, breaking any assumption that timestamps always increase.
This makes "did A happen before B?" surprisingly difficult to answer.
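One practical consequence in code: never measure durations with the wall clock. A monotonic clock only moves forward on the local machine (though it cannot be compared across machines). A minimal Python sketch:

```python
import time

start_wall = time.time()       # wall-clock time: can jump backward after an NTP correction
start_mono = time.monotonic()  # monotonic time: only moves forward, but only meaningful locally

time.sleep(0.1)                # stand-in for the operation being timed

elapsed_wall = time.time() - start_wall       # can come out negative if the clock was stepped
elapsed_mono = time.monotonic() - start_mono  # always >= 0 on this machine
print(elapsed_wall, elapsed_mono)
```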
Partial Knowledge
No component has complete knowledge of the system's state. Each only knows what it's directly observed or been told.
If you don't receive a response, you don't know whether:
- Your message never arrived
- The component crashed before responding
- The component processed your request but the response was lost
- The response is delayed and will arrive soon
This uncertainty isn't a bug to fix—it's the permanent condition of distributed computing.
Consistency, Availability, and the Tradeoff
Consistency
Consistency means all nodes agree on the current state of data. After writing, when do reads reflect that write?
Strong consistency: Reads immediately see all committed writes. All nodes always agree.
Eventual consistency: Writes propagate to all nodes eventually, but there's a window where different nodes have different data. Given enough time with no new writes, everything converges.
Causal consistency: Causally related operations appear in the same order everywhere, but independent operations might appear in different orders.
Stronger guarantees cost more—usually in latency or availability.
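To make the eventual-consistency window concrete, here is a toy sketch in which two in-memory dictionaries stand in for a primary and a replica, and replication is applied asynchronously after a short delay. A read served by the replica inside that window returns stale (here, missing) data:

```python
import threading
import time

primary = {}   # toy primary replica
replica = {}   # toy lagging replica

def write(key, value):
    """Write to the primary, then replicate asynchronously after a delay."""
    primary[key] = value

    def replicate():
        time.sleep(0.5)          # network + apply delay: the inconsistency window
        replica[key] = value

    threading.Thread(target=replicate).start()

write("greeting", "hello")
print(replica.get("greeting"))   # None: the write hasn't propagated yet
time.sleep(1.0)
print(replica.get("greeting"))   # "hello": given time and no new writes, the replicas converge
```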
Availability
Availability means the system responds to requests. A highly available system keeps functioning when components fail.
The CAP Theorem
Networks can partition—split into groups that can't communicate with each other. This will happen. The system must handle it.
The CAP theorem states that during partitions, you must choose between consistency and availability. You cannot have both.
If you require consistency, some requests must fail during partitions (to prevent inconsistent responses). If you require availability, you must accept potentially inconsistent responses during partitions.
Common Patterns
Replication
Storing the same data on multiple nodes provides redundancy. If one fails, others have the data.
Primary-replica: One node accepts writes, others serve reads. Simple, but the primary is a single point of failure for writes.
Multi-primary: Multiple nodes accept writes. Better availability, but requires conflict resolution when the same data is modified on different primaries.
Quorum: Operations require agreement from a majority of nodes. Strong consistency while tolerating minority failures.
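A minimal sketch of the quorum idea, using toy in-memory replicas and versioned values. With N replicas, a write quorum of W and a read quorum of R, choosing W + R > N guarantees that every read quorum overlaps every write quorum, so a read always sees at least one up-to-date copy:

```python
N, W, R = 3, 2, 2                       # replicas, write quorum, read quorum: W + R > N
replicas = [dict() for _ in range(N)]   # toy in-memory replicas: key -> (version, value)

def quorum_write(key, value, version):
    acked = 0
    for node in replicas:               # in reality these are network calls that can fail
        node[key] = (version, value)
        acked += 1
        if acked >= W:                  # stop once a write quorum has acknowledged
            return True
    return False                        # couldn't reach a write quorum

def quorum_read(key):
    responses = [node[key] for node in replicas[:R] if key in node]
    if not responses:
        return None
    return max(responses)[1]            # highest version wins; quorum overlap guarantees freshness

quorum_write("x", "42", version=1)
print(quorum_read("x"))                 # "42"
```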
Partitioning (Sharding)
Dividing data across nodes so each handles a subset. This scales capacity beyond single-machine limits.
Hash-based: Determine which node owns data by hashing a key. Even distribution, but range queries become difficult.
Range-based: Nodes handle specific ranges (A-F, G-M, N-Z). Enables range queries but risks uneven distribution.
Directory-based: A lookup service tracks which node holds what. Flexible, but the directory can become a bottleneck.
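As a sketch of the hash-based approach, the lookup is only a few lines. This assumes a fixed list of hypothetical node names and uses a stable digest rather than Python's built-in hash(), which is randomized per process:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical shard names

def owner(key):
    """Map a key to the node that owns it, using a stable hash."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

print(owner("user:1234"))   # always the same node for the same key
```

The weakness of plain modulo hashing is that changing the number of nodes remaps most keys, which is the problem consistent hashing (see Load Balancing below) is designed to soften.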
Consensus
Reaching agreement among nodes about some value. Fundamental for coordination, but challenging when nodes can fail or become unreachable.
Two-phase commit (2PC): All nodes commit a transaction or all abort. Strong consistency, but blocks if the coordinator fails.
Paxos and Raft: Consensus algorithms that tolerate failures. Replicas never disagree on the chosen value, and progress continues as long as a majority of nodes can communicate. More complex than 2PC, but more robust.
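A deliberately simplified sketch of two-phase commit from the coordinator's point of view, with in-process objects standing in for remote participants. Real 2PC also logs decisions durably and handles timeouts; the blocking problem appears when the coordinator crashes between the two phases:

```python
class Participant:
    """Toy participant: votes in the prepare phase, then commits or aborts."""
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes

    def prepare(self):
        return self.will_vote_yes   # in reality: write a log entry, lock the resources

    def commit(self):
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: aborted")

def two_phase_commit(participants):
    # Phase 1 (prepare): ask everyone to vote; any "no" (or a timeout) aborts the transaction.
    if all(p.prepare() for p in participants):
        for p in participants:      # Phase 2 (commit): everyone voted yes, so everyone commits
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

ok = two_phase_commit([Participant("orders-db"), Participant("payments-db", will_vote_yes=False)])
print("committed" if ok else "aborted")   # aborted: one participant voted no
```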
Load Balancing
Distributing work across nodes to utilize capacity and prevent overload.
Round robin: Each request goes to the next node in sequence. Simple but ignores node capacity and current load.
Least connections: Requests go to the node with fewest active connections. Better balance but requires tracking state.
Consistent hashing: Adding or removing nodes minimally disrupts distribution. Useful for caching.
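A minimal consistent-hashing ring: node names are hashed onto a circle, and each key is owned by the first node clockwise from the key's hash. Virtual nodes and replication, which real implementations add, are omitted for brevity:

```python
import bisect
import hashlib

def _hash(value):
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Place each node at a point on the ring, keyed by the hash of its name.
        self._points = sorted((_hash(n), n) for n in nodes)

    def owner(self, key):
        # The key belongs to the first node at or after the key's position, wrapping around.
        hashes = [h for h, _ in self._points]
        idx = bisect.bisect(hashes, _hash(key)) % len(self._points)
        return self._points[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("user:1234"))
# Adding "node-d" only moves the keys that fall between node-d and its predecessor on the ring.
```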
Communication Patterns
Request-Response (Synchronous)
Client sends a request, waits for a response. Simple and familiar, but creates tight coupling—the client blocks until the response arrives or times out.
Message Queues (Asynchronous)
Producers send messages to queues, consumers process them independently. Decouples components—producers don't wait for consumers.
Queues provide buffering (handle load spikes), persistence (messages survive crashes), and parallelism (multiple consumers process concurrently).
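The decoupling can be sketched with Python's standard-library queue and a couple of threads. A real deployment would use a broker such as RabbitMQ or Kafka, but the shape is the same: producers put work on the queue and move on, consumers drain it at their own pace:

```python
import queue
import threading

jobs = queue.Queue()                 # buffers work between producers and consumers

def producer():
    for i in range(5):
        jobs.put(f"job-{i}")         # the producer does not wait for the work to be done

def consumer():
    while True:
        job = jobs.get()
        if job is None:              # sentinel: no more work
            break
        print(f"processing {job}")
        jobs.task_done()

workers = [threading.Thread(target=consumer) for _ in range(2)]   # parallel consumers
for w in workers:
    w.start()
producer()
jobs.join()                          # wait until every job has been processed
for _ in workers:
    jobs.put(None)                   # tell each consumer to shut down
for w in workers:
    w.join()
```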
Publish-Subscribe
Publishers send messages to topics. All subscribers to those topics receive them. One-to-many communication without tight coupling.
RPC (Remote Procedure Call)
Making network calls look like local function calls. This abstracts communication but can hide latency and failure modes, sometimes making systems harder to reason about.
Failure Handling
Timeouts
Every network operation must have a timeout. Without one, a slow or failed service causes your application to hang indefinitely.
Choosing timeout values is hard: too short and you give up on operations that are slow but still working; too long and you detect failures slowly.
Retries
When operations fail, retrying often succeeds. Network blips and transient failures resolve quickly.
Retries require care:
- Idempotency: Can you safely retry? Retrying "transfer $100" twice might transfer $200.
- Exponential backoff: Wait longer between each retry to avoid overwhelming failing services.
- Limits: Eventually give up rather than retrying forever.
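A sketch of a retry helper that applies all three rules: it assumes the operation passed in is idempotent, backs off exponentially with a little jitter so callers don't retry in lockstep, and gives up after a fixed number of attempts:

```python
import random
import time

def retry(operation, attempts=5, base_delay=0.1):
    """Run `operation`, retrying failures with exponential backoff; the operation must be idempotent."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:                    # in practice: catch only transient, retryable errors
            if attempt == attempts - 1:
                raise                        # limits: eventually give up
            delay = base_delay * (2 ** attempt)           # 0.1s, 0.2s, 0.4s, 0.8s, ...
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids synchronized retries
```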
Circuit Breakers
When a service is failing, stop sending requests to it temporarily. This prevents wasting time on requests that will fail and gives the service time to recover.
After a cooldown period, try again. If requests succeed, resume normal operation. If they fail, open the circuit again.
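A minimal circuit breaker sketch. It counts consecutive failures, fails fast while the circuit is open, and lets a trial request through once the cooldown has passed; production breakers usually formalize that trial as a separate half-open state:

```python
import time

class CircuitBreaker:
    """Stops calling a failing dependency until a cooldown has passed."""
    def __init__(self, failure_threshold=3, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None        # when the circuit was opened, or None if it is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None    # cooldown over: let a trial request through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # open (or re-open) the circuit
            raise
        self.failures = 0            # success: resume normal operation
        return result
```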
Graceful Degradation
Design systems to function (with reduced capability) when components fail.
If recommendations are down, show recent items. If image upload fails, accept text-only posts. If search is unavailable, fall back to simpler filtering.
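In code, graceful degradation is usually just a fallback path around the optional dependency. This sketch assumes two hypothetical functions, `fetch_recommendations` and `fetch_recent_items`, standing in for a personalization service and a simpler local query:

```python
def homepage_items(user_id):
    """Prefer personalized recommendations, but degrade to recent items if that service is down."""
    try:
        return fetch_recommendations(user_id)   # hypothetical call to the recommendation service
    except Exception:
        # Reduced capability, not a failed page: fall back to a simpler, more reliable source.
        return fetch_recent_items()             # hypothetical call to a recent-items store
```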
Monitoring and Observability
Distributed systems are complex. Understanding what's happening requires comprehensive monitoring:
Metrics: Quantitative measurements—request rates, error rates, latencies, resource usage.
Logs: Detailed records of events, useful for debugging specific issues.
Traces: Following individual requests through multiple services, revealing bottlenecks and failures.
Without good observability, debugging a distributed system is close to impossible.
The Distributed Systems Mindset
Embrace failure: Don't try to prevent all failures. Design for graceful handling of inevitable ones.
Think about time: Don't assume clocks are synchronized. Don't assume events have a single definitive order.
Question guarantees: "The message was sent" doesn't mean "the message arrived." "The service is up" doesn't mean "the service is reachable from here."
Test failure scenarios: Test network failures, slow responses, partial failures, cascading failures—not just happy paths.
Start simple: Distributed systems are complex. Start with the simplest design that works. Add complexity only when necessary.
Distributed systems are powerful but inherently uncertain. Understanding that uncertainty—really feeling it—is the first step to building systems that are reliable despite it.