Faults and Partial Failures
- Single-machine failures are typically binary (working or completely down).
- In distributed systems, components may fail independently, causing complex failure modes.
- Examples: network partitions, hardware failures, slow responses.
Cloud Computing vs. Supercomputing
| Aspect | Supercomputing | Cloud Computing |
|--------|----------------|-----------------|
| Failure Handling | Entire system stops and restarts | Some parts fail, others keep running |
| Hardware | High-reliability, specialized | Commodity, higher failure rates |
| Use Case | Scientific computing | Web services, scalable apps |
Cloud systems must tolerate partial failures and implement redundancy, failover mechanisms, and automated recovery.
Building Reliable Systems from Unreliable Components
- Error-Correcting Codes (ECC): Detect and correct corrupted bits in memory or on the wire.
- TCP: Provides reliable, ordered delivery on top of the unreliable IP layer by retransmitting lost packets and discarding duplicates (sketched after this list).
- Distributed Databases: Replication and consensus protocols (Paxos, Raft) provide availability and consistency despite individual node failures.
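The core idea behind TCP-style reliability can be sketched in a few lines: keep retransmitting until the receiver acknowledges. The lossy channel and its loss rate below are a made-up simulation, not a real network stack.

```python
import random

def unreliable_send(message):
    """Hypothetical lossy channel: drops roughly 30% of messages with no feedback."""
    if random.random() < 0.3:
        return None                 # message lost in transit
    return f"ACK:{message}"         # receiver acknowledges

def reliable_send(message, max_attempts=5):
    """Retransmit until acknowledged -- the same idea TCP layers on top of IP."""
    for attempt in range(1, max_attempts + 1):
        reply = unreliable_send(message)
        if reply is not None:
            return reply
    raise TimeoutError(f"no ACK for {message!r} after {max_attempts} attempts")

print(reliable_send("hello"))
```

Even `reliable_send` can give up if every attempt is lost: reliability built on unreliable components raises the probability of success, it cannot make it certain.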
Best Practices for Handling Failures
- Expect Failures: Assume nodes will fail.
- Use Redundancy: Replicate data and services.
- Design for Partial Failures: Avoid system-wide crashes.
- Monitor & Detect Failures: Health checks and alerts.
- Test Failure Scenarios: Use Chaos Engineering.
Unreliable Networks in Distributed Systems
Why Networks Are Unreliable
- Packets may be lost, delayed, or duplicated.
- Network partitions and congestion introduce uncertainty.
Handling Network Failures
- Timeouts: Assume a request has failed after a bounded wait (see the probe sketch after this list).
- Explicit Error Messages: Use explicit signals such as TCP RST packets or ICMP error messages where they are available.
- Retries: Be mindful of duplicate operations.
- Leader Election: Handle leader failures gracefully.
- Network Partitions: Use quorum-based decision-making.
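As a concrete illustration of timeout-based detection, here is a minimal probe that treats a bounded wait as the failure signal. The host, port, and timeout value are placeholders, and a timeout only means the node is *suspected* to have failed.

```python
import socket

def probe_node(host, port, timeout_s=1.0):
    """Try to open a TCP connection within a bounded wait."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True             # node answered in time
    except OSError:
        return False                # down, slow, or partitioned -- we cannot tell which
```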
Best Practices
- Assume failures will happen.
- Monitor network health.
- Test network failures (Chaos Engineering).
- Design idempotent operations to handle duplicate requests.
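A common way to make retries safe is an idempotency key: the client attaches a unique key to each logical operation, and the server replays the stored result for duplicates. This is a minimal in-memory sketch; the transfer operation is hypothetical, and a real service would persist the keys durably.

```python
import uuid

processed = {}  # idempotency key -> result (stand-in for a durable store)

def transfer(idempotency_key, account, amount):
    """Apply a payment at most once, even if the client retries the request."""
    if idempotency_key in processed:
        return processed[idempotency_key]        # duplicate request: replay the old result
    result = f"debited {amount} from {account}"  # the actual side effect would go here
    processed[idempotency_key] = result
    return result

key = str(uuid.uuid4())              # client generates one key per logical operation
print(transfer(key, "acct-42", 10))
print(transfer(key, "acct-42", 10))  # retry is harmless: same result, no double debit
```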
Timeouts and Unbounded Delays
Challenges
- Fixed timeouts can be too long (slow detection) or too short (false positives).
- Network congestion, TCP retransmissions, and virtualization add unpredictability.
Solutions
- Adaptive Timeouts: Adjust the timeout dynamically based on observed response times.
- Exponential Backoff: Increase the delay between retries (see the retry sketch after this list).
- Phi Accrual Failure Detection: Reports a continuous suspicion level rather than a binary up/down verdict; used in systems like Cassandra.
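A small sketch of retries with exponential backoff plus jitter, assuming a hypothetical operation that raises ConnectionError or TimeoutError on failure; the delay parameters are illustrative.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry with exponentially growing, jittered delays so clients do not retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise                                    # give up after the last attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))         # "full jitter" variant

# Usage (hypothetical flaky call):
# call_with_backoff(lambda: fetch_from_replica("user:42"))
```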
Unreliable Clocks in Distributed Systems
Problems with Clocks
- Clock Drift: Each machine's quartz clock runs at a slightly different rate, so clocks drift apart without synchronization.
- NTP Issues: Synchronization accuracy is limited by network delay, and NTP corrections can step a clock forwards or backwards.
- Leap Seconds: Can cause unexpected behavior.
Best Practices
- Use Monotonic Clocks for measuring time intervals, since they cannot jump backwards.
- Use Logical Clocks (Lamport clocks, vector clocks) to order events instead of relying on physical time (see the sketch after this list).
- Fencing Tokens prevent outdated clients from modifying data.
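A Lamport clock is small enough to sketch in full: every node keeps a counter, increments it on each event, and fast-forwards past any timestamp it receives, so causally later events always get larger timestamps.

```python
class LamportClock:
    """Logical clock: orders events by causality rather than by wall-clock time."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        """Local event or message send: advance the counter."""
        self.counter += 1
        return self.counter

    def receive(self, msg_timestamp):
        """Message receipt: jump past the sender's timestamp."""
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
t = a.tick()          # node A sends a message stamped 1
print(b.receive(t))   # node B's clock becomes 2: the receive is ordered after the send
```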
Process Pauses and Leader Election
Common Causes
- Garbage Collection (GC) Pauses
- Virtual Machine (VM) Suspension
- CPU Load and Slow I/O
Mitigation Strategies
- Avoid relying on time-of-day clocks for lease expiry; a paused process may wake up and act on a lease that has already expired (see the sketch after this list).
- Implement leader resignation protocols.
- Monitor and restart nodes after long pauses.
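A lease checked against the monotonic clock at least notices, after a long pause, that it has expired and the node must stop acting as leader. This is a simplified sketch with illustrative durations; on its own it still leaves a race between the check and the action, which is what fencing tokens (next section) address.

```python
import time

class Lease:
    """Lease expiry measured with a monotonic clock, which NTP cannot step backwards."""

    def __init__(self, duration_s):
        self.duration_s = duration_s
        self.renewed_at = time.monotonic()

    def renew(self):
        self.renewed_at = time.monotonic()

    def is_valid(self):
        return time.monotonic() - self.renewed_at < self.duration_s

lease = Lease(duration_s=10.0)
# ... imagine a long GC pause or VM suspension happens here ...
if not lease.is_valid():
    print("lease expired during the pause; stop acting as leader")
```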
Knowledge, Truth, and Lies in Distributed Systems
The Role of Quorums
- Decisions require agreement from a majority of nodes, so they remain correct even when some nodes fail.
- A node cannot trust its own judgment: after a pause or partition it may wrongly believe it is still active.
The Split-Brain Problem
- A node may think it's still the leader while another has taken over.
- Fencing Tokens prevent outdated clients from corrupting data.
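A fencing token scheme can be sketched as a storage service that remembers the highest token it has seen and rejects anything older. The token values and the in-memory store below are illustrative.

```python
class FencedStore:
    """Storage that rejects writes carrying a token older than one it has already seen."""

    def __init__(self):
        self.max_token_seen = 0
        self.data = {}

    def write(self, key, value, fencing_token):
        if fencing_token < self.max_token_seen:
            raise PermissionError(f"stale token {fencing_token}: a newer leader exists")
        self.max_token_seen = fencing_token
        self.data[key] = value

store = FencedStore()
store.write("x", "write from leader holding token 34", fencing_token=34)
store.write("x", "write from new leader, token 35", fencing_token=35)
try:
    store.write("x", "late write from the paused old leader", fencing_token=34)
except PermissionError as err:
    print(err)    # the stale leader's write is rejected
```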
Byzantine Faults
Understanding Byzantine Faults
- Nodes may behave arbitrarily, lying or sending corrupted messages, rather than simply crashing.
- Byzantine fault tolerance matters in settings such as blockchains and aerospace systems, where not every node can be trusted.
Ensuring Correctness
| Model | Assumptions |
|--------|------------|
| Synchronous | Fixed upper bounds on network delays |
| Partially Synchronous | Mostly behaves like a synchronous system, with occasional unbounded delays |
| Asynchronous | No assumptions about timing |
Best Practices
- Assume nodes can fail unpredictably.
- Use quorum-based voting to avoid split-brain scenarios (see the sketch after this list).
- Design for safety (correctness) and liveness (availability).
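Majority quorums are the standard guard against split brain: any two majorities overlap, so two conflicting decisions cannot both gather enough votes. A minimal illustration:

```python
def has_quorum(votes, cluster_size):
    """True only if a strict majority of all nodes (not just reachable ones) agrees."""
    return votes > cluster_size // 2

print(has_quorum(3, cluster_size=5))   # True: tolerates 2 failed or unreachable nodes
print(has_quorum(2, cluster_size=5))   # False: a minority must not decide on its own
```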