DDIA Reading Notes - Chapter 8

Feb 19, 2025

Faults and Partial Failures

  • Single-machine failures are typically binary (working or completely down).
  • In distributed systems, components may fail independently, causing complex failure modes.
  • Examples: network partitions, hardware failures, slow responses.

Cloud Computing vs. Supercomputing

| Aspect | Supercomputing | Cloud Computing | |--------|---------------|-----------------| | Failure Handling | Entire system stops and restarts | Some parts fail, others keep running | | Hardware | High-reliability, specialized | Commodity, higher failure rates | | Use Case | Scientific computing | Web services, scalable apps |

Cloud systems must tolerate partial failures and implement redundancy, failover mechanisms, and automated recovery.

Building Reliable Systems from Unreliable Components

  • Error-Correcting Codes (ECC): Recover corrupted data.
  • TCP: Ensures packet reliability despite unreliable IP.
  • Distributed Databases: Replication and consensus (Paxos, Raft) provide availability and consistency.

Best Practices for Handling Failures

  1. Expect Failures: Assume nodes will fail.
  2. Use Redundancy: Replicate data and services.
  3. Design for Partial Failures: Avoid system-wide crashes.
  4. Monitor & Detect Failures: Health checks and alerts.
  5. Test Failure Scenarios: Use Chaos Engineering.

Unreliable Networks in Distributed Systems

Why Networks Are Unreliable

  • Packets may be lost, delayed, or duplicated.
  • Network partitions and congestion introduce uncertainty.

Handling Network Failures

  1. Timeouts: Assume failure after a set delay.
  2. Explicit Error Messages: Use TCP RST, ICMP messages.
  3. Retries: Be mindful of duplicate operations.
  4. Leader Election: Handle leader failures gracefully.
  5. Network Partitions: Use quorum-based decision-making.

Best Practices

  • Assume failures will happen.
  • Monitor network health.
  • Test network failures (Chaos Engineering).
  • Design idempotent operations to handle duplicate requests.

Timeouts and Unbounded Delays

Challenges

  • Fixed timeouts can be too long (slow detection) or too short (false positives).
  • Network congestion, TCP retransmissions, and virtualization add unpredictability.

Solutions

  • Adaptive Timeouts: Adjust dynamically.
  • Exponential Backoff: Increase delay between retries.
  • Phi Accrual Failure Detection: Used in systems like Cassandra.

Unreliable Clocks in Distributed Systems

Problems with Clocks

  • Clock Drift: Different machines have unsynchronized clocks.
  • NTP Issues: Network delays affect synchronization.
  • Leap Seconds: Can cause unexpected behavior.

Best Practices

  • Use Monotonic Clocks for measuring time intervals.
  • Use Logical Clocks (Lamport, Vector Clocks) instead of physical time.
  • Fencing Tokens prevent outdated clients from modifying data.

Process Pauses and Leader Election

Common Causes

  • Garbage Collection (GC) Pauses
  • Virtual Machine (VM) Suspension
  • CPU Load and Slow I/O

Mitigation Strategies

  • Avoid using system clocks for leases.
  • Implement leader resignation protocols.
  • Monitor and restart nodes after long pauses.

Knowledge, Truth, and Lies in Distributed Systems

The Role of Quorums

  • Majority votes ensure correctness in failure scenarios.
  • Nodes may incorrectly assume they are still active after a failure.

The Split-Brain Problem

  • A node may think it's still the leader while another has taken over.
  • Fencing Tokens prevent outdated clients from corrupting data.

Byzantine Faults

Understanding Byzantine Faults

  • Nodes may fail maliciously, not just crash.
  • Used in blockchains and aerospace systems.

Ensuring Correctness

| Model | Assumptions | |--------|------------| | Synchronous | Fixed limits on network delays | | Partially Synchronous | Mostly behaves like synchronous, occasional outliers | | Asynchronous | No assumptions about timing |

Best Practices

  • Assume nodes can fail unpredictably.
  • Use Quorum-based voting to avoid split-brain scenarios.
  • Design for safety (correctness) and liveness (availability).
Jiayi Li