Faults and Partial Failures
- Single-machine failures are typically binary (working or completely down).
- In distributed systems, components may fail independently, causing complex failure modes.
- Examples: network partitions, hardware failures, slow responses.
Cloud Computing vs. Supercomputing
| Aspect | Supercomputing | Cloud Computing |
|--------|----------------|-----------------|
| Failure Handling | Entire system stops and restarts | Some parts fail, others keep running |
| Hardware | High-reliability, specialized | Commodity, higher failure rates |
| Use Case | Scientific computing | Web services, scalable apps |
Cloud systems must tolerate partial failures and implement redundancy, failover mechanisms, and automated recovery.
Building Reliable Systems from Unreliable Components
- Error-Correcting Codes (ECC): Detect and correct corrupted bits in memory or on the wire.
- TCP: Provides reliable, ordered delivery on top of the unreliable IP layer by retransmitting lost packets and discarding duplicates (sketched after this list).
- Distributed Databases: Replication and consensus protocols (Paxos, Raft) provide availability and consistency despite individual node failures.
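The core idea behind TCP-style reliability can be sketched in a few lines: keep retransmitting until the receiver acknowledges. The lossy channel and its loss rate below are a made-up simulation, not a real network stack.

```python
import random

def unreliable_send(message):
    """Hypothetical lossy channel: drops roughly 30% of messages with no feedback."""
    if random.random() < 0.3:
        return None                 # message lost in transit
    return f"ACK:{message}"         # receiver acknowledges

def reliable_send(message, max_attempts=5):
    """Retransmit until acknowledged -- the same idea TCP layers on top of IP."""
    for attempt in range(1, max_attempts + 1):
        reply = unreliable_send(message)
        if reply is not None:
            return reply
    raise TimeoutError(f"no ACK for {message!r} after {max_attempts} attempts")

print(reliable_send("hello"))
```

Even `reliable_send` can give up if every attempt is lost: reliability built on unreliable components raises the probability of success, it cannot make it certain.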
Best Practices for Handling Failures
- Expect Failures: Assume nodes will fail.
- Use Redundancy: Replicate data and services.
- Design for Partial Failures: Avoid system-wide crashes.
- Monitor & Detect Failures: Health checks and alerts.
- Test Failure Scenarios: Use Chaos Engineering.
Unreliable Networks in Distributed Systems
Why Networks Are Unreliable
- Packets may be lost, delayed, or duplicated.
- Network partitions and congestion introduce uncertainty.
Handling Network Failures
- Timeouts: Assume a request has failed after a bounded wait (see the probe sketch after this list).
- Explicit Error Messages: Use explicit signals such as TCP RST packets or ICMP error messages where they are available.
- Retries: Be mindful of duplicate operations.
- Leader Election: Handle leader failures gracefully.
- Network Partitions: Use quorum-based decision-making.
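As a concrete illustration of timeout-based detection, here is a minimal probe that treats a bounded wait as the failure signal. The host, port, and timeout value are placeholders, and a timeout only means the node is *suspected* to have failed.

```python
import socket

def probe_node(host, port, timeout_s=1.0):
    """Try to open a TCP connection within a bounded wait."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True             # node answered in time
    except OSError:
        return False                # down, slow, or partitioned -- we cannot tell which
```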
Best Practices
- Assume failures will happen.
- Monitor network health.
- Test network failures (Chaos Engineering).
- Design idempotent operations to handle duplicate requests.
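A common way to make retries safe is an idempotency key: the client attaches a unique key to each logical operation, and the server replays the stored result for duplicates. This is a minimal in-memory sketch; the transfer operation is hypothetical, and a real service would persist the keys durably.

```python
import uuid

processed = {}  # idempotency key -> result (stand-in for a durable store)

def transfer(idempotency_key, account, amount):
    """Apply a payment at most once, even if the client retries the request."""
    if idempotency_key in processed:
        return processed[idempotency_key]        # duplicate request: replay the old result
    result = f"debited {amount} from {account}"  # the actual side effect would go here
    processed[idempotency_key] = result
    return result

key = str(uuid.uuid4())              # client generates one key per logical operation
print(transfer(key, "acct-42", 10))
print(transfer(key, "acct-42", 10))  # retry is harmless: same result, no double debit
```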
Timeouts and Unbounded Delays
Challenges
- Fixed timeouts can be too long (slow detection) or too short (false positives).
- Network congestion, TCP retransmissions, and virtualization add unpredictability.
Solutions
- Adaptive Timeouts: Adjust the timeout dynamically based on observed response times.
- Exponential Backoff: Increase the delay between retries (see the retry sketch after this list).
- Phi Accrual Failure Detection: Reports a continuous suspicion level rather than a binary up/down verdict; used in systems like Cassandra.
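A small sketch of retries with exponential backoff plus jitter, assuming a hypothetical operation that raises ConnectionError or TimeoutError on failure; the delay parameters are illustrative.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry with exponentially growing, jittered delays so clients do not retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise                                    # give up after the last attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))         # "full jitter" variant

# Usage (hypothetical flaky call):
# call_with_backoff(lambda: fetch_from_replica("user:42"))
```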
Unreliable Clocks in Distributed Systems
Problems with Clocks
- Clock Drift: Each machine's quartz clock runs at a slightly different rate, so clocks drift apart without synchronization.
- NTP Issues: Synchronization accuracy is limited by network delay, and NTP corrections can step a clock forwards or backwards.
- Leap Seconds: Can cause unexpected behavior.
Best Practices
- Use Monotonic Clocks for measuring time intervals, since they cannot jump backwards.
- Use Logical Clocks (Lamport clocks, vector clocks) to order events instead of relying on physical time (see the sketch after this list).
- Fencing Tokens prevent outdated clients from modifying data.
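A Lamport clock is small enough to sketch in full: every node keeps a counter, increments it on each event, and fast-forwards past any timestamp it receives, so causally later events always get larger timestamps.

```python
class LamportClock:
    """Logical clock: orders events by causality rather than by wall-clock time."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        """Local event or message send: advance the counter."""
        self.counter += 1
        return self.counter

    def receive(self, msg_timestamp):
        """Message receipt: jump past the sender's timestamp."""
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
t = a.tick()          # node A sends a message stamped 1
print(b.receive(t))   # node B's clock becomes 2: the receive is ordered after the send
```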
Process Pauses and Leader Election
Common Causes
- Garbage Collection (GC) Pauses
- Virtual Machine (VM) Suspension
- CPU Load and Slow I/O
Mitigation Strategies
- Avoid relying on time-of-day clocks for lease expiry; a paused process may wake up and act on a lease that has already expired (see the sketch after this list).
- Implement leader resignation protocols.
- Monitor and restart nodes after long pauses.
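A lease checked against the monotonic clock at least notices, after a long pause, that it has expired and the node must stop acting as leader. This is a simplified sketch with illustrative durations; on its own it still leaves a race between the check and the action, which is what fencing tokens (next section) address.

```python
import time

class Lease:
    """Lease expiry measured with a monotonic clock, which NTP cannot step backwards."""

    def __init__(self, duration_s):
        self.duration_s = duration_s
        self.renewed_at = time.monotonic()

    def renew(self):
        self.renewed_at = time.monotonic()

    def is_valid(self):
        return time.monotonic() - self.renewed_at < self.duration_s

lease = Lease(duration_s=10.0)
# ... imagine a long GC pause or VM suspension happens here ...
if not lease.is_valid():
    print("lease expired during the pause; stop acting as leader")
```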
Knowledge, Truth, and Lies in Distributed Systems
The Role of Quorums
- Decisions require agreement from a majority of nodes, so they remain correct even when some nodes fail.
- A node cannot trust its own judgment: after a pause or partition it may wrongly believe it is still active.
The Split-Brain Problem
- A node may think it's still the leader while another has taken over.
- Fencing Tokens prevent outdated clients from corrupting data.
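A fencing token scheme can be sketched as a storage service that remembers the highest token it has seen and rejects anything older. The token values and the in-memory store below are illustrative.

```python
class FencedStore:
    """Storage that rejects writes carrying a token older than one it has already seen."""

    def __init__(self):
        self.max_token_seen = 0
        self.data = {}

    def write(self, key, value, fencing_token):
        if fencing_token < self.max_token_seen:
            raise PermissionError(f"stale token {fencing_token}: a newer leader exists")
        self.max_token_seen = fencing_token
        self.data[key] = value

store = FencedStore()
store.write("x", "write from leader holding token 34", fencing_token=34)
store.write("x", "write from new leader, token 35", fencing_token=35)
try:
    store.write("x", "late write from the paused old leader", fencing_token=34)
except PermissionError as err:
    print(err)    # the stale leader's write is rejected
```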
Byzantine Faults
Understanding Byzantine Faults
- Nodes may behave arbitrarily, lying or sending corrupted messages, rather than simply crashing.
- Byzantine fault tolerance matters in settings such as blockchains and aerospace systems, where not every node can be trusted.
Ensuring Correctness
| Model | Assumptions |
|--------|------------|
| Synchronous | Fixed upper bounds on network delays |
| Partially Synchronous | Mostly behaves like a synchronous system, with occasional unbounded delays |
| Asynchronous | No assumptions about timing |
Best Practices
- Assume nodes can fail unpredictably.
- Use quorum-based voting to avoid split-brain scenarios (see the sketch after this list).
- Design for safety (correctness) and liveness (availability).
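Majority quorums are the standard guard against split brain: any two majorities overlap, so two conflicting decisions cannot both gather enough votes. A minimal illustration:

```python
def has_quorum(votes, cluster_size):
    """True only if a strict majority of all nodes (not just reachable ones) agrees."""
    return votes > cluster_size // 2

print(has_quorum(3, cluster_size=5))   # True: tolerates 2 failed or unreachable nodes
print(has_quorum(2, cluster_size=5))   # False: a minority must not decide on its own
```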