DDIA Reading Notes - Chapter 6. Partitioning

Feb 14, 2025

Partitioning vs. Replication

Partitioning: Splitting a large dataset into smaller subsets (shards) for scalability (distributing data across disks and queries across processors).

Replication: Maintaining multiple copies of data across nodes for availability and fault tolerance.

Partitioning of Key-Value Data

Partitioning by Key Range

Advantages:

  1. Efficient Range Scans

    • "Range scans are easy, and you can treat the key as a concatenated index to fetch several related records in one query."

    • Contiguous key ranges (e.g., A-D, E-G) enable linear traversal of partitions.

    • Examples:

      • Querying logs within a timestamp range (e.g., 2023-01-01 to 2023-01-31).

      • Fetching all orders for a customer user123 sorted by date.

  2. Concatenated Keys for Related Records

    • Keys can encode multiple attributes (e.g., user_id:timestamp), allowing multi-dimensional queries.

    • Example:

      • Querying sensor data by device_id and a time range (device_xyz:2023-01-01 to device_xyz:2023-01-15); see the sketch below.
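
A minimal sketch of both ideas in Python, assuming made-up partition boundaries and sensor keys: a sorted list of boundary keys assigns each key to a range partition, and concatenated device_id:timestamp keys make a time-range query for one device a contiguous slice of the sorted key space.

    import bisect

    # Key-range partitioning: partition i holds keys that sort below boundaries[i];
    # the last partition holds everything else. Boundaries are illustrative only.
    boundaries = ["e", "k", "p", "u"]

    def partition_for(key):
        # Binary search over the sorted boundaries gives the partition index (0..4).
        return bisect.bisect_right(boundaries, key)

    # Concatenated keys: device_id + ":" + ISO timestamp sort so that all records
    # for one device are contiguous and ordered by time.
    keys = sorted([
        "device_xyz:2023-01-03", "device_xyz:2023-01-10",
        "device_abc:2023-01-05", "device_xyz:2023-02-01",
    ])

    # Range scan for device_xyz between 2023-01-01 and 2023-01-15: a contiguous
    # slice of the sorted key space, no scatter-gather needed.
    lo = bisect.bisect_left(keys, "device_xyz:2023-01-01")
    hi = bisect.bisect_left(keys, "device_xyz:2023-01-15")
    print(keys[lo:hi])   # ['device_xyz:2023-01-03', 'device_xyz:2023-01-10']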

Drawbacks:

  1. Data Skew

    • "The ranges of keys are not necessarily evenly spaced, because your data may not be evenly distributed."

    • Manual or fixed-range partitioning (e.g., splitting by alphabet) can lead to uneven data distribution.

    • Example:

      • "Simply having one volume per two letters of the alphabet would lead to some volumes being much bigger than others."

      • Partitioning by A-D and E-G might result in vastly different data volumes if most keys start with A.

  2. Hotspots

    • Sequential or time-ordered keys concentrate load on specific partitions.

    • Example:

      • Time-series data (e.g., logs with timestamps) creates a hotspot for the latest partition (e.g., "today’s partition").

      • Frequent writes/reads to this partition overload a single node, while others remain underutilized.

Partitioning by Hash of Key

Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key. A good hash function takes skewed data and makes it uniformly distributed.
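
A minimal sketch of hash partitioning, assuming 8 partitions and MD5 as the hash (chosen because it is fast and gives the same result in every process, not for security). Many stores assign each partition a contiguous range of hash values; the modulo here is just the simplest way to sketch the routing.

    import hashlib

    NUM_PARTITIONS = 8   # illustrative

    def partition_for(key: str) -> int:
        # Hash the key to a large integer, then reduce it to a partition number.
        # A good hash spreads even very similar keys evenly across partitions.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % NUM_PARTITIONS

    for key in ["user123", "user124", "user125"]:
        print(key, "->", partition_for(key))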

Advantages

  1. Load Balancing

    • "Even if the input strings are very similar, their hashes are evenly distributed."

    • Mitigates skew and hotspots caused by uneven key distributions.

  2. Simplicity

    • The hash function need not be cryptographically strong; for example, Cassandra and MongoDB use MD5.

Drawbacks

  1. Loss of Range Queries

    • "By using the hash of the key, we lose the efficient range scan property of key-range partitioning."

    • Example: MongoDB’s hash-based sharding requires scatter-gather for range queries.

  2. Limitations of Built-in Hash Functions

    • Language-specific hash functions (e.g., Java’s Object.hashCode(), Ruby’s Object#hash) may produce different values for the same key in different processes, so they are unsuitable for routing keys to partitions (see the sketch below).
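
Python has the same issue: the built-in hash() for strings is salted with a per-process random seed, so it cannot be used to route keys consistently. A small illustration, with an MD5-based value as the kind of stable hash partitioning needs:

    import hashlib

    key = "user123"

    # Python's built-in hash() for str is salted per process (PYTHONHASHSEED),
    # so two interpreter runs usually disagree on the value, which makes it
    # unusable for deciding which node a key lives on.
    print("built-in hash():", hash(key))   # changes from run to run

    # An explicit, well-defined hash such as MD5 yields the same value in every
    # process and language, which is what partition routing requires.
    stable = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    print("md5-based hash :", stable)      # identical everywhere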

Cassandra’s Compound Key Strategy

  • Use a compound primary key (e.g., (partition_key, clustering_columns)).

  • Only the first column is hashed for partitioning.

  • Subsequent columns enable efficient range scans within a partition.

  • Example:

    • Partition by user_id (hashed), then sort by timestamp to enable range scans within the partition (see the sketch below).
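
A rough Python sketch of the idea (not Cassandra’s actual implementation): the partition is chosen by hashing only user_id, and rows inside a partition are kept sorted by the clustering column (timestamp), so a time-range query for one user touches a single partition.

    import bisect
    import hashlib
    from collections import defaultdict

    NUM_PARTITIONS = 4                                        # illustrative
    partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]

    def partition_for(user_id: str) -> int:
        digest = hashlib.md5(user_id.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % NUM_PARTITIONS

    def insert(user_id: str, timestamp: str, value: str) -> None:
        rows = partitions[partition_for(user_id)][user_id]
        bisect.insort(rows, (timestamp, value))               # sorted by clustering column

    def range_scan(user_id: str, start: str, end: str):
        # All rows for user_id live in one partition, already sorted by timestamp,
        # so this is an efficient single-partition range scan.
        rows = partitions[partition_for(user_id)][user_id]
        lo = bisect.bisect_left(rows, (start, ""))
        hi = bisect.bisect_left(rows, (end, ""))
        return rows[lo:hi]

    insert("user123", "2023-01-02", "login")
    insert("user123", "2023-01-20", "purchase")
    print(range_scan("user123", "2023-01-01", "2023-01-15"))  # [('2023-01-02', 'login')]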

Skewed Workloads and Relieving Hot Spots

Even with hash-based partitioning, extreme cases can still cause hot spots (e.g., when all reads and writes target the same key, such as a celebrity user’s ID on a social media site).

Solution: Introduce randomization to distribute load more evenly across partitions.

  • Add a random number (e.g., a two-digit decimal) as a prefix or suffix to the key.

  • This spreads writes across multiple keys (e.g., 23_user123 instead of user123), allowing the data to be distributed across different partitions (see the sketch below).
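
A minimal sketch of the salting trick, assuming a two-digit decimal salt (00 to 99) and an MD5-based partition function; the key names and partition count are made up for illustration.

    import hashlib
    import random

    NUM_PARTITIONS = 8        # illustrative
    NUM_SALTS = 100           # two decimal digits: 00..99

    def partition_for(key: str) -> int:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % NUM_PARTITIONS

    def salted_write_key(hot_key: str) -> str:
        # Each write picks a random two-digit prefix, e.g. "23_user123" instead of
        # "user123", so writes to the hot key spread over up to 100 distinct keys
        # and therefore over many partitions.
        return f"{random.randrange(NUM_SALTS):02d}_{hot_key}"

    def keys_to_read(hot_key: str) -> list:
        # The trade-off: a read must now fetch every salted variant and merge them.
        return [f"{salt:02d}_{hot_key}" for salt in range(NUM_SALTS)]

    k = salted_write_key("user123")
    print(k, "-> partition", partition_for(k))
    print(len(keys_to_read("user123")), "keys to read and combine")

Because reads now have to query all the salted keys and combine the results, this technique only makes sense for the small number of keys known to be hot, and it requires some bookkeeping of which keys have been split.
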
Jiayi Li