DDIA Reading Notes - Chapter 6. Partitioning

Feb 14, 2025

Partitioning vs. Replication

Partitioning: Splitting a large dataset into smaller subsets (shards) for scalability (distributing data across disks and queries across processors).

Replication: Maintaining multiple copies of data across nodes for availability and fault tolerance.

Partitioning of Key-Value Data

Partitioning by Key Range

Advantages:

  1. Efficient Range Scans

    • "Range scans are easy, and you can treat the key as a concatenated index to fetch several related records in one query."

    • Contiguous key ranges (e.g., A-D, E-G) enable linear traversal of partitions.

    • Examples:

      • Querying logs within a timestamp range (e.g., 2023-01-01 to 2023-01-31).

      • Fetching all orders for a customer user123 sorted by date.

  2. Concatenated Keys for Related Records

    • Keys can encode multiple attributes (e.g., user_id:timestamp), allowing multi-dimensional queries.

    • Example:

      • Querying sensor data by device_id and a time range (device_xyz:2023-01-01 to device_xyz:2023-01-15); see the sketch below.
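
A minimal sketch of both ideas in Python, assuming made-up partition boundaries and sensor keys: a sorted list of boundary keys assigns each key to a range partition, and concatenated device_id:timestamp keys make a time-range query for one device a contiguous slice of the sorted key space.

    import bisect

    # Key-range partitioning: partition i holds keys that sort below boundaries[i];
    # the last partition holds everything else. Boundaries are illustrative only.
    boundaries = ["e", "k", "p", "u"]

    def partition_for(key):
        # Binary search over the sorted boundaries gives the partition index (0..4).
        return bisect.bisect_right(boundaries, key)

    # Concatenated keys: device_id + ":" + ISO timestamp sort so that all records
    # for one device are contiguous and ordered by time.
    keys = sorted([
        "device_xyz:2023-01-03", "device_xyz:2023-01-10",
        "device_abc:2023-01-05", "device_xyz:2023-02-01",
    ])

    # Range scan for device_xyz between 2023-01-01 and 2023-01-15: a contiguous
    # slice of the sorted key space, no scatter-gather needed.
    lo = bisect.bisect_left(keys, "device_xyz:2023-01-01")
    hi = bisect.bisect_left(keys, "device_xyz:2023-01-15")
    print(keys[lo:hi])   # ['device_xyz:2023-01-03', 'device_xyz:2023-01-10']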

Drawbacks:

  1. Data Skew

    • "The ranges of keys are not necessarily evenly spaced, because your data may not be evenly distributed."

    • Manual or fixed-range partitioning (e.g., splitting by alphabet) can lead to uneven data distribution.

    • Example:

      • "Simply having one volume per two letters of the alphabet would lead to some volumes being much bigger than others."

      • Partitioning by A-D and E-G might result in vastly different data volumes if most keys start with A.

  2. Hotspots

    • Sequential or time-ordered keys concentrate load on specific partitions.

    • Example:

      • Time-series data (e.g., logs with timestamps) creates a hotspot for the latest partition (e.g., "today’s partition").

      • Frequent writes/reads to this partition overload a single node, while others remain underutilized.

Partitioning by Hash of Key

Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key. A good hash function takes skewed data and makes it uniformly distributed.
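
A minimal sketch of hash partitioning, assuming 8 partitions and MD5 as the hash (chosen because it is fast and gives the same result in every process, not for security). Many stores assign each partition a contiguous range of hash values; the modulo here is just the simplest way to sketch the routing.

    import hashlib

    NUM_PARTITIONS = 8   # illustrative

    def partition_for(key: str) -> int:
        # Hash the key to a large integer, then reduce it to a partition number.
        # A good hash spreads even very similar keys evenly across partitions.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % NUM_PARTITIONS

    for key in ["user123", "user124", "user125"]:
        print(key, "->", partition_for(key))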

Advantages

  1. Load Balancing

    • "Even if the input strings are very similar, their hashes are evenly distributed."

    • Mitigates skew and hotspots caused by uneven key distributions.

  2. Simplicity

    • The hash function need not be cryptographically strong; for example, Cassandra and MongoDB use MD5.

Drawbacks

  1. Loss of Range Queries

    • "By using the hash of the key, we lose the efficient range scan property of key-range partitioning."

    • Example: MongoDB’s hash-based sharding requires scatter-gather for range queries.

  2. Limitations of Built-in Hash Functions

    • Language-specific hash functions (e.g., Java’s Object.hashCode(), Ruby’s Object#hash) may produce different values for the same key in different processes, so they are unsuitable for routing keys to partitions (see the sketch below).
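
Python has the same issue: the built-in hash() for strings is salted with a per-process random seed, so it cannot be used to route keys consistently. A small illustration, with an MD5-based value as the kind of stable hash partitioning needs:

    import hashlib

    key = "user123"

    # Python's built-in hash() for str is salted per process (PYTHONHASHSEED),
    # so two interpreter runs usually disagree on the value, which makes it
    # unusable for deciding which node a key lives on.
    print("built-in hash():", hash(key))   # changes from run to run

    # An explicit, well-defined hash such as MD5 yields the same value in every
    # process and language, which is what partition routing requires.
    stable = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    print("md5-based hash :", stable)      # identical everywhere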

Cassandra’s Compound Key Strategy

  • Use a compound primary key (e.g., (partition_key, clustering_columns)).

  • Only the first column is hashed for partitioning.

  • Subsequent columns enable efficient range scans within a partition.

  • Example:

    • Partition by user_id (hashed), then sort by timestamp to enable range scans within the partition (see the sketch below).
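
A rough Python sketch of the idea (not Cassandra’s actual implementation): the partition is chosen by hashing only user_id, and rows inside a partition are kept sorted by the clustering column (timestamp), so a time-range query for one user touches a single partition.

    import bisect
    import hashlib
    from collections import defaultdict

    NUM_PARTITIONS = 4                                        # illustrative
    partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]

    def partition_for(user_id: str) -> int:
        digest = hashlib.md5(user_id.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % NUM_PARTITIONS

    def insert(user_id: str, timestamp: str, value: str) -> None:
        rows = partitions[partition_for(user_id)][user_id]
        bisect.insort(rows, (timestamp, value))               # sorted by clustering column

    def range_scan(user_id: str, start: str, end: str):
        # All rows for user_id live in one partition, already sorted by timestamp,
        # so this is an efficient single-partition range scan.
        rows = partitions[partition_for(user_id)][user_id]
        lo = bisect.bisect_left(rows, (start, ""))
        hi = bisect.bisect_left(rows, (end, ""))
        return rows[lo:hi]

    insert("user123", "2023-01-02", "login")
    insert("user123", "2023-01-20", "purchase")
    print(range_scan("user123", "2023-01-01", "2023-01-15"))  # [('2023-01-02', 'login')]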

Skewed Workloads and Relieving Hot Spots

Even with hash-based partitioning, extreme cases can still cause hot spots (e.g., when all reads and writes target the same key, such as a celebrity user’s ID on a social media site).

Solution: Introduce randomization to distribute load more evenly across partitions.

  • Add a random number (e.g., a two-digit decimal) as a prefix or suffix to the key.

  • This spreads writes across multiple keys (e.g., 23_user123 instead of user123), allowing the data to be distributed across different partitions (see the sketch below).
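
A minimal sketch of the salting trick, assuming a two-digit decimal salt (00 to 99) and an MD5-based partition function; the key names and partition count are made up for illustration.

    import hashlib
    import random

    NUM_PARTITIONS = 8        # illustrative
    NUM_SALTS = 100           # two decimal digits: 00..99

    def partition_for(key: str) -> int:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % NUM_PARTITIONS

    def salted_write_key(hot_key: str) -> str:
        # Each write picks a random two-digit prefix, e.g. "23_user123" instead of
        # "user123", so writes to the hot key spread over up to 100 distinct keys
        # and therefore over many partitions.
        return f"{random.randrange(NUM_SALTS):02d}_{hot_key}"

    def keys_to_read(hot_key: str) -> list:
        # The trade-off: a read must now fetch every salted variant and merge them.
        return [f"{salt:02d}_{hot_key}" for salt in range(NUM_SALTS)]

    k = salted_write_key("user123")
    print(k, "-> partition", partition_for(k))
    print(len(keys_to_read("user123")), "keys to read and combine")

Because reads now have to query all the salted keys and combine the results, this technique only makes sense for the small number of keys known to be hot, and it requires some bookkeeping of which keys have been split.
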
Jiayi Li