6.2 Partitioning (Sharding)
Scale out by separating tasks or data into logical partitions.
This pattern is based on two distinct but architecturally similar groups of patterns:
Partitioning data is typically discussed in the context of databases as sharding [3, p. 426] – Wilder’s Database Sharding [14, p. 67], Burns’ Sharded Services [22, p. 59] and Sharding by Microsoft [79]
Partitioning tasks goes by different names, usually as a solution to the ordering issue of Competing Consumers – Sequential Convoy by Microsoft [80] or Richardson’s sharded channels [20, p. 94]
6.2.1 Context
The data store or a specific task has become a bottleneck, or it exceeds the capacity of a single machine. Still, the data or tasks can be distributed uniformly using some partitioning logic.
6.2.2 Solution
Instead of creating one shared data store or service, create multiple instances and divide the data or tasks between them (see fig. 6.2).
Since each partition is managed by exactly one instance, the order of operations is guaranteed within a partition, in contrast to Competing Consumers (see § 6.1).
If partitioning data, choose a fixed number of partitions to avoid repartitioning the data if the number of data stores changes [5, p. 219].
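A minimal sketch of this partitioning logic in Python, assuming a hash-based mapping from keys to a fixed number of logical partitions; the names (FIXED_PARTITION_COUNT, instance_for, the example instances) are illustrative and not taken from the cited sources:

import hashlib

# Fixed up front and independent of the current number of instances.
FIXED_PARTITION_COUNT = 64

def partition_for(key: str) -> int:
    """Map a key to one of the fixed logical partitions."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % FIXED_PARTITION_COUNT

def instance_for(key: str, instances: list[str]) -> str:
    """Assign each logical partition to exactly one instance.

    Changing the number of instances only moves whole partitions around;
    which partition a key belongs to never changes.
    """
    return instances[partition_for(key) % len(instances)]

# Every operation on the same key lands on the same instance, so the
# operations within a partition stay ordered.
print(instance_for("order-42", ["store-a", "store-b", "store-c"]))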
6.2.3 Potential issues
Designing the partitioning logic can be complex and is fundamental to the system’s performance; choosing an incorrect scheme might require expensive repartitioning [3, p. 429].
Since each partition is managed by exactly one instance, additional resiliency measures are needed to handle instance failures.
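To make the repartitioning cost concrete, the sketch below (illustrative, not from the cited sources) counts how many keys change instance when a fourth instance is added under a naive hash(key) % instance_count scheme; with a fixed number of partitions, only whole partitions would move:

def naive_instance_for(key: str, instance_count: int) -> int:
    # Naive scheme: the mapping depends directly on the instance count.
    return hash(key) % instance_count

keys = [f"review-{i}" for i in range(10_000)]
moved = sum(1 for k in keys
            if naive_instance_for(k, 3) != naive_instance_for(k, 4))

# Roughly three quarters of the keys end up on a different instance,
# so almost all data would have to be copied between stores.
print(f"{moved / len(keys):.0%} of keys relocated")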
6.2.4 Example
ExampleEshop receives an influx of reviews during major events. These reviews must be scanned for inappropriate content or spam before they are displayed, and then compressed, which can take some time if the review contains images or videos. To handle the increased load, the system uses Competing Consumers. However, after introducing the ability to update reviews, ExampleEshop noticed a rise in failures.
The failures were caused by users spotting mistakes in their text shortly after posting a review that contained a video. Such an edit carried only the updated text, so a different consumer processed it much faster than the original review and failed to save the edit, because the review it referred to had not been stored yet.
ExampleEshop could fix this by disabling edits until a review has been screened, but users might then forget to edit their reviews later, decreasing their overall quality and trustworthiness. Additional ordering logic could be introduced, but it would have to be managed by the queue. Degrading to a single consumer was not an option, as the influx of reviews could easily overwhelm it. Instead, ExampleEshop decided to partition the review processors by item, assigning each item at random to a single processor. This way, the system’s ability to scale was not altered significantly, but the events for each review were guaranteed to be processed in order.
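A sketch of how ExampleEshop’s routing could look, assuming each review event carries an item identifier and each processor consumes its own queue. The in-memory queues and the hash-based assignment (instead of the random assignment mentioned above) are illustrative simplifications; any stable item-to-processor assignment works:

import hashlib
from collections import defaultdict

PROCESSOR_COUNT = 8  # number of review-processor instances, illustrative

def processor_for(item_id: str) -> int:
    # All events for the same item are routed to the same processor.
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PROCESSOR_COUNT

# Stand-in for per-processor queues; a real system would use a
# partitioned message broker rather than in-memory lists.
queues: dict[int, list[dict]] = defaultdict(list)

def publish(event: dict) -> None:
    queues[processor_for(event["item_id"])].append(event)

publish({"item_id": "item-17", "type": "review_created", "body": "Great item, video attached"})
publish({"item_id": "item-17", "type": "review_updated", "body": "Great item, typo fixed"})
# Both events end up in the same queue, so the edit can never be
# processed before the review it modifies.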
6.2.5 Related patterns
Even though partitioning does not increase the system’s overall resiliency per se, it can still be used to prevent cascading failures. Introducing logical partitions for this purpose is known as the Bulkhead pattern (see § 7.1)
An Ambassador can be used to implement the sharding logic on the client side (see § 5.2) [22, p. 22]
6.2.6 Further reading
Newman on data partitioning in [3, p. 426]
Joshi’s Fixed Partitions pattern if the number of participating data stores can vary [5, p. 219], and Key-Range Partitions to optimise for range queries [5, p. 243]
Shard (database architecture) on Wikipedia [81]. Also see [82] for a similar concept in storage systems
6.2.7 Deployment Stamps
An interesting sub-case of partitioning is partitioning users with Microsoft’s Deployment Stamps [83]. If the system’s use cases allow users to be partitioned into groups that do not have to share data, creating a distributed system might not be necessary at all (see fig. 6.2.7).
Partitions corresponding to geographic regions inherently facilitate low latency (until the users travel).
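A minimal sketch of stamp routing, assuming each user record stores a home region and each region maps to an independent, self-contained deployment; the region names and URLs are purely illustrative:

# Each stamp is a full, independent deployment of the system; users in
# one stamp never need data held by another stamp.
STAMPS = {
    "eu": "https://eu.example-eshop.example",
    "us": "https://us.example-eshop.example",
    "apac": "https://apac.example-eshop.example",
}

def stamp_for(user_region: str) -> str:
    # Fall back to a default stamp for regions without their own deployment.
    return STAMPS.get(user_region, STAMPS["eu"])

print(stamp_for("apac"))  # -> https://apac.example-eshop.example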