Distributed Application Architecture Patterns

6.2 Partitioning (Sharding)

Scale out by separating tasks or data into logical partitions

This pattern is based on two distinct but architecturally similar groups of patterns.

  1. Partitioning data is typically discussed in the context of databases as sharding [3, p. 426]: Wilder’s Database Sharding [14, p. 67], Burns’ Sharded Services [22, p. 59], and Sharding by Microsoft [79]

  2. Partitioning tasks goes by different names, usually as a solution to the ordering issue of Competing Consumers: Sequential Convoy by Microsoft [80] or Richardson’s sharded channels [20, p. 94]

6.2.1 Context

The data store or a specific task has become a bottleneck, or it exceeds the capacity of a single machine. Still, the data or tasks can be distributed uniformly using some partitioning logic.

6.2.2 Solution

Instead of creating one shared data store or service, create multiple instances and divide the data or tasks between them (see fig. 6.2).

Figure 10: Partitioning (Sharding)

Since each partition is managed by exactly one instance, the order of operations is guaranteed within a partition, in contrast to Competing Consumers (see § 6.1).
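
The routing itself can be as small as a stable hash of a partition key. The sketch below is only an illustration (the function and key names are made up), but it shows why the ordering guarantee holds: every key maps to exactly one partition, so all operations sharing a key are handled by the same instance.

    import hashlib

    def partition_for(key: str, partition_count: int) -> int:
        # A stable hash (unlike Python's built-in hash(), which is randomized
        # per process) maps every key to exactly one partition.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % partition_count

    PARTITION_COUNT = 4

    # Both operations on "item-42" map to the same partition, so the single
    # instance owning that partition sees them in the order they were produced.
    for operation, key in [("create", "item-42"), ("update", "item-42"), ("create", "item-7")]:
        print(operation, key, "-> partition", partition_for(key, PARTITION_COUNT))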

If partitioning data, choose a fixed number of partitions up front to avoid repartitioning the data if the number of data stores changes [5, p. 219].
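
One way to apply this advice, sketched below under the assumption of a simple in-memory assignment table (all names are illustrative), is to hash keys into a fixed number of logical partitions and keep a separate partition-to-store assignment. When stores are added or removed, whole partitions are moved and only the assignment changes; keys are never rehashed.

    import hashlib

    LOGICAL_PARTITIONS = 256  # fixed up front, independent of the number of stores

    def logical_partition(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % LOGICAL_PARTITIONS

    def assign_partitions(stores: list[str]) -> dict[int, str]:
        # Round-robin assignment of logical partitions to physical stores.
        # Changing the store list changes only this mapping, never the
        # key-to-partition function above.
        return {p: stores[p % len(stores)] for p in range(LOGICAL_PARTITIONS)}

    assignment = assign_partitions(["store-a", "store-b", "store-c"])
    print(assignment[logical_partition("user-1234")])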

6.2.3 Potential issues

Designing the partitioning logic can be complex and is fundamental to the system’s performance. Choosing an incorrect scheme might require expensive repartitioning [3, p. 429].
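
To make that cost concrete, the toy calculation below (illustrative only) shows what happens with a naive scheme that hashes keys directly against the current number of stores: adding a single store remaps the vast majority of keys.

    import hashlib

    def store_for(key: str, store_count: int) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % store_count

    keys = [f"key-{i}" for i in range(10_000)]
    moved = sum(store_for(k, 4) != store_for(k, 5) for k in keys)
    print(f"{moved / len(keys):.0%} of keys move when growing from 4 to 5 stores")
    # With hash-modulo-store-count, roughly 80% of keys change stores,
    # which amounts to repartitioning almost the entire data set.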

Since each partition is managed by exactly one instance, additional resiliency measures are needed to handle the failure of that instance.

6.2.4 Example

ExampleEshop receives an influx of reviews during major events. These reviews must be scanned for inappropriate content or spam before they are displayed and then compressed, which can take some time if a review contains images or videos. To handle the increased load, the system uses Competing Consumers. However, after introducing the ability to update reviews, the team noticed a rise in failures.

The failures were caused by users who found mistakes in their text shortly after posting a review containing a video. Such an edit contained only the updated text, so it was processed much faster by a different consumer, and saving the edit failed because the original review did not exist yet.

ExampleEshop could fix this by disabling edits until a review has been screened, but users might then forget to edit their reviews later, decreasing their overall quality and trustworthiness. Additional logic could be introduced to handle the ordering, but it would have to be managed by the queue. Degrading to a single consumer was not an option, as the influx of reviews could easily overwhelm it. Instead, the team decided to partition the review processors by item, assigning each item randomly to a single processor. This way, the system’s ability to scale was not altered significantly, but each review event was guaranteed to be processed in order.
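
A minimal sketch of how such routing could look (the names and the queue interface are made up, and a stable hash stands in for the random assignment described above): keying on the item identifier sends the original review and any later edits for that item to the same processor, so they are handled in order.

    import hashlib
    from queue import Queue

    PROCESSOR_COUNT = 8
    processor_queues = [Queue() for _ in range(PROCESSOR_COUNT)]  # one queue per processor

    def processor_for(item_id: str) -> int:
        digest = hashlib.sha256(item_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % PROCESSOR_COUNT

    def publish_review_event(item_id: str, event: dict) -> None:
        # The original review and all of its edits carry the same item_id,
        # so they land on the same processor and are handled in publish order.
        processor_queues[processor_for(item_id)].put(event)

    publish_review_event("item-42", {"type": "review.created", "video": True})
    publish_review_event("item-42", {"type": "review.edited", "text": "fixed typo"})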

6.2.6 Further reading

6.2.7 Deployment Stamps

An interesting sub-case of partitioning is partitioning users with Microsoft’s Deployment Stamps [83]. If the system’s use cases allow for partitioning users into groups that do not have to share data, creating a distributed system might not be necessary at all (see fig. 6.2.7).

Figure 11: Deployment Stamps

Partitions corresponding to geographic regions inherently facilitate low latency (until the users travel).
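
Routing a user to their stamp can then be a trivial lookup. The sketch below is purely illustrative (the stamp URLs and the home_region attribute are assumptions):

    # Hypothetical mapping of geographic regions to independent stamps.
    STAMPS = {
        "eu": "https://eu.shop.example",
        "us": "https://us.shop.example",
        "apac": "https://apac.shop.example",
    }

    def stamp_for_user(user: dict) -> str:
        # Users assigned to one stamp never share data with users of another
        # stamp, so each stamp can run as a small, self-contained deployment.
        return STAMPS[user["home_region"]]

    print(stamp_for_user({"id": "user-1234", "home_region": "eu"}))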