Bulkheads | Distributed Application Architecture Patterns

7.1 Bulkheads

Use logical partitions to isolate failures

This pattern is based on Bulkheads as described by Nygard [12, p. 98] and later by Newman [3, p. 400] and Microsoft [96].

“Bulkheads” are a concept borrowed from shipbuilding, where, among other things, they are used to divide the inner space into rooms and to prevent a breach in one part of the ship from leaking into the whole vessel [97]. Translating this concept to software architecture means introducing some kind of logical partitions into the system to prevent failures from affecting other parts of the system.

In essence, this is the Partitioning pattern (see § 6.2) applied for a different purpose – to increase resiliency instead of accommodating scalability.

7.1.1 Context

There are multiple interconnected parts of an application. A failure in a single part has the potential to cause failures in others, and it is desirable to keep some parts of the application running in such an event.

7.1.2 Solution

Create logical partitions in a system aimed at containing failures (see fig. 14). There are several levels where bulkheads can be implemented.

In an individual service, this can mean using different threads, processes or pools¹ for different tasks and independently managing their resource limits and priorities to prevent starvation.
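As a minimal sketch of the service-level case, the following Python snippet gives two kinds of task their own thread pools. The task names (`process_order`, `slow_report`) and the pool sizes are hypothetical and chosen only for illustration. The point is that a saturated report pool cannot starve order processing, because the two do not share workers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each task type gets its own pool and its own limit.
order_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="orders")
report_pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="reports")

release_reports = threading.Event()

def slow_report(n):
    # Simulates a task that blocks its worker until it is released.
    release_reports.wait(timeout=5)
    return f"report-{n}"

def process_order(n):
    return f"order-{n}"

# Saturate the report pool: one worker is blocked, two more tasks are queued.
report_futures = [report_pool.submit(slow_report, i) for i in range(3)]

# Orders still complete promptly because they run on separate workers.
order_results = [
    order_pool.submit(process_order, i).result(timeout=1) for i in range(3)
]

# Unblock the reports and collect their results.
release_reports.set()
report_results = [f.result(timeout=5) for f in report_futures]

order_pool.shutdown()
report_pool.shutdown()
```

Had both task types shared a single one-worker pool, the orders would have queued behind the blocked report.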

On the component level, this can be done by avoiding sharing resources between services (or between clients), i.e. removing single points of failure. This means if there is a failure in one of the resources – or one of its clients overloads it – it will not affect all its clients.
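Where a resource has to remain shared, one possible approximation is to cap how much of it each client may use. The sketch below assumes a hypothetical `Bulkhead` class built on a semaphore, with made-up client names and limits. A client that exhausts its own slots is rejected immediately, while other clients are unaffected.

```python
import threading
from contextlib import contextmanager

class BulkheadFullError(Exception):
    """Raised when a client's partition has no free slots."""

class Bulkhead:
    """Hypothetical per-client concurrency limit (not a library API)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def enter(self):
        # Fail fast instead of queueing, so an overloaded client cannot
        # tie up callers that belong to other partitions.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(self.name)
        try:
            yield
        finally:
            self._slots.release()

# Illustrative clients and limits for one shared resource.
bulkheads = {
    "checkout": Bulkhead("checkout", max_concurrent=2),
    "reviews": Bulkhead("reviews", max_concurrent=1),
}

def call_shared_resource(client, request):
    with bulkheads[client].enter():
        return f"{client}: {request}"

# While "reviews" occupies its only slot, a second "reviews" call is rejected,
# but "checkout" still gets through.
with bulkheads["reviews"].enter():
    try:
        call_shared_resource("reviews", "list")
        reviews_rejected = False
    except BulkheadFullError:
        reviews_rejected = True
    checkout_result = call_shared_resource("checkout", "pay")
```

Rejecting rather than waiting is a design choice of this sketch; a bounded wait with a timeout would be an equally valid variant.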

On the deployment level, this can mean running each bulkhead on different nodes, isolating hardware failures, or using virtualisation.

At the highest level, this means designing a system with bulkheads in mind, identifying mission-critical services and planning for potential failures so that as much of the system as possible continues operating independently – or with limited functionality.

7.1.3 Potential issues

Employing more bulkheads leads to higher resource usage and increased complexity. It may also be difficult to introduce them into a tightly coupled system.

7.1.4 Example

ExampleEshop has defined mission-critical functions that need to be kept running at all costs, as outages would result in significant financial losses. These include basic product page functionality, basic search and order processing. To achieve this, they are deployed separately from other services and have their own dedicated resources. They communicate with their dependencies using Circuit Breakers. If other features such as the recommendation service, full-text search or product reviews are unavailable, the mission-critical functions continue to operate, either by using older, cached data or by providing a degraded experience.
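The fallback behaviour in this example could look roughly like the following sketch. The function `render_product_page`, the injected `fetch_recommendations` callable and the dictionary used as a cache are all assumptions made for illustration, not part of ExampleEshop's actual design.

```python
def render_product_page(product_id, fetch_recommendations, cache):
    """Build a product page that survives a recommendation outage.

    fetch_recommendations and cache are hypothetical stand-ins for a
    remote dependency and a local store of its last known good answers.
    """
    page = {"product_id": product_id, "degraded": False}
    try:
        recommendations = fetch_recommendations(product_id)
        cache[product_id] = recommendations  # remember the last good answer
    except Exception:
        # Broad catch for brevity: any failure of the non-critical dependency
        # falls back to older cached data, or to an empty list if none exists.
        recommendations = cache.get(product_id, [])
        page["degraded"] = True
    page["recommendations"] = recommendations
    return page


def unavailable(product_id):
    raise ConnectionError("recommendation service is down")

cache = {42: ["older-suggestion"]}
from_cache = render_product_page(42, unavailable, cache)
without_cache = render_product_page(7, unavailable, cache)
healthy = render_product_page(7, lambda pid: ["fresh-suggestion"], cache)
```

The page itself is always returned; only the optional part of it degrades.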

7.1.5 Related patterns

Retry might prevent another bulkhead from failing if the failure is only transient.
Circuit Breakers can be used to automatically “seal” a bulkhead, speeding up fault handling and allowing the affected bulkhead to recover.
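To illustrate how a Circuit Breaker can "seal" a bulkhead, here is a deliberately simplified sketch. The `CircuitBreaker` class, its parameter names and its thresholds are hypothetical, and the injectable `clock` exists only to make the behaviour easy to exercise. Production implementations typically add thread safety and a more careful half-open state.

```python
import time

class CircuitOpenError(Exception):
    """Raised when calls are rejected because the breaker is open."""

class CircuitBreaker:
    """Hypothetical, single-threaded breaker (not a library API)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Sealed: fail fast without touching the failing partition.
                raise CircuitOpenError("bulkhead is sealed")
            # Cool-down elapsed: allow a trial call; one more failure reopens.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

While the breaker is open, callers get an immediate error instead of waiting on the failing partition, which gives that partition time to recover.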

¹ Pools in this context mean any kind of pooled resource, e.g. worker pools [98].

