Saga | Distributed Application Architecture Patterns

10.3 Saga

Sacrifice isolation for increased availability during a distributed transaction

This pattern is based on Saga by García-Molina et al. [172]. Although originally intended to implement long-living transactions (LLTs)¹, it has been interpreted for use in distributed systems in general, see Rotem-Gal-Oz [13, p. 129], Richardson [20, p. 114, 173], Newman [3, p. 182], Richards et al. [2, p. 261], Microsoft [174], and Deenadayalan [175].

10.3.1 Context

Multiple services need to participate in a distributed transaction of a business process, but other ACID algorithms, such as a two-phase commit, are not an option (see the context of § 10.1 for more details). Not having the transaction isolated is an acceptable trade-off.

10.3.2 Solution

Break the long-lived transaction into smaller short-lived transactions. In the case of a business failure, the system can apply either:

Forward recovery, by retrying the transaction from a particular save-point, if the failed operation was idempotent
Backward recovery, by running compensating transactions to undo the operations that were already performed
Combine both approaches based on the context (see fig. 29) [3, p. 189]

Executing steps with a high probability of failure can decrease the amount of compensating transactions needed in case of a failure [3, p. 188].

There are two possible ways to implement a saga: Choreography, an event-driven approach suitable primarily for simpler sagas (see § 10.4), or Orchestration (see § 10.5). It may also be possible to nest these two styles based on the requirements of the process [3, p. 194, 176].

10.3.3 Potential issues

If possible, always avoid distributed transactions, as it introduces a large amount of additional complexity to the system [2, pp. 132, 260, 3, p. 181, 20, p. 114].

Not all business operations can be fully rolled back. The compensating transactions thus have to be semantic rollbacks, i.e. roll back enough in the context of the saga [3, p. 187]. These can be difficult to design [177, p. 18:20].

Since the saga is only ACD², isolation needs to be handled separately [see 20, p. 126 for common approaches].

A common misconception is that a saga can handle technical failures. Different mechanisms are needed to handle such cases, such as Retry (see § 7.3) or sacrificing availability, e.g. with a two-phase commit [178].

10.3.4 Example

Order processing in ExampleEshop is a complex process that involves multiple services. When a user places an order, the system needs to check the availability of the products, reserve them, charge the user, notify the warehouse, and send a confirmation email. If any of these steps fail, the system needs to roll back the changes and notify the user.

The steps in the saga are ordered to minimize the compensating transactions needed by running the step most likely to fail first.

Charge the user (checkpoint)
Send a confirmation email
Reserve all products
Send them out for delivery

The payment marks a checkpoint from which the system proceeds with forward recovery. If not all products are in stock, the system can notify the user and retry once the products are available again. Alternatively, if the customer decides to cancel the order, the system can undo the reservation and issue a refund.

Event Sourcing (see § 10.6) can be used to implement compensating transactions [2, p. 132]

10.3.6 Further reading

Compensating Transaction by Microsoft [95]
Long-running Transactions [179] and Compensating Transactions [180] on Wikipedia

Transactions locking resources in a database for a long time, causing contention↩︎
Missing the “I” from ACID ↩︎

Distributed Application Architecture Patterns

An unopinionated catalogue of the status quo