Distributed Application Architecture Patterns

7.3 Retry

Do not fail because of transient errors

This pattern is based on Retry by Microsoft [101], Retry with backoff by Deenadayalan [102], Request-Response with Retry by Hohpe [103], At-Least-Once delivery by Fehling et al. [15, p. 144, 104] and Request/Reaction by Rotem-Gal-Oz [13, p. 114], and partly on the Busy Signal pattern by Wilder [14, p. 83].

7.3.1 Context

A service sends a request to an external resource, and the request is either idempotent or protected by some form of duplicate handling. A failure would harm the currently running operation more than the latency added by retrying. The response either explicitly signalled that the failure is transient (e.g. with a busy signal [14, p. 83]), or the failure is likely to be transient.

7.3.2 Solution

When a failure occurs, retry the request. There are several options for when to retry and for how long to keep retrying.

If latency is an issue, e.g. in an interactive situation, and this type of failure is known to be transient or to resolve quickly, retry immediately, after a fixed delay, or after a linearly increasing delay, depending on the balance needed between responsiveness and load on the resource.

If this is a long-running operation, an exponentially increasing delay (also known as exponential backoff [105]) is more suitable, because it helps to avoid collisions and to avoid overloading the service [14, p. 88, 101].
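The three delay strategies mentioned above can be compared with a minimal sketch. The function names, the base delay of 0.5 seconds and the cap of 30 seconds are illustrative assumptions, not values taken from the cited sources.

```python
def fixed_delay(attempt: int, base: float = 0.5) -> float:
    """Wait the same time before every retry."""
    return base


def linear_delay(attempt: int, base: float = 0.5) -> float:
    """Wait grows by `base` seconds per retry (attempt is counted from 1)."""
    return base * attempt


def exponential_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Wait doubles with every retry, but never exceeds `cap` seconds."""
    return min(cap, base * 2 ** (attempt - 1))


if __name__ == "__main__":
    # Print the delay each strategy would choose for the first five retries.
    for attempt in range(1, 6):
        print(attempt, fixed_delay(attempt), linear_delay(attempt), exponential_delay(attempt))
```

The cap is a design choice rather than part of the pattern: without it, the exponential delay can quickly grow beyond any useful waiting time.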

To prevent collisions between clients that would otherwise retry on the same schedule, consider adding jitter (a random variation) to the delay [106].
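One common way to add jitter, sketched below, is to pick the delay uniformly at random between zero and the exponential bound (often called "full jitter"). This is only one possible variant; the function name, the default values and the optional `rng` parameter (used to make the randomness reproducible in tests) are assumptions made for illustration.

```python
import random
from typing import Optional


def jittered_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a random delay between 0 and the capped exponential bound."""
    if rng is None:
        rng = random.Random()
    # Upper bound doubles with every retry (attempt is counted from 1).
    upper = min(cap, base * 2 ** (attempt - 1))
    return rng.uniform(0.0, upper)
```

Because each client draws its own random delay, clients that failed at the same moment are unlikely to retry at the same moment.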

The service might also estimate how long to wait, e.g. if it employs a Rate-Limiter (see § 7.5).
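One possible reading of this is that the called service reports its estimate to the caller. As a sketch under that assumption, and assuming further that the request is made over HTTP, the estimate could be carried in the standard Retry-After header, which holds either a number of seconds or an HTTP date. The function name and the `fallback` parameter are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def wait_from_retry_after(
    value: Optional[str],
    fallback: float,
    now: Optional[datetime] = None,
) -> float:
    """Turn a Retry-After header value into a number of seconds to wait."""
    if value is None:
        return fallback  # no hint given, use the caller's own backoff delay
    value = value.strip()
    if value.isdigit():
        return float(value)  # form 1: delay in seconds
    try:
        when = parsedate_to_datetime(value)  # form 2: HTTP date
    except (TypeError, ValueError):
        return fallback  # unparseable hint, ignore it
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

The caller would typically pass its own computed backoff delay as `fallback`, so that the hint only overrides the local schedule when the service actually provides one.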

Keep retrying until the failure is recognised as non-transient, a timeout expires, or the maximum number of retries is reached (see fig. 16).

Figure 16: Retry
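The stopping conditions can be combined into a single loop, as in the following minimal sketch. It assumes that transient failures are raised as a hypothetical `TransientError` exception and that every other exception counts as non-transient; the parameter names and defaults are illustrative, and the `sleep` and `clock` parameters exist only so the loop can be tested without real waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(
    operation: Callable[[], T],
    max_retries: int = 5,
    timeout: float = 60.0,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call `operation`, retrying transient failures with jittered backoff."""
    deadline = clock() + timeout
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            # Any other exception type is treated as non-transient
            # and propagates to the caller without a retry.
            attempt += 1
            delay = random.uniform(0.0, min(cap, base * 2 ** (attempt - 1)))
            # Give up when the retry budget or the overall timeout is exhausted.
            if attempt > max_retries or clock() + delay > deadline:
                raise
            sleep(delay)
```

A caller might wrap a single request as `retry(lambda: send_request(payload))`, where `send_request` is a placeholder for the actual call to the external resource.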

7.3.3 Potential issues

If a service fails due to overload, retrying the request might exacerbate the problem. If retries continue over a long period, a poorly designed backoff algorithm might lead to collisions [106] (also known as the thundering herd problem [107]).

Depending on the implementation, this pattern might introduce delays or increased load (especially if the problem is mistakenly categorised as transient). It may also result in duplicate requests if the network caused the failure and idempotency was not implemented correctly. If the calling service has its own consumers, it might be better to let it react to the failure instead [12, p. 93].

7.3.4 Example

ExampleEshop sends out transactional emails and newsletters to its users. While this process is not idempotent, sending an email twice is not a critical issue – it is better than not sending it at all, especially if it contains important information such as an invoice or a password reset link. The system therefore uses a retry mechanism that resends the email if delivery fails because of network problems or a temporary issue with the recipient’s email provider.

7.3.6 Further Reading