7.4 Health Monitoring
Proactively check and react to service failures.

Health monitoring is a broad topic covered by many different but related patterns. This pattern is based on the following: Health Endpoint Monitoring by Microsoft [108], Watchdog by Fehling et al. [15, p. 260, 109], Service Watchdog by Rotem-Gal-Oz [13, p. 67], Health Check API by Richardson [20, p. 366, 110], and HeartBeat by Joshi [5, p. 93, 111].
It is also part of the Control Bus pattern and similar to the Test Message pattern, both by Hohpe et al. [4, pp. 477, 498, 112, 113].
7.4.1 Context
The system needs to adhere to availability or reliability requirements. Failures in the system need to be detected and either automatically resolved or reported.
7.4.2 Solution
Proactively monitor the health of services (see fig. 17). The services either implement a health check endpoint or send heartbeats to a central service. The fidelity of the health checks can vary from simple up/down checks to more complex checks that verify the service’s resource usage, response time, internal state, or dependencies. This makes it possible to cover a wider range of potential issues than just whether the service is running.
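As a rough illustration, the following sketch shows what such an endpoint might look like using only the Python standard library. The check_database and check_message_broker functions, the /health/live and /health/ready paths, and the port are assumptions made for the example; a real service would call its actual dependency clients here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    """Hypothetical dependency check, e.g. run a trivial test query."""
    return True


def check_message_broker() -> bool:
    """Hypothetical dependency check, e.g. open and close a broker connection."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/live":
            # Shallow check: the process is up and able to serve HTTP.
            self._respond(200, {"status": "UP"})
        elif self.path == "/health/ready":
            # Deeper check: only report healthy if the dependencies respond.
            checks = {"database": check_database(), "broker": check_message_broker()}
            healthy = all(checks.values())
            self._respond(200 if healthy else 503,
                          {"status": "UP" if healthy else "DOWN", "checks": checks})
        else:
            self._respond(404, {"error": "not found"})

    def _respond(self, code: int, body: dict) -> None:
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```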
In case of a failure or timeout, the system can automatically respond by restarting the service, redirecting traffic to a healthy instance, or alerting an operator.
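The monitoring side could then look roughly like the following sketch. The service URLs, polling interval, failure threshold, and the restart_service, remove_from_rotation, and alert_operator hooks are hypothetical placeholders for the actual infrastructure integrations (orchestrator, load balancer, alerting system).

```python
import time
import urllib.error
import urllib.request

# Hypothetical registry of services and their health endpoints.
SERVICES = {"orders": "http://orders:8080/health/ready"}
POLL_INTERVAL_S = 10   # how often to poll each service
TIMEOUT_S = 2          # a slow answer is treated as a failure
ALERT_THRESHOLD = 3    # repeated failures escalate to a human


def is_healthy(url: str) -> bool:
    """Poll the health endpoint; treat timeouts and non-200 answers as failures."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


def restart_service(name: str) -> None:
    """Hypothetical: ask the orchestrator to restart the failed instance."""


def remove_from_rotation(name: str) -> None:
    """Hypothetical: tell the load balancer to stop routing traffic to it."""


def alert_operator(name: str) -> None:
    """Hypothetical: notify the operations team."""


def monitor() -> None:
    failures = {name: 0 for name in SERVICES}
    while True:
        for name, url in SERVICES.items():
            if is_healthy(url):
                failures[name] = 0
                continue
            failures[name] += 1
            remove_from_rotation(name)
            restart_service(name)
            if failures[name] >= ALERT_THRESHOLD:
                alert_operator(name)
        time.sleep(POLL_INTERVAL_S)


if __name__ == "__main__":
    monitor()
```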
7.4.3 Potential issues
Imperfectly designed or tuned health checks can lead to false positives, false negatives, or performance problems. A check that is too strict can cause unnecessary service restarts or traffic redirection, and it can degrade performance if it is run too frequently or performs expensive tests.
On the other hand, checks that are too lenient or test too little can lead to undetected issues, rendering the monitoring useless.
Health check endpoints can also introduce security risks by exposing sensitive information or creating a vector for denial-of-service attacks. These risks can be mitigated by endpoint authentication or obfuscation [108].
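As a minimal sketch of the authentication option, the health endpoint could require a bearer token before answering; the HEALTH_TOKEN environment variable and the is_authorized helper below are assumptions for illustration only.

```python
import hmac
import os

# Shared secret distributed only to the monitoring system (assumed configuration).
HEALTH_TOKEN = os.environ.get("HEALTH_TOKEN", "")


def is_authorized(authorization_header: str | None) -> bool:
    """Accept only requests carrying the expected bearer token."""
    if not authorization_header or not authorization_header.startswith("Bearer "):
        return False
    presented = authorization_header.removeprefix("Bearer ")
    # Constant-time comparison avoids leaking the token through timing differences.
    return bool(HEALTH_TOKEN) and hmac.compare_digest(presented, HEALTH_TOKEN)
```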
False negatives can also occur due to failures in the monitoring system itself. Measures such as redundancy or self-checks should be taken to ensure the monitoring system itself operates correctly.
7.4.4 Example
The eshop uses health monitoring for all its running services. If a service has an API, it exposes an authenticated /health endpoint, which is polled regularly. This endpoint is configured to check the service’s dependencies, such as the database or the message broker. For databases, it runs a test query. The system also monitors and logs the response times.
If a service fails, the monitoring system signals the infrastructure to restart it and the load balancer to stop routing traffic to it. If it fails repeatedly or is under long-term high load, the system sends an alert to the operations team. For the parts critical to the operation of the eshop, it also employs more sophisticated checks that run less frequently: for the search functionality, it runs a pre-defined query and verifies that the results are as expected; for order processing, it places a test order and verifies that it completes successfully.
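A rough sketch of such synthetic checks, assuming hypothetical run_search and place_test_order clients for the eshop’s own APIs, could look as follows; the queried product and the expected results are placeholders.

```python
import logging

logger = logging.getLogger("synthetic-checks")


def run_search(query: str) -> list[str]:
    """Hypothetical client call to the search service."""
    return ["example-product"]


def place_test_order(product_id: str) -> bool:
    """Hypothetical client call that places (and cleans up) a test order."""
    return True


def check_search() -> bool:
    """Run a pre-defined query and verify the results are as expected."""
    results = run_search("example-product")
    ok = "example-product" in results
    if not ok:
        logger.error("search check failed, unexpected results: %s", results)
    return ok


def check_order_processing() -> bool:
    """Place a test order and verify it completes successfully."""
    ok = place_test_order("example-product")
    if not ok:
        logger.error("order processing check failed")
    return ok


if __name__ == "__main__":
    # Run infrequently (e.g. from a scheduler) because these checks are expensive.
    passed = check_search() and check_order_processing()
    print("synthetic checks passed" if passed else "synthetic checks FAILED")
```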
7.4.5 Related patterns
- Health monitoring can be linked to a Circuit Breaker to proactively respond to service state instead of relying on requests to fail.
- The Leader and Followers pattern can be used to autonomously detect and recover from failures if the system depends on a single coordinating instance.
7.4.6 Further reading
- Nygard [12, p. 162] on transparency
- Newman [3, p. 305] on monitoring and observability
- Richardson on designing observable services [20, p. 364]
- Wilder on how to react to node failures [14, p. 93]