faultMonitoringStrategy
Description
Monitors the outcome of requests to each node to quickly disable/re-enable faulty ones.
WHY THIS CLASS
This is a drop-in replacement for the default {@link RoundRobinStrategy}. In fact, as long as all nodes respond correctly, it behaves identically. Behaviour only changes once nodes fail. Whereas the RoundRobin keeps dispatching to faulty nodes, causing delays for the service user, this strategy detects the issue and favours functional nodes.
FORECAST PREDICTION USING "PERSISTENCE METHOD"
Predicting tomorrow's weather by saying it will be equal to today's works well in many areas of the world. Similarly, predicting that another service request to a node will succeed after a success, or fail after a failure, is very likely to be correct. That is the main concept behind the logic and algorithm used within this class.
HOW IT WORKS
Once a node returns a failure, a fault profile for that node is created and the node is instantly considered faulty. From then on, every outcome (success/failure) for that node is monitored. If enough successive calls succeed, the node's bad reputation is cleared. If the node is not used anymore (either there are no requests at all, or none go to that node because it is faulty and there are enough good nodes), its bad reputation is cleared after a configurable amount of time. When there are enough functional nodes (configurable ratio), only the functional ones are used with the simplistic round-robin strategy. Otherwise all nodes are used, and the selection is a weighted-chance operation based on recent success rate.
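The following sketch illustrates that bookkeeping. It is not the actual implementation; the class and field names, the thresholds, and the moving-average update are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; names, fields and thresholds are assumptions, not the real code. */
class FaultMonitoringSketch {

    /** Bookkeeping for one node (host/port) that has shown at least one failure. */
    static final class FaultProfile {
        volatile long lastFailureMillis = System.currentTimeMillis(); // read by the periodic clearing task (not shown)
        final AtomicInteger consecutiveSuccesses = new AtomicInteger();
        volatile double recentSuccessRate; // rough moving average, 0..1
    }

    private static final int SUCCESSES_TO_CLEAR = 10;      // assumed threshold
    private static final double MIN_FLAWLESS_RATIO = 0.5;  // see minFlawlessServerRatioForRoundRobin

    private final List<String> allNodes;
    private final Map<String, FaultProfile> faultyNodes = new ConcurrentHashMap<>();
    private final AtomicInteger roundRobinCounter = new AtomicInteger();

    FaultMonitoringSketch(List<String> allNodes) {
        this.allNodes = allNodes;
    }

    /** Called after each request with the observed outcome. */
    void recordOutcome(String node, boolean success) {
        if (!success) {
            // First failure creates the profile; the node is instantly considered faulty.
            FaultProfile p = faultyNodes.computeIfAbsent(node, n -> new FaultProfile());
            p.lastFailureMillis = System.currentTimeMillis();
            p.consecutiveSuccesses.set(0);
            p.recentSuccessRate *= 0.8; // decay towards 0 on failure
        } else {
            FaultProfile p = faultyNodes.get(node);
            if (p == null) {
                return; // node has no bad reputation, nothing to track
            }
            p.recentSuccessRate = p.recentSuccessRate * 0.8 + 0.2; // move towards 1 on success
            if (p.consecutiveSuccesses.incrementAndGet() >= SUCCESSES_TO_CLEAR) {
                faultyNodes.remove(node); // enough successive successes: bad reputation cleared
            }
        }
    }

    /** Picks the next node to dispatch to. */
    String chooseNode() {
        List<String> flawless = new ArrayList<>(allNodes);
        flawless.removeAll(faultyNodes.keySet());
        if ((double) flawless.size() / allNodes.size() >= MIN_FLAWLESS_RATIO) {
            // Enough healthy nodes: plain round-robin over the flawless ones only.
            int i = Math.floorMod(roundRobinCounter.getAndIncrement(), flawless.size());
            return flawless.get(i);
        }
        // Too many faulty nodes: weighted chance over all nodes by recent success rate.
        double[] weights = new double[allNodes.size()];
        double total = 0;
        for (int i = 0; i < allNodes.size(); i++) {
            FaultProfile p = faultyNodes.get(allNodes.get(i));
            weights[i] = (p == null) ? 1.0 : Math.max(p.recentSuccessRate, 0.01);
            total += weights[i];
        }
        double r = ThreadLocalRandom.current().nextDouble(total);
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0) {
                return allNodes.get(i);
            }
        }
        return allNodes.get(allNodes.size() - 1); // numeric edge case fallback
    }
}
```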
WHAT IS A FAULT
A fault is when the destination replied with a valid 5xx HTTP status code, or when an exception (such as a ConnectException) was thrown. Everything else, including 4xx codes, is considered a success.
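Expressed as a small check (a sketch; the method name and parameters are illustrative, not the actual API):

```java
/** Sketch: classifies an outcome as a fault or a success per the rule above. */
static boolean isFault(Integer httpStatus, Throwable thrown) {
    if (thrown != null) {
        return true; // an exception such as ConnectException counts as a fault
    }
    // Only 5xx server errors count as faults; everything else, including 4xx, is a success.
    return httpStatus != null && httpStatus >= 500 && httpStatus <= 599;
}
```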
PER NODE, NOT PER DESTINATION
Success status could be monitored per node (host/port) or per destination (host/port/servicename). In practice, most failures affect all services on a node; it is rare, but entirely possible, that only one service on a node is faulty. This implementation currently monitors per node, for technical reasons: per-destination monitoring is not possible with the information that is currently available. If it were available, I'm not sure which one would be the better choice; making it configurable might be nice. If monitoring happened per destination, and one service were detected as faulty when in fact the whole node is down, that information would not be freely available to the other destinations... they would have to figure it out independently, unless some more complicated functionality were built in.
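To make the distinction concrete, a per-node key would look roughly like this (a sketch, assuming a recent Java version; the name is made up):

```java
/** Sketch of a per-node key; a per-destination variant would also carry the service name. */
record NodeKey(String host, int port) { }
```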
GOALS OF THIS CLASS
Simple, super-fast, with super-low memory use. The goal is to provide a great service experience to the API user. Over-complicating things here and introducing possible bugs or bottlenecks must be avoided.
WHAT IT'S NOT
It does not do response-time monitoring to identify slow/laggy servers, or to favour faster ones. There can be different kinds of services on the same hosts, both simple ones and complex, expensive ones, and we don't have that kind of information here. Another, similar DispatchingStrategy could implement such logic.
LIMITATIONS
No background checking of faulty nodes. For certain kinds of failures, including ConnectException and UnknownHostException, a background service could keep probing a faulty node and only re-enable it once it responds again. For other cases, there could be a pluggable BackgroundUptimeCheck interface, allowing a service implementor to write a check that fits his needs, for example by sending a real service request that does no harm. Automatically re-sending previously failed requests in the background to see if the service is back online is a bad idea... think payment service.
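Such an interface could look roughly like this (purely hypothetical; it does not exist in this class):

```java
/** Hypothetical plug-in point sketched from the idea above; not part of this class. */
interface BackgroundUptimeCheck {
    /**
     * Probes the given node with a harmless request.
     * @return true if the node appears healthy again and may be re-enabled
     */
    boolean isUp(String host, int port);
}
```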
Can be used in
spring:beans, balancer
Attributes
Name | Required | Default | Description | Example |
---|---|---|---|---|
clearFaultyProfilesByTimerAfterLastFailureSeconds | false | 300000 | Once this much time [milliseconds] has passed since the last fault, the node is cleared from its bad history. The node may have served requests successfully since then, or it may not have been called at all. If it was not called, it's quite possible that the node is still faulty, but we don't know. In that case the node is used again, and if it fails it instantly gets a fault profile again. | |
clearFaultyTimerIntervalSeconds | false | 30000 | A TimerTask runs at this interval [milliseconds] to check whether there are any fault profiles to clear. | |
minFlawlessServerRatioForRoundRobin | false | 0.5 | If the ratio of "flawless" servers to the total number of servers is at least this value, then only the flawless servers are used, with a round-robin strategy; the faulty ones are ignored. If too many servers have had issues, all servers (faulty and flawless) are used with the weighted-chance strategy. | |