Photo by Brett Jordan
One of the key characteristics of a microservices architecture is inter-service communication. We split a monolithic application into multiple smaller applications called microservices. Each microservice is responsible for a single feature or domain and can be deployed, scaled, and maintained independently.
Since microservices are distributed in nature, various things can go wrong at any point in time. The network over which we access other services, or the services themselves, can fail. There can be intermittent network connectivity errors or firewall issues. Individual services can fail due to unavailability, coding issues, out-of-memory errors, deployment failures, hardware failures, and so on. To make our services resilient to these failures, we adopt the retry pattern, which is one of the stability patterns.
Retry Pattern
The idea behind the retry pattern is quite simple: if service A calls service B and receives an unexpected response to a request, service A sends the same request to service B again, hoping to get the expected response.
Retry Pattern Representation
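To make the idea concrete, here is a minimal Python sketch, assuming service B exposes an HTTP endpoint and using the requests library; the endpoint URL, payload shape, and single-retry policy are illustrative assumptions, not part of the pattern itself.

```python
import requests  # third-party HTTP client, used here only for illustration


def call_service_b(payload):
    # Hypothetical endpoint on service B; replace with the real URL.
    url = "http://service-b/api/orders"
    response = requests.post(url, json=payload, timeout=2)
    response.raise_for_status()  # treat any non-2xx status as a failure
    return response.json()


def call_with_retry(payload):
    try:
        return call_service_b(payload)
    except requests.RequestException:
        # The first attempt failed unexpectedly; send the same request
        # again, hoping for the expected response this time.
        return call_service_b(payload)
```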
There are several retry strategies that can be applied, depending on the type of failure and the nature of the requirements.
Immediate Retry
This is the most basic strategy. The calling service handles the unexpected failure and immediately makes the request again. It can be useful for unusual faults that occur intermittently; in such cases, the chances of success are high simply by retrying.
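A possible sketch of this strategy in Python, assuming a generic `operation` callable that raises an exception on failure (the function name and the attempt count are illustrative):

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def immediate_retry(operation: Callable[[], T], max_attempts: int = 3) -> T:
    """Retry the operation right away, with no delay between attempts."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # in practice, catch only transient error types
            last_error = exc      # no sleep: retry immediately
    raise last_error
```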
Retry After Delay
In this strategy, we introduce a delay before retrying the service call, hoping that the cause of the fault will have been rectified by then. Retrying after a delay is an appropriate strategy when a request times out because the service is busy or because of a network-related issue.
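A sketch of a fixed-delay retry, again assuming a generic `operation` callable; the two-second delay and three attempts are arbitrary illustrative values:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_after_delay(operation: Callable[[], T],
                      max_attempts: int = 3,
                      delay_seconds: float = 2.0) -> T:
    """Wait a fixed delay between attempts, giving the fault time to clear."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                  # out of attempts: propagate the error
            time.sleep(delay_seconds)  # fixed pause before the next attempt
```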
Sliding Retry
In this strategy, the caller keeps retrying the service call, adding an incremental time delay on each subsequent attempt. For example, the first retry may wait 500 ms, the second 1000 ms, and the third 1500 ms, until the retry count is exceeded. By adding an increasing delay, we reduce the number of retries and avoid putting additional load on a service that is already overloaded.
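A sketch of the sliding (linearly increasing) delay, assuming the same kind of `operation` callable; the 500 ms base delay matches the example above:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def sliding_retry(operation: Callable[[], T],
                  max_attempts: int = 4,
                  base_delay_seconds: float = 0.5) -> T:
    """Grow the delay linearly between attempts: 0.5 s, 1.0 s, 1.5 s, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay_seconds * attempt)  # 0.5 s, 1.0 s, 1.5 s, ...
```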
Retry with Exponential Backoff
In this strategy, we take the sliding retry strategy and ramp up the retry delay exponentially. If we started with a 500 ms delay, we would retry again after 1000 ms, then 2000 ms, doubling the delay each time. Here we are giving the service progressively more time to recover before we try to invoke it again.
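A sketch of exponential backoff under the same assumptions; doubling is just one common growth factor, and real implementations often add random jitter so that many callers do not retry in lockstep:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T],
                       max_attempts: int = 4,
                       base_delay_seconds: float = 0.5) -> T:
    """Double the delay after each failed attempt: 0.5 s, 1.0 s, 2.0 s, ..."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay_seconds * (2 ** attempt))
```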
Abort Retry
As we understand, the retry process cannot go on forever. We need a threshold on the maximum number of retry attempts for a failed service call. We maintain a counter, and when it reaches the threshold, the best strategy is to abort the retry process and let the error propagate to the calling service.
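A sketch of aborting after a threshold, assuming a hypothetical RetryAbortedError wrapper so the caller can distinguish an exhausted retry loop from a single failure:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryAbortedError(Exception):
    """Hypothetical error raised when the retry threshold is reached."""


def retry_or_abort(operation: Callable[[], T],
                   max_attempts: int = 5,
                   delay_seconds: float = 1.0) -> T:
    """Count attempts and abort once the threshold is reached,
    letting the error propagate to the calling service."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation()
        except Exception as exc:
            if attempts >= max_attempts:
                # Abort: stop retrying and surface the failure to the caller.
                raise RetryAbortedError(
                    f"giving up after {attempts} attempts") from exc
            time.sleep(delay_seconds)
```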
Conclusion
The retry pattern allows the calling service to retry failed attempts in the hope that the service will respond within an acceptable time.
By varying the interval between retries, we give the dependent service more time to recover and respond to our request.
It is recommended to keep track of failed operations, as this is very useful information for spotting recurring errors and for sizing the required infrastructure, such as thread pools and the threading strategy.
At some point, we just need to abort the retry, acknowledge that the service is not responding, and report the error back to the calling service.