Wednesday, March 24, 2021

Latency and Fault Tolerance for Distributed Systems

[Image: Hystrix by Netflix]

Microservices

Microservices, also known as the microservices architecture, is an architectural style that structures an application as a collection of services that are:

  • Highly Maintainable and Testable
  • Loosely Coupled
  • Independently Deployable
  • Organized around Business Capabilities
  • Owned by a Small Team

The microservices architecture enables the rapid, frequent and reliable delivery of large, complex applications. It also enables an organization to evolve its technology stack.

The decentralization of business logic increases flexibility and, most importantly, decouples the dependencies between two or more components. This is one of the major reasons why many companies are moving from a monolithic architecture to a microservices architecture.

What's Hystrix?

Hystrix is designed to do the following:

  • Give protection from and control over latency and failure from dependencies accessed (typically over the network) via third-party client libraries.
  • Stop cascading failures in a complex distributed system.
  • Fail fast and rapidly recover.
  • Fallback and gracefully degrade when possible.
  • Enable near real-time monitoring, alerting, and operational control.

In simple terms, it isolates external dependencies so that a failure in one does not affect the others.

Hystrix provides two isolation strategies, thread-based and semaphore-based isolation, and it is left to you to decide which one works best for your requirements.

Thread Based Isolation

In this isolation method, the external call is executed on a separate thread (a non-application thread), so that the application thread is not affected by anything that goes wrong with the external call.

Hystrix uses the bulkhead pattern. In general, the goal of the bulkhead pattern is to prevent a fault in one part of a system from taking the entire system down. The term comes from ships, where the hull is divided into separate watertight compartments so that a single breach floods only one compartment instead of the entire ship.

[Image: Hystrix thread isolation]

Hystrix uses a thread pool per dependency for isolation. For every call, only threads from the pool of the corresponding dependency are used. This ensures that only the pool of the failing dependency gets exhausted, leaving the others unaffected. The maximum number of concurrent requests allowed is determined by the defined thread-pool size.

It’s very important to set the right thread-pool size for each dependency. A very small pool may send requests into fallback even though the downstream is responding in time. On the other hand, a very large one may cause all the threads to be blocked when the downstream is running with high latencies, thereby degrading application performance.

How to find the thread-pool size?

As mentioned in the Hystrix documentation, the correct thread-pool size can be computed with the following formula:

RPS = Requests per second at which we are calling the downstream.
P99 = The 99th percentile latency of the downstream.
Pool Size = (RPS * P99) + some breathing room to cater for spikes in RPS.

Let’s take an example: the cart service depends on the product service to get product details. Assume the cart service calls the product service at 30 RPS, and the product service latencies are P99 = 200 ms, P99.5 = 300 ms and median = 50 ms.

30 RPS x 0.2 seconds = 6, plus breathing room = 10

In the above case, the thread-pool size is defined as 10. It is sized at 10 to handle a burst of 99th-percentile requests, but when everything is healthy this thread pool will typically have only 1 or 2 active threads at any given time, serving mostly 50 ms median calls. It also ensures that only 10 requests are processed concurrently at any point in time; any concurrent request beyond the 10th goes directly into the fallback mechanism.

Hystrix also provides an option to configure a queue size, where requests are queued up to a certain number after the thread pool is full. The queued requests are processed as soon as threads become free. If requests exceed the queue size, they go directly into the fallback mechanism.
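For illustration, here is a minimal sketch of how a per-dependency pool of size 10 and a small queue could be configured with the hystrix-javanica annotations. The class name, thread-pool key, method names and queue values are assumptions chosen for this example, not taken from the services discussed in this post.

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.stereotype.Service;

@Service
public class ProductClient {

    // Sketch only: a dedicated thread pool for the product-service dependency,
    // sized using the formula above. Names and queue values are illustrative.
    @HystrixCommand(
            fallbackMethod = "fallbackProductDetails",
            threadPoolKey = "productServicePool",
            threadPoolProperties = {
                    // (30 RPS x 0.2 s P99) + breathing room = 10 threads
                    @HystrixProperty(name = "coreSize", value = "10"),
                    // requests queue up here once all 10 threads are busy
                    @HystrixProperty(name = "maxQueueSize", value = "5"),
                    @HystrixProperty(name = "queueSizeRejectionThreshold", value = "5")
            })
    public String getProductDetails(String productId) {
        // the real external call (e.g. via RestTemplate) would go here
        return "product-details-for-" + productId;
    }

    public String fallbackProductDetails(String productId) {
        return "cached-or-default-details-for-" + productId;
    }
}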

Thread-based isolation also gives added protection against timeouts. If a thread in the Hystrix thread pool waits for a response for longer than a specified time (the execution timeout), the request corresponding to that thread is served by the fallback mechanism. So thread-based isolation provides two layers of protection — concurrency and timeout.

To summarise, the concurrency limit is defined such that if incoming requests exceed the limit, it implies that the downstream is not in a healthy state and is suffering from high latencies. Similarly, if requests take longer than the timeout to respond, that again indicates the downstream is not healthy because its latency is high.

Semaphore Based Isolation

In this method, the external call runs on the application thread, and the number of concurrent calls is limited by the semaphore count defined in the configuration.

The timeout functionality is not available in this method, since the external call runs on the application thread and Hystrix will not interfere with a thread pool that does not belong to it. Hence, there is only one level of protection, i.e., the concurrency limit.

You might wonder, what is the advantage of using semaphore based isolation?

Limiting concurrency is simply a way of throttling the load when a downstream is running with high latencies, and semaphore-based isolation allows us to do exactly that. Note that thread-based isolation adds an inherent computational overhead: each command execution involves the queueing, scheduling and context switching required to run the command on a separate thread. None of this is required with semaphores, making them computationally much lighter than threads.

Generally, semaphore isolation should be used when the call rate is so high that the overhead of separate threads becomes too costly, or when the downstream is trusted to respond or fail quickly, which ensures our application threads do not get stuck. The logic for setting the semaphore size is the same as the one used for setting the thread-pool size, but the overhead of semaphores is much lower and results in faster execution.
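As a rough sketch, semaphore isolation can be selected per command via the execution.isolation.strategy property. The class, method names and the limit of 10 concurrent calls below are assumptions for illustration, mirroring the earlier sizing logic.

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.stereotype.Service;

@Service
public class ProductLookupService {

    // Sketch only: the call executes on the application thread; the semaphore
    // count is the single layer of protection (no Hystrix timeout applies here).
    @HystrixCommand(
            fallbackMethod = "fallbackLookup",
            commandProperties = {
                    @HystrixProperty(name = "execution.isolation.strategy", value = "SEMAPHORE"),
                    // at most 10 concurrent callers; further callers go to the fallback
                    @HystrixProperty(name = "execution.isolation.semaphore.maxConcurrentRequests", value = "10")
            })
    public String lookup(String productId) {
        // a fast, trusted downstream call would go here
        return "details-for-" + productId;
    }

    public String fallbackLookup(String productId) {
        return "default-details-for-" + productId;
    }
}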

Catch with Thread Isolation

As you know by now, if a downstream call takes longer than the defined timeout to return a result, the caller is served with the defined fallback. Note that this fallback is not served by the Hystrix thread that was created for the external call; in fact, that thread keeps waiting until it gets a response or an exception from the downstream application. We can’t force the latent thread to stop its work; the best Hystrix can do is throw an InterruptedException. If the implementation wrapped by Hystrix doesn’t respect InterruptedException, the thread will continue its work even though the caller has already received the fallback response.

We can handle this situation by setting the HTTP read timeout as close as possible to the Hystrix timeout. Once the read timeout elapses, the external call receives an exception, which marks the Hystrix thread’s task as completed, and the thread is returned to the thread pool.
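Here is a minimal sketch of that alignment, assuming the same HttpComponentsClientHttpRequestFactory-based RestTemplate bean used later in this post; only the read timeout is tightened to match the 3-second Hystrix timeout, and the configuration class name is hypothetical. (The demo below deliberately keeps the two timeouts apart to show what happens otherwise.)

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class AlignedTimeoutRestTemplateConfig {

    // Sketch only: the read timeout matches the Hystrix command timeout
    // (execution.isolation.thread.timeoutInMilliseconds = 3000), so the blocked
    // Hystrix thread is released at roughly the same moment the fallback is served.
    @Bean
    public RestTemplate restTemplate() {
        HttpComponentsClientHttpRequestFactory requestFactory =
                new HttpComponentsClientHttpRequestFactory();
        requestFactory.setConnectTimeout(5000);
        requestFactory.setReadTimeout(3000); // aligned with the 3 s Hystrix timeout
        return new RestTemplate(requestFactory);
    }
}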

To validate the above scenarios, I have created two microservices, one named rest-producer and the other rest-consumer.

rest-producer microservice

This service exposes a simple API which takes a username as a path variable, prefixes it with "Welcome" and returns it to the caller.

@RestController
@RequestMapping("/api/users")
public class UserController {
    @GetMapping("/{username}")
    public String getWelcomeMessage(@PathVariable("username") String userName) {
        try {
            // Simulate a slow downstream: delay the response by 4 seconds
            Thread.sleep(4000);
        } catch(Exception ex) {
            ex.printStackTrace();
        }
        StringBuilder stringBuilder = new StringBuilder()
                .append("Welcome ").append(userName).append(" !\n");
        return stringBuilder.toString();
    }
}
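Assuming the producer is running locally on port 9191 (the port the consumer calls later), you can hit it directly; because of the 4-second sleep, the response arrives only after that delay:

curl -X GET 'http://localhost:9191/api/users/Rogers'

Welcome Rogers !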

As usual, you can find the source on GitHub for the rest-producer service.

rest-consumer microservice

This service exposes a simple API, /api/greet/{username}, which internally calls a service that invokes the rest-producer API /api/users/{username} using a RestTemplate.

@Slf4j
@RestController
@RequestMapping("/api/greet")
public class GreetingController {
    @Autowired
    private GreetingService greetingService;
    @GetMapping("/{username}")
    public String getGreetingMessage(@PathVariable("username") String userName) {
        try {
            log.warn("Main Thread: {} ({})",
                    Thread.currentThread().getName(),
                    Thread.currentThread().getId());
            return greetingService.getMessage(userName);
        } finally {
            log.warn("Main Thread: {} ({})",
                    Thread.currentThread().getName(),
                    Thread.currentThread().getId());
        }
    }
}

RestTemplate Bean

@Bean
public RestTemplate restTemplate() {
   HttpComponentsClientHttpRequestFactory httpComponentsClientHttpRequestFactory
         = new HttpComponentsClientHttpRequestFactory();
   httpComponentsClientHttpRequestFactory.setConnectTimeout(5000);
   httpComponentsClientHttpRequestFactory.setReadTimeout(5000);
   return new RestTemplate(httpComponentsClientHttpRequestFactory);
}

GreetingService Implementation

@Slf4j
@Service
public class GreetingService {

    @Autowired
    private RestTemplate restTemplate;

    @HystrixCommand(fallbackMethod = "fallback_getMessage", 
            commandProperties = { 
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "3000") 
    })
    public String getMessage(String userName) {
        log.warn("Hystrix Thread: {} ({})",
                Thread.currentThread().getName(),
                Thread.currentThread().getId());
        String greetMessage = restTemplate.exchange("http://localhost:9191/api/users/{username}",
                HttpMethod.GET, null,
                new ParameterizedTypeReference<String>() {}, userName).getBody();
        log.warn("Hystrix Thread: {} ({}), Msg: {}",
                Thread.currentThread().getName(),
                Thread.currentThread().getId(), greetMessage);
        return greetMessage;
    }

    public String fallback_getMessage(String userName) {
        log.warn("Hystrix Fallback Thread: {} ({})",
                Thread.currentThread().getName(),
                Thread.currentThread().getId());
        return "Fallback Message " + userName + " !";
    }

}

If you notice, I have set the connection timeout to 5 secs, the read timeout to 5 secs and the Hystrix timeout to 3 secs, and the downstream API has intentionally been made to respond after 4 secs. This demonstrates that the call from rest-consumer is served with the fallback, because the Hystrix timeout of 3 secs fires first. It’s time to hit the rest-consumer API, either from cURL or Postman.

curl -X GET 'http://localhost:9090/api/greet/Rogers'

You should see the response below:

Fallback Message Rogers !

Please take a look at the log messages on the rest-consumer service.

2021-02-13 12:12:34.945  WARN 28673 --- [nio-9090-exec-1] c.b.h.controller.GreetingController      : Main Thread: http-nio-9090-exec-1 (21)
2021-02-13 12:12:35.284  WARN 28673 --- [GreetingService-1] c.b.hystrix.service.GreetingService      : Hystrix Thread: hystrix-GreetingService-1 (42)
2021-02-13 12:12:38.279  WARN 28673 --- [ HystrixTimer-1] c.b.hystrix.service.GreetingService      : Hystrix Fallback Thread: HystrixTimer-1 (41)
2021-02-13 12:12:38.283  WARN 28673 --- [nio-9090-exec-1] c.b.h.controller.GreetingController      : Main Thread: http-nio-9090-exec-1 (21)
2021-02-13 12:12:39.394  WARN 28673 --- [GreetingService-1] c.b.hystrix.service.GreetingService      : Hystrix Thread: hystrix-GreetingService-1 (42), Msg: Welcome Rogers !

As you can see, the main thread with ID 21 works in the controller, and Hystrix spawns a thread with ID 42 to make the external call to the downstream service. Thread 42 then waits until it gets a response from the downstream service. But the Hystrix timeout is set to 3 secs, so Hystrix uses another thread, with ID 41, to return the fallback message to the controller, and the main thread with ID 21 returns the fallback response to the caller.

Observe carefully that the last log statement is printed after 4 secs: thread 42, which was waiting, finally receives the actual response from the downstream and logs it. But it is of no use, as the original request has already been answered with the fallback message. This proves that the Hystrix thread used to make an external call is blocked until it gets a response or exception from the downstream service. Hence it’s very important to set the HTTP read timeout and the Hystrix timeout to the same value, so that we avoid blocking the Hystrix thread and release it back to the pool; otherwise the thread pool will be exhausted under high traffic.

As usual, you can find the source on GitHub for the rest-consumer service.

With great decentralization comes a great need for resilient fault tolerance. As Netflix says, fault tolerance is not a feature, it’s a requirement.

Conclusion

Simply wrapping Hystrix around external calls does not guarantee effective fault tolerance. It is often observed that the Hystrix configurations we discussed above, such as thread-pool size, isolation strategy, max concurrency limit and read timeout, are not configured explicitly and are left at their default values. As explained, configuring them according to your requirements is very important to ensure effective fault tolerance for an application.

Configuration Reference

Alternative to Hystrix

