
Default values are often suggested as part of a failover mechanism for microservices. At a high level, the operations of any service (microservices here) can be broadly classified as Reads or Writes.

  1. Returning default values for Write operations does not sound reliable.
  2. Read operations can roughly be categorized by the size of the data they return:

    • Reads returning a small/medium amount of data
    • Reads returning a huge amount of data

Let's assume the source of the data is a highly available (HA) cache, used for performance and round-trip avoidance, which has its own refresh cycle.

Now, when the cache is down, the failover plan can be:

  • When the data size is small, going back to the actual system to fetch the data over a real-time invocation (assuming the call takes milliseconds) seems fine.
  • When the data size is huge and a real-time invocation takes several minutes, doing it as a synchronous call does not seem correct.

The solutions I can think of are as follows:

  1. Keep the actual data in persistent storage backed by high availability and use it as a fallback. Data availability is then governed by the HA policy of the persistent storage.
  2. Use some kind of request cache. It can have a fixed upper size limit and keep only the latest requests. It can be reset periodically (at the same frequency as the HA cache refresh) after checking the health of the HA cache: if the HA cache is available, the request cache is reset; otherwise it retains its last state. This essentially moves the data-availability assurance to the platform hosting the microservice(s).
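Option 1 can be sketched roughly as a read path that prefers the HA cache and falls back to the durable copy. Everything here is hypothetical: `cache` and `durable_store` stand in for whatever clients the real system uses (an IMDG client plus a replicated database, for instance).

```python
# Sketch of option 1: read-through with a persistent-store fallback.
# `cache` and `durable_store` are hypothetical client objects; a real
# deployment would substitute its actual IMDG and HA-backed store clients.

class CacheUnavailable(Exception):
    """Raised by the cache client when the cache service is unreachable."""

def read_lookup_data(key, cache, durable_store):
    """Serve from the HA cache; fall back to persistent storage if it is down."""
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except CacheUnavailable:
        pass  # cache is down: fall through to the durable copy
    # From here on, availability is governed by the store's own HA policy.
    return durable_store.get(key)
```

The point of the shape is that the fallback is a second read path, not a second source of truth: the durable store already holds the data the cache was loaded from.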

It would be really helpful to know from the community which of the above is a better fit, or whether there is a better way of handling the problem described in (2).

Billal Begueradj
Divs
  • You are overthinking this. If the response takes minutes to generate on a cache miss, then you need to cater for that happening in general usage, i.e. with an async process of some sort. – Ewan Dec 28 '17 at 09:13
  • @Ewan Please refer to my further explanation to Arseni's answer and let me know what you think. – Divs Dec 28 '17 at 09:54
  • He's right, you don't need the cache. Just put the results in a DB. – Ewan Dec 28 '17 at 09:57
  • @Ewan Do you mean to put it in a DB to handle fallback, or to use a DB instead of a cache? Why do you think I don't need a cache? I'm using it for performance reasons only. The nature of the use case allows me to use it as a primary source of data. I am keeping the data in the cache up to date anyway. – Divs Dec 28 '17 at 11:15
  • The trouble with your question is that either your architecture is just completely wrong, or there is some complexity to your setup that isn't coming across. You are focused on the cache and don't talk about the data at all. Have you measured performance with just a database? What is it about the cache that makes it faster? – Ewan Dec 28 '17 at 12:34
  • @Ewan The cache is an IMDG with object-query capabilities. If I am not using the data for anything other than reads, do we really need to invest in creating and maintaining complex data models of many (read: hundreds of) attributes, and in writing similarly complex queries to fetch the data, when we have a simple and highly performant approach in using an IMDG? – Divs Dec 28 '17 at 13:53
  • You need to put this stuff in the question if you want a good answer. It sounds like you are not using the cache for what it was designed for and are now trying to justify your hack. – Ewan Dec 28 '17 at 14:18

1 Answer


You're overthinking it.

A distributed cache's goal is to optimize performance, nothing more. If you expect data to always be in the cache, your design is flawed. You may be missing the data for a number of reasons far more common than the unavailability of the cache service:

  • The data was not cached yet,
  • The data became obsolete and should be regenerated,
  • The cache service was low on memory and evicted the item.

For this reason, you have to consider the scenario of the data not being in the cache anyway. In your case, this means you should explicitly handle the case of a request taking minutes (by doing it asynchronously).

The only additional problem you get when the cache service itself goes down is not with long requests, but with many short ones. If you do a few hundred requests per second to the cache (1 ms each) and the cache stops responding, so that every request now takes 500 ms (timeout), you may exhaust the pool of HTTP connections, not counting the consequences for your users' experience. To protect yourself from this scenario, use the circuit breaker pattern.
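As a rough illustration of that pattern, here is a minimal hand-rolled circuit breaker; the threshold and cool-down values are assumptions, and in practice you would reach for an existing library (pybreaker, resilience4j, etc.) rather than write this yourself:

```python
import time

# Minimal circuit-breaker sketch. Assumed policy: trip after
# `failure_threshold` consecutive failures, fail fast for `reset_timeout`
# seconds, then let one probe call through.

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()      # fail fast: no timeout burned per call
            self.opened_at = None      # half-open: allow one probe call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            return fallback()
        self.failures = 0              # success closes the circuit again
        return result
```

The key property is that once the circuit is open, the slow 500 ms timeout is no longer paid on every request; callers get the fallback immediately.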


Following your comment, it seems further explanation is needed. One difficulty with your question is that you're talking about huge amounts of data, assuming that it takes a matter of seconds or milliseconds to get this data from the cache but a matter of minutes to get it without the cache. If it were only a question of data size, it would take minutes to get the data from the cache as well.

Let's imagine another scenario. The response is relatively small and can be downloaded in a matter of milliseconds, but it takes minutes to generate the response in the first place. Here, the cache represents a huge performance improvement.

In this case, the service may respond with an HTTP 202 Accepted, indicating that it has started generating the response and that the response will be available later on. It effectively means that the client has to handle two cases, the one where the answer is ready (HTTP 200) and the one where the response is being regenerated (HTTP 202), and act accordingly.
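That 200/202 contract can be sketched framework-agnostically as a handler returning a status and body; `cache` and `start_generation` are hypothetical stand-ins for the real cache client and the async job trigger:

```python
# Sketch of the 200/202 contract: serve the cached response if ready,
# otherwise kick off regeneration and tell the client to come back.
# `cache` and `start_generation` are hypothetical; a real service would
# use its web framework's response types and job queue.

def handle_report_request(report_id, cache, start_generation):
    """Return (status_code, body) for a resource that is slow to generate."""
    body = cache.get(report_id)
    if body is not None:
        return 200, body                # ready: serve the generated response
    start_generation(report_id)         # trigger async regeneration
    # The client should poll again later (often signalled via a
    # Location or Retry-After header in a real HTTP response).
    return 202, {"status": "generating", "retry_after_seconds": 60}
```

The client side then simply branches on the status code: consume the body on 200, schedule a retry on 202.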

Does persistent storage work as a fallback mechanism? Sure. Personally, I would prefer not to use one, since it makes the whole system quite complex. As soon as you store data in two different systems (primary cache and fallback storage), maintenance tends to become too costly, and you have to handle invalidation properly (for instance, what happens if the primary cache is invalidated, but the attempt to invalidate the data in persistent storage fails?). Moreover, how far would you go with that? Don't you also need to handle the case where both the cache and the persistent storage are empty?

In my opinion, the goal should be to:

  • Make sure the cache service is reliable. If it's down for an hour once per day, you have more important things to do than think about fallback strategies.

  • Ensure the case where the cache service is down (or empty) is handled, i.e. that the user sees something other than “Server error”. Depending on the specific case, this may be as simple as an explicit, helpful error message explaining what happened and what the user can do next. Or it may be a mechanism where the user has to wait a few minutes for the content to be regenerated. Or it may be a very complex fallback mechanism ensuring 99.99999% reliability. It's up to you to determine whether it's worth the effort.

Arseni Mourzenko
  • Thanks. The data I refer to here is lookup data, which doesn't change over a fixed time period, and the cache refresh is timed accordingly to have the latest data. The refresh is a periodic batch job, and the microservice treats the cache as the primary source of the data. A circuit breaker **will be** present on any and all of the primary/alternate routes I take, be it the happy path, a database, a real-time invocation, or request caching. In cases with huge data where the data-change frequency is low, as I explained, would you prefer persistent storage, request caching, or something different? – Divs Dec 28 '17 at 09:34
  • Thanks a lot for being patient in answering my question. Yes, it is indeed the generation of the response that I failed to put in the question. Apologies for that. Persistent storage does indeed make the situation complex... I will try to update here which route I took and how it behaves in production... Thanks again! – Divs Dec 28 '17 at 18:54