TL;DR: Build redundant, modular; test for availability; monitor closely.
Any thorough explanation would run very long, so instead I will write down the observations I have made.
Questioning the premise
"The cloud is a panacea"
Even if you go fully to the cloud, with a top cloud provider, you will still need to design your application for resilience from the ground up. AWS might replace your VM, but your application should be capable of restarting if left in the middle of a computation.
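One way to make "restart in the middle of a computation" concrete is checkpointing. A minimal sketch, assuming a hypothetical `progress.json` checkpoint file and a stand-in workload (the file name and the doubling "work" are illustrative, not from the original):

```python
import json
import os

CHECKPOINT = "progress.json"  # hypothetical checkpoint file

def process_items(items):
    """Resume a long computation from the last checkpoint after a restart."""
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            start = json.load(f)["next_index"]
    results = []
    for i in range(start, len(items)):
        results.append(items[i] * 2)  # stand-in for the real work
        with open(CHECKPOINT, "w") as f:
            json.dump({"next_index": i + 1}, f)  # persist progress after each item
    return results
```

Checkpointing after every item is deliberately naive; a real system would batch checkpoints, but the idea is the same: if the VM is replaced mid-run, the next run picks up where the last one stopped.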
"We don't want to use the cloud, because of x/y/z"
Unless you are an ultra-large organization, you are better off using cloud systems. The top three cloud providers (AWS, Microsoft, Google) employ thousands of engineers to deliver the promised SLAs and easy-to-manage dashboards. It's actually a good bargain to use them rather than spend a dime building this in-house.
Problems in scoping and design
Defining, quantifying, and then continuously measuring the availability of a service is a bigger challenge than writing solutions for availability issues.
Defining and measuring 'availability' is harder than expected
Multiple stakeholders have different views of availability, and what often happens is that the definition preferred by the highest-paid person trumps the others. Sometimes that is the correct definition, but often the ecosystem is not built around measuring it, because the ideal definition is tricky to measure, let alone monitor in real time. If you have a definition of availability that can't be monitored in real time, you will find yourself doing similar projects again and again with eerie similarities.
Stick with something that makes sense and something that can be easily monitored.
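A measurable definition can be as simple as "fraction of requests that succeeded in the last N minutes." A minimal sketch of a monitor for that definition (the class name and window size are my own, not from the original):

```python
import time
from collections import deque

class AvailabilityMonitor:
    """Availability = successful requests / total requests in a sliding window."""
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.events = deque()  # (timestamp, success) pairs

    def record(self, success, now=None):
        now = time.time() if now is None else now
        self.events.append((now, success))

    def availability(self, now=None):
        now = time.time() if now is None else now
        # Drop events that fell out of the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        if not self.events:
            return 1.0  # no traffic: treat as available
        ok = sum(1 for _, success in self.events if success)
        return ok / len(self.events)
```

The point is not the exact formula but that it can be computed continuously from data you already have (request logs), so the definition and the monitoring match.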
People underestimate the complexities of an always-available system.
To address the elephant in the room, let me say this: no multi-computer system is 100% available. It may be someday, but not with current technology.
By current technology, I am referring to our inability to send signals faster than the speed of light, and similar constraints.
All computer science engineers worth their salt know the limitations of distributed computing, yet most will not mention them in meetings, afraid they will look like noobs.
To make up for all those who stay silent, I will say it: it's complicated, so don't always trust computers.
People overestimate their/their engineer's capabilities
Unfortunately, availability falls into the category where you don't know what you want, but you know what you don't want. That is a bit trickier than 'know your wants' categories such as UI.
It requires a little experience, a lot of reading to learn from others' experience, and then some.
Building an available system from the ground up
Make sure you evangelize to every architecture and design team about giving availability the right priority as a system requirement.
Attributes of a system that help availability
The following system characteristics have been shown to contribute to system availability:
Redundancy
Some examples: never have only a single VM behind a VIP, and never store only a single copy of your data. A good IaaS will make these problems easier to solve, but you will still have to make the decisions.
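The value of redundancy shows up in the client as well: with more than one replica behind a name, a caller can fail over rather than fail. A minimal sketch, assuming replicas are represented as callables that raise `ConnectionError` when down (an illustrative simplification, not a real load balancer):

```python
import random

def fetch_with_failover(replicas, request):
    """Try redundant backends in random order; one healthy replica is enough."""
    last_error = None
    for backend in random.sample(replicas, len(replicas)):
        try:
            return backend(request)
        except ConnectionError as e:
            last_error = e  # this replica is down, try the next one
    raise RuntimeError("all replicas unavailable") from last_error
```

With a single VM there is nothing to fall back to; with two or more, a single failure becomes a retry instead of an outage.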
Modularity
A modular REST service is better than a monolithic SOA service, and a truly modular microservice is in practice more available than the usual HATEOAS REST setup. The reasoning can be found in the Yield discussion in the next section.
If you are doing batch processing, it is better to work in reasonable batches of tens than to deal with a single batch of 1,000,000.
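Small batches also contain failures: one bad record poisons only its own batch instead of the whole run. A minimal sketch of that idea (the helper names and batch size are mine, not from the original):

```python
def chunked(items, size=10):
    """Yield small batches so one bad record only fails its own batch."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def process_all(items, handler, size=10):
    """Process in small batches; collect failed batches for later retry."""
    failed = []
    for batch in chunked(items, size):
        try:
            handler(batch)
        except Exception:
            failed.append(batch)  # inspect or retry later; other batches succeeded
    return failed
```

With a batch of 1,000,000 the same bad record would have failed everything and forced a full rerun.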
Resiliency
"I am always angry"
- Hulk
A resilient system is always ready to recover. This applies to practices such as acknowledging a write only after it has been persisted to a RAID disk, and ideally across at least two data centers.
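The single-machine version of "acknowledge only after it is durable" is flushing and syncing before replying. A minimal sketch (the function and the "ACK" return value are illustrative; real systems would also replicate across data centers before acknowledging):

```python
import os

def durable_write(path, data):
    """Only report success after the bytes are on stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push from Python's buffers to the OS
        os.fsync(f.fileno())  # force the OS to commit to disk before we ACK
    return "ACK"
```

If the process crashes before the `fsync` completes, no acknowledgement was ever sent, so the caller knows to retry; the system never claims to hold data it could still lose.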
Another recent trend is to use conflict-free replicated data types (CRDTs), where the data structure itself assumes the responsibility of resolving conflicts when presented with two different versions.
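The classic introductory CRDT is the grow-only counter: each node increments only its own slot, and merging two versions is an element-wise maximum, so any two replicas converge no matter the order of merges. A minimal sketch (class and method names are mine):

```python
class GCounter:
    """Grow-only counter CRDT: per-node slots, merged by element-wise max."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Conflict resolution is built in: take the max of each node's slot."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Because merge is commutative, associative, and idempotent, replicas can exchange state in any order and still agree, which is exactly the property that removes the need for a coordinator.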
A system cannot be resilient as an afterthought; resilience has to be anticipated and built in. Failure is guaranteed over the long term, so we should always have a plan to recover.
Log trail
This is technically a subtype of resiliency, but a very special one because of its catch-all capabilities. Despite our best efforts, we may not be able to predict every pattern of unavailability. If possible, maintain enough of a log trail of system activities to be able to play back system events. This will, at great manual cost, allow you to recover from unforeseen situations.
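The simplest form of this is an append-only event log from which state can be rebuilt by replay. A minimal sketch using the shopping-cart example from the next section (the file format and event fields are my own):

```python
import json

def append_event(log_path, event):
    """Append-only log: one JSON event per line."""
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

def replay(log_path):
    """Rebuild state (here, a shopping cart) by replaying every event."""
    cart = {}
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event["op"] == "add":
                cart[event["item"]] = cart.get(event["item"], 0) + event["qty"]
            elif event["op"] == "remove":
                cart.pop(event["item"], None)
    return cart
```

Even if the live state is corrupted by a failure mode nobody predicted, the log lets you reconstruct it, slowly and manually if need be.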
Attributes of availability
A non-exhaustive, top-of-mind list of the attributes of 'availability':
For discussion's sake, let's assume the question the user asks is, "How many items do I have in my shopping cart?"
Correctness
Must you produce the most accurate possible answer, or is it OK to make mistakes? For reference, when you withdraw money from an ATM, the result is not guaranteed to be correct: if the bank finds out it made a mistake, it might ask you to reverse the transaction. If your system is producing prime numbers, though, I would guess you want the right answer every time.
Yield
Skip this point if you answered 'always correct' to the previous question.
Sometimes the answer to a question doesn't have to be precise, e.g. "How many friends do I have on Facebook right now?"
But the answer is expected to be in the ballpark, within +/-1, all the time. When you produce the expected result, your yield is 100%.
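In the harvest/yield vocabulary (Fox and Brewer), yield is the fraction of queries answered, and harvest is the fraction of the data reflected in each answer. A ballpark answer keeps yield at 100% by accepting reduced harvest. A minimal sketch, assuming a friend count sharded across nodes where `None` marks a shard that timed out (the function and representation are mine):

```python
def friend_count(shards):
    """Answer from whichever shards respond; report harvest with the estimate."""
    responding = [count for count in shards if count is not None]
    estimate = sum(responding)            # ballpark answer from partial data
    harvest = len(responding) / len(shards)
    return estimate, harvest
```

Refusing to answer until all shards respond would give perfect harvest but lower yield; answering from three of four shards gives an approximate count and 75% harvest.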
Consistency
Your answer may be correct at one point in time, but by the time the light has left the screen and reached the retina of the observer, things could have changed. Does that make your answer wrong? No, it just makes it inconsistent. Most applications are eventually consistent, but the trick is defining what kind of consistency model your application is going to provide.
If, by some off chance, your application can run on a single computer, you can skip this lovely reading on the CAP theorem.
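One concrete consistency policy worth naming when you define your model is last-write-wins: replicas converge on the value with the newest timestamp once they exchange state. A minimal sketch (the class is illustrative, and real systems must also worry about clock skew):

```python
class LWWRegister:
    """Last-write-wins register: a simple eventual-consistency policy.
    Replicas converge once they exchange states; the newest timestamp wins."""
    def __init__(self):
        self.value = None
        self.timestamp = 0.0

    def write(self, value, timestamp):
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

    def merge(self, other):
        """Merging is just a write with the other replica's state."""
        self.write(other.value, other.timestamp)
```

Between the write and the merge, two replicas can legitimately give different answers to "how many items are in my cart?"; the model only promises they agree eventually.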
Cost
A lot depends on the total impact of the short-term effects (loss of revenue) and the long-term effects (damaged reputation, lost customer retention). Depending on the customer type (paying/free, repeat/unique, captive) and resource availability, different levels of availability guarantees should be built in.
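When weighing cost against guarantees, it helps to translate an availability target into a downtime budget. The arithmetic is simple:

```python
def downtime_budget_hours(availability, period_hours=365 * 24):
    """Hours of allowed downtime per year for a given availability target."""
    return (1 - availability) * period_hours

# 99.9% ("three nines") allows about 8.76 hours of downtime per year;
# 99.99% allows about 0.88 hours, roughly 53 minutes.
```

Each extra nine cuts the budget by a factor of ten, and the engineering cost of meeting it tends to grow much faster than that, which is why the customer-type question above matters.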
Towards improving the availability of an existing system
Operational management of individual machines and networks is so complex that I assume you have either left it to the cloud provider or are already expert enough to know what you are doing. I will touch on the other topics under availability.
For the long-term strategy, Define-Measure-Analyze-Improve-Control (DMAIC) is a heavenly match, something I have seen work myself.
- Define what is 'availability' to your stakeholders
- How will you measure what you have defined
- Root cause analysis to identify bottlenecks
- Tasks for improvements
- Continuous monitoring (control) of the system
Causes of unavailability
Since we agreed that operational management, which covers physical infrastructure, ought to be done by professionals, I will touch on the other causes of unavailability for completeness' sake.
In my opinion, availability should also include a lack of expected behavior: if the user is not shown the expected experience, then something is unavailable. With that broad definition in mind, the following can cause unavailability:
- Code bugs
- Security incidents
- Performance issues