TL;DR: Build redundant, modular; test for availability; monitor closely.
Any thorough explanation would run very long, so instead I will write down the observations I have made.
Questioning the premise
"The cloud is a panacea"
Even if you go fully to the cloud, with a top cloud provider, you will still need to design your application for resilience from the ground up. AWS might replace your VM, but your application should be capable of restarting if left in the middle of a computation.
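One way to make "restart in the middle of a computation" concrete is checkpointing. A minimal sketch, assuming a hypothetical `progress.json` checkpoint file and a stand-in workload (the file name and the doubling "work" are illustrative, not from the original):

```python
import json
import os

CHECKPOINT = "progress.json"  # hypothetical checkpoint file

def process_items(items):
    """Resume a long computation from the last checkpoint after a restart."""
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            start = json.load(f)["next_index"]
    results = []
    for i in range(start, len(items)):
        results.append(items[i] * 2)  # stand-in for the real work
        with open(CHECKPOINT, "w") as f:
            json.dump({"next_index": i + 1}, f)  # persist progress after each item
    return results
```

Checkpointing after every item is deliberately naive; a real system would batch checkpoints, but the idea is the same: if the VM is replaced mid-run, the next run picks up where the last one stopped.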
"We don't want to use the cloud, because of x/y/z"
Unless you are an ultra-large organization, you are better off using cloud systems. The top three cloud providers (AWS, Microsoft, Google) employ thousands of engineers to deliver the promised SLAs and easy-to-manage dashboards. It's actually a good bargain to use them rather than spend a dime building this in-house.
Problems in scoping and design
Defining, quantifying, and then continuously measuring the availability of a service is a bigger challenge than writing solutions for availability issues.
Defining and measuring 'availability' is harder than expected
Multiple stakeholders have different views of availability, and what often happens is that the definition preferred by the highest-paid person trumps the others. Sometimes that is the correct definition, but often the ecosystem is not built around measuring it, because the ideal definition is tricky to measure, let alone monitor in real time. If you have a definition of availability that can't be monitored in real time, you will find yourself doing similar projects again and again with eerie similarities.
Stick with something that makes sense and something that can be easily monitored.
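A measurable definition can be as simple as "fraction of requests that succeeded in the last N minutes." A minimal sketch of a monitor for that definition (the class name and window size are my own, not from the original):

```python
import time
from collections import deque

class AvailabilityMonitor:
    """Availability = successful requests / total requests in a sliding window."""
    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.events = deque()  # (timestamp, success) pairs

    def record(self, success, now=None):
        now = time.time() if now is None else now
        self.events.append((now, success))

    def availability(self, now=None):
        now = time.time() if now is None else now
        # Drop events that fell out of the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        if not self.events:
            return 1.0  # no traffic: treat as available
        ok = sum(1 for _, success in self.events if success)
        return ok / len(self.events)
```

The point is not the exact formula but that it can be computed continuously from data you already have (request logs), so the definition and the monitoring match.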
People underestimate the complexities of an always-available system.
To address the elephant in the room, let me say this: no multi-computer system is 100% available. It may be someday, but not with current technology.
By current technology, I am referring to our inability to send signals faster than the speed of light, and similar constraints.
All computer science engineers worth their salt know the limitations of distributed computing, yet most will not mention them in meetings, afraid they will look like noobs.
To make up for all those who stay silent, I will say it: it's complicated, so don't always trust computers.
People overestimate their/their engineer's capabilities
Unfortunately, availability falls into the category where you don't know what you want, but you know what you don't want. That is a bit trickier than 'know your wants' categories such as UI.
It requires a little experience, a lot of reading to learn from others' experience, and then some.
Building an available system from the ground up
Make sure you evangelize to every architecture and design team about giving availability the right priority as a system requirement.
Attributes of a system that help availability
The following system characteristics have been shown to contribute to system availability:
Redundancy
Some examples: never have only a single VM behind a VIP, and never store only a single copy of your data. A good IaaS will make these problems easier to solve, but you will still have to make the decisions.
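The value of redundancy shows up in the client as well: with more than one replica behind a name, a caller can fail over rather than fail. A minimal sketch, assuming replicas are represented as callables that raise `ConnectionError` when down (an illustrative simplification, not a real load balancer):

```python
import random

def fetch_with_failover(replicas, request):
    """Try redundant backends in random order; one healthy replica is enough."""
    last_error = None
    for backend in random.sample(replicas, len(replicas)):
        try:
            return backend(request)
        except ConnectionError as e:
            last_error = e  # this replica is down, try the next one
    raise RuntimeError("all replicas unavailable") from last_error
```

With a single VM there is nothing to fall back to; with two or more, a single failure becomes a retry instead of an outage.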
Modularity
A modular REST service is better than a monolithic SOA service, and a truly modular microservice is in practice more available than the usual HATEOAS REST setup. The reasoning can be found in the Yield discussion in the next section.
If you are doing batch processing, it is better to work in reasonable batches of tens than to deal with a single batch of 1,000,000.
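Small batches also contain failures: one bad record poisons only its own batch instead of the whole run. A minimal sketch of that idea (the helper names and batch size are mine, not from the original):

```python
def chunked(items, size=10):
    """Yield small batches so one bad record only fails its own batch."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def process_all(items, handler, size=10):
    """Process in small batches; collect failed batches for later retry."""
    failed = []
    for batch in chunked(items, size):
        try:
            handler(batch)
        except Exception:
            failed.append(batch)  # inspect or retry later; other batches succeeded
    return failed
```

With a batch of 1,000,000 the same bad record would have failed everything and forced a full rerun.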
Resiliency
"I am always angry"
- Hulk
A resilient system is always ready to recover. This applies to practices such as acknowledging a write only after it has been persisted to a RAID disk, and ideally across at least two data centers.
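The single-machine version of "acknowledge only after it is durable" is flushing and syncing before replying. A minimal sketch (the function and the "ACK" return value are illustrative; real systems would also replicate across data centers before acknowledging):

```python
import os

def durable_write(path, data):
    """Only report success after the bytes are on stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push from Python's buffers to the OS
        os.fsync(f.fileno())  # force the OS to commit to disk before we ACK
    return "ACK"
```

If the process crashes before the `fsync` completes, no acknowledgement was ever sent, so the caller knows to retry; the system never claims to hold data it could still lose.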
Another recent trend is to use conflict-free replicated data types (CRDTs), where the data structure itself assumes the responsibility of resolving conflicts when presented with two different versions.
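The classic introductory CRDT is the grow-only counter: each node increments only its own slot, and merging two versions is an element-wise maximum, so any two replicas converge no matter the order of merges. A minimal sketch (class and method names are mine):

```python
class GCounter:
    """Grow-only counter CRDT: per-node slots, merged by element-wise max."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Conflict resolution is built in: take the max of each node's slot."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Because merge is commutative, associative, and idempotent, replicas can exchange state in any order and still agree, which is exactly the property that removes the need for a coordinator.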
A system cannot be resilient as an afterthought; resilience has to be anticipated and built in. Failure is guaranteed over the long term, so we should always have a plan to recover.
Log trail
This is technically a subtype of resiliency, but a very special one because of its catch-all capabilities. Despite our best efforts, we may not be able to predict every pattern of unavailability. If possible, maintain enough of a log trail of system activities to be able to play back system events. This will, at great manual cost, allow you to recover from unforeseen situations.
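The simplest form of this is an append-only event log from which state can be rebuilt by replay. A minimal sketch using the shopping-cart example from the next section (the file format and event fields are my own):

```python
import json

def append_event(log_path, event):
    """Append-only log: one JSON event per line."""
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

def replay(log_path):
    """Rebuild state (here, a shopping cart) by replaying every event."""
    cart = {}
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event["op"] == "add":
                cart[event["item"]] = cart.get(event["item"], 0) + event["qty"]
            elif event["op"] == "remove":
                cart.pop(event["item"], None)
    return cart
```

Even if the live state is corrupted by a failure mode nobody predicted, the log lets you reconstruct it, slowly and manually if need be.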
Attributes of availability
A non-exhaustive, top-of-mind list of the attributes of 'availability':
For discussion's sake, let's assume the question the user asks is, "How many items do I have in my shopping cart?"
Correctness
Must you produce the most accurate possible answer, or is it OK to make mistakes? For reference, when you withdraw money from an ATM, the result is not guaranteed to be correct: if the bank finds out it made a mistake, it might ask you to reverse the transaction. If your system is producing prime numbers, though, I would guess you want the right answer every time.
Yield
Skip this point if you answered 'always correct' to the previous question.
Sometimes the answer to a question doesn't have to be precise, e.g. "How many friends do I have on Facebook right now?"
But the answer is expected to be in the ballpark, within +/-1, all the time. When you produce the expected result, your yield is 100%.
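In the harvest/yield vocabulary (Fox and Brewer), yield is the fraction of queries answered, and harvest is the fraction of the data reflected in each answer. A ballpark answer keeps yield at 100% by accepting reduced harvest. A minimal sketch, assuming a friend count sharded across nodes where `None` marks a shard that timed out (the function and representation are mine):

```python
def friend_count(shards):
    """Answer from whichever shards respond; report harvest with the estimate."""
    responding = [count for count in shards if count is not None]
    estimate = sum(responding)            # ballpark answer from partial data
    harvest = len(responding) / len(shards)
    return estimate, harvest
```

Refusing to answer until all shards respond would give perfect harvest but lower yield; answering from three of four shards gives an approximate count and 75% harvest.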
Consistency
Your answer may be correct at one point in time, but by the time the light has left the screen and reached the retina of the observer, things could have changed. Does that make your answer wrong? No, it just makes it inconsistent. Most applications are eventually consistent, but the trick is defining what kind of consistency model your application is going to provide.
If, by some off chance, your application can run on a single computer, you can skip this lovely reading on the CAP theorem.
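One concrete consistency policy worth naming when you define your model is last-write-wins: replicas converge on the value with the newest timestamp once they exchange state. A minimal sketch (the class is illustrative, and real systems must also worry about clock skew):

```python
class LWWRegister:
    """Last-write-wins register: a simple eventual-consistency policy.
    Replicas converge once they exchange states; the newest timestamp wins."""
    def __init__(self):
        self.value = None
        self.timestamp = 0.0

    def write(self, value, timestamp):
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

    def merge(self, other):
        """Merging is just a write with the other replica's state."""
        self.write(other.value, other.timestamp)
```

Between the write and the merge, two replicas can legitimately give different answers to "how many items are in my cart?"; the model only promises they agree eventually.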
Cost
A lot depends on the total impact of the short-term effects (loss of revenue) and the long-term effects (damaged reputation, lost customer retention). Depending on the customer type (paying/free, repeat/unique, captive) and resource availability, different levels of availability guarantees should be built in.
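When weighing cost against guarantees, it helps to translate an availability target into a downtime budget. The arithmetic is simple:

```python
def downtime_budget_hours(availability, period_hours=365 * 24):
    """Hours of allowed downtime per year for a given availability target."""
    return (1 - availability) * period_hours

# 99.9% ("three nines") allows about 8.76 hours of downtime per year;
# 99.99% allows about 0.88 hours, roughly 53 minutes.
```

Each extra nine cuts the budget by a factor of ten, and the engineering cost of meeting it tends to grow much faster than that, which is why the customer-type question above matters.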
Towards improving the availability of an existing system
Operational management of individual machines and networks is so complex that I assume you have either left it to the cloud provider or are already expert enough to know what you are doing. I will touch on the other topics under availability.
For the long-term strategy, Define-Measure-Analyze-Improve-Control (DMAIC) is a heavenly match, something I have seen work myself.
- Define what is 'availability' to your stakeholders
- How will you measure what you have defined
- Root cause analysis to identify bottlenecks
- Tasks for improvements
- Continuous monitoring (control) of the system
Causes of unavailability
Since we agreed that operational management, which covers physical infrastructure, ought to be done by professionals, I will touch on the other causes of unavailability for completeness' sake.
In my opinion, availability should also include a lack of expected behavior: if the user is not shown the expected experience, then something is unavailable. With that broad definition in mind, the following can cause unavailability:
- Code bugs
- Security incidents
- Performance issues