There has been a recent focus on Five Nines. From the top, the term is thrown around as if achieving it were as easy as buttering bread. The reality is that many factors contribute to Five Nines, and it's not just money. Heck, your business probably doesn't need it either.
Let us look at the chart below to gain some context on what Five Nines really means.
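| Uptime (%) | Downtime per year (days hh:mm:ss) | Downtime per month (days hh:mm:ss) | Downtime per day (hh:mm:ss) |
|---|---|---|---|
| 99.9990 | 0 00:05:15 | 0 00:00:27 | 00:00:01 |
| 99.9900 | 0 00:52:34 | 0 00:04:28 | 00:00:09 |
| 99.9000 | 0 08:45:36 | 0 00:44:38 | 00:01:26 |
| 99.0000 | 3 15:36:00 | 0 07:26:24 | 00:14:24 |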
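If you want to check these figures or work out a budget for a different target, the arithmetic is simple: multiply the length of the period by the fraction of time you are allowed to be down. Here is a minimal sketch in Python. The function name `allowed_downtime` is my own, and I am assuming a 365-day year, a 31-day month, and rounding to the nearest second, which is what the figures in the chart appear to use.

```python
from datetime import timedelta


def allowed_downtime(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the downtime budget for a period at the given uptime percentage."""
    unavailable_fraction = 1 - uptime_percent / 100
    seconds = round(period.total_seconds() * unavailable_fraction)
    return timedelta(seconds=seconds)


# Assumed period lengths (not stated in the chart itself).
YEAR = timedelta(days=365)
MONTH = timedelta(days=31)
DAY = timedelta(days=1)

if __name__ == "__main__":
    for uptime in (99.999, 99.99, 99.9, 99.0):
        print(
            uptime,
            allowed_downtime(uptime, YEAR),
            allowed_downtime(uptime, MONTH),
            allowed_downtime(uptime, DAY),
            sep=" | ",
        )
```

Swap in your own target, say 99.95, to see what you would actually be signing up for.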
Five Nines allows 5 minutes and 15 seconds of downtime per year. That puts into perspective how little time an outage budget really gives you. The issue I have is that people fearmonger and say you need Five Nines or you may as well not bother. That is not true. Evaluating your requirements is often forgotten, so I am here to remind you to think of YOUR requirements. Here are my main points:
- Well defined HA goal
- Hardware Redundancy
- Staff redundancy/Staff skill sets
- Thorough documentation
The dot points here can be broken down further.
Well defined HA goal
This needs to be something that is achievable for the business; it must be cost viable. If you are a 9-5 retailer then 24/7/365 isn't important, as the spend isn't justifiable. Sure, you need to ensure uptime during the day, but you accommodate accordingly. The cost of HA versus the diminishing return on your spend needs to be considered. Plan what level of redundancy you need, then drill down. Most businesses are not Google, Facebook, or Amazon. They do not require huge spend on HA. A small office may have dual ISP redundancy and a UPS for core devices/servers that exists only to give enough time to run a shutdown procedure. This should be outlined when investing in technology, and the risk of losing power completely during an extended outage should be added to a risk register. A larger office that produces web content 12 hours a day and has many remote workers is a different case. You must not forget what you do and how your business makes money, and you must ensure your spend is proportional with these things in mind.
Hardware Redundancy
What hardware must run for the business to continue? Do you have failover technologies in place? Do you have spare equipment on site? What are the SLAs like? Does your supplier actually care if the replacement gets there on time? An abatement may not cost them much, but a late delivery can do irreparable damage to your brand. Building redundancy needs to be considered too. We all rely on power. Dual feeds from different substations, UPS, power board expansion without downtime, and generators are just some of the things to think of when it comes to power alone.
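To put rough numbers on redundancy, the usual back-of-the-envelope model is that components you depend on in a chain multiply their availabilities together, while redundant copies only fail when all of them fail at once. Here is a minimal sketch, assuming failures are independent (in real life they often are not, for example two servers on the same power feed). The function names and the example figures are hypothetical and purely for illustration.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def parallel(availability: float, copies: int) -> float:
    """Availability of redundant copies where any single one is enough.

    Assumes the copies fail independently of each other.
    """
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    # Hypothetical figures, not measurements.
    single_server = 0.99
    power_feed = 0.999

    server_pair = parallel(single_server, 2)
    whole_stack = serial(server_pair, power_feed)

    print(f"one server:          {single_server:.4%}")
    print(f"two servers:         {server_pair:.4%}")
    print(f"two servers + power: {whole_stack:.4%}")
```

The point to take away is that the chain is never better than its weakest link: doubling up the servers does nothing for you if the single power feed behind them is the thing that goes down.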
Staff Redundancy and Skill sets/Thorough Documentation
This is something that is often overlooked. Even the best trained people make mistakes. To err is human. Understanding the implications of your actions is important. A thorough understanding of the technologies in place, the topology, the traffic flows, and the application hierarchy all adds to this. Also, if a staff member leaves, do you find yourself with a knowledge gap? Documentation and procedure help mitigate the risk of losing knowledge. They will also allow someone new to find their feet. Share the knowledge around. Allow everyone to know what is going on. Do not build silos; that, though, is a blog post in itself.
Machines are governed by the laws of science and math. Contrary to popular belief, they don't have emotions or moods. They generally do not make mistakes. Mistakes are made by humans, and a silly fat finger can cost millions. Ensure that there are processes in place to avoid outages: peer review, discussion and teamwork. A second pair of eyes on something is beneficial to you and your team. You may learn something too. Everyone has a point of view.
Deep breath
Your manager or CTO may not want to hear this, but does your enterprise really need 24/7/365? Sure, they may want to feel important enough to think they do, but does the end justify the means? Does that last 0.1 percent justify the spend? Does the last 2 percent even justify the spend? Dual SUP, Dual Chassis or just Dual Chassis? Analyse your business requirements and determine where the risk lies. I don't propose not thinking about HA because that is silly. I do propose thinking about diminishing returns. Don't be a muppet and run around wanting Five Nines just because.
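One way to make the diminishing returns argument concrete is to compare what each extra nine saves you in downtime against what it costs to get there. Here is a rough sketch. Every dollar figure in it is made up for illustration, so plug in your own cost of downtime and your own quotes. The names `downtime_hours` and `marginal_value` are my own.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours(uptime_percent: float) -> float:
    """Expected hours of downtime per year at a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def marginal_value(lower: float, higher: float, cost_per_hour: float) -> float:
    """Downtime cost avoided per year by moving from lower to higher uptime."""
    return (downtime_hours(lower) - downtime_hours(higher)) * cost_per_hour


# Hypothetical figures: what an hour of outage costs the business, and
# what each uptime target costs per year to build and run.
COST_PER_HOUR = 10_000
TIERS = [(99.0, 0), (99.9, 50_000), (99.99, 250_000), (99.999, 1_500_000)]

if __name__ == "__main__":
    for (low, low_cost), (high, high_cost) in zip(TIERS, TIERS[1:]):
        saved = marginal_value(low, high, COST_PER_HOUR)
        extra = high_cost - low_cost
        verdict = "worth it" if saved > extra else "not worth it"
        print(f"{low}% -> {high}%: avoids {saved:,.0f}, costs {extra:,} more ({verdict})")
```

Whatever numbers you use, each additional nine only removes a tenth of the downtime the previous one did, while the spend to get there tends to go the other way.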
I think it is also worth mentioning that not even a Tier-4 data center can provide five nines. If a Tier-4 cannot provide five nines then nothing can. Right? So the infrastructure team (servers/virtualization, network) can try to work some kind of magic with dual power, dual circuits, dual servers, but they're still at the mercy of the data center.
And if you're at a campus, forget about it. How many campuses out there have 96 hours' worth of fuel for their generators or a contract with a fuel company to deliver during a 100-year event?
Absolutely. A 100-year-long outage would be an extinction-level event. I would have other things to worry about, like getting to my family and surviving.
I suppose it just comes down to SLAs and the rest. AWS is a great example that even one of the biggest providers can still make mistakes. Redundancy comes down to weighing the importance of uptime against the diminishing returns on spend.