No one wants to go to their favorite website or app and find out that it's unavailable, and business owners who rely on your app don't want to discover that they can't serve their own customers because your service is down. As a product manager or software developer, it's your job to ensure that users' expectations are met. But you may be asking: how much availability is enough? It starts with asking the right questions.
First, it's important to understand what availability is and what it isn't. Unless server hosting is your business, availability has less to do with whether your servers are running and more to do with the actual services you provide.
What we're ultimately looking at is access to your services, and it's up to you to determine what access means. If, for example, your users are able to log in to their dashboard but can't do anything once they get there, you would probably say that your services are unavailable even though the login service works.
On the other hand, if the user can log in and do everything they normally can except for something small, like changing their email address, then you probably wouldn't count that as downtime.
Typically you hear availability discussed as a number of "nines". You may see that a company promises its customers 99.9% availability. Sometimes this is also called "uptime". When a company promises 99.9% availability, it is guaranteeing that its services will be available for all but 8.76 hours per year (0.1% of the 8,760 hours in a year). This may be referred to as "three nines" of availability.
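To make the "nines" concrete, here is a minimal Python sketch that converts an availability percentage into the downtime it allows per year. The function name `allowed_downtime_hours` is just an illustrative choice, and the year is assumed to be 365 days long.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a service may be down at this availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Print the yearly downtime allowance for a few common targets.
for label, percent in [("two nines", 99.0), ("three nines", 99.9), ("four nines", 99.99)]:
    hours = allowed_downtime_hours(percent)
    print(f"{label} ({percent}%): {hours:.2f} hours per year")
```

Each extra nine cuts the allowance by a factor of ten, which is why the jump from three nines to four is much harder than it sounds.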
To help you determine what should count as downtime, you can ask yourself:
- Will the customer lose money because of this outage?
- Will my business lose money because of this outage?
- Will this impact the customer's productivity?
- Will this harm the customer's reputation?
In order to determine your availability, you need to first calculate your downtime, which can be divided into two categories: scheduled and unscheduled.
total_downtime = scheduled_downtime + unscheduled_downtime
And then to get availability:
availability = (1 - (total_downtime / (365 * 24))) * 100
Or, equivalently (where uptime is 365 * 24 - total_downtime):
availability = (uptime / (uptime + total_downtime)) * 100
(Total downtime is in hours)
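If you'd rather let a script do the arithmetic, the formula could be sketched in Python like this. The function and parameter names are made up for illustration, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_percent(scheduled_hours: float, unscheduled_hours: float) -> float:
    """Return yearly availability as a percentage, given downtime in hours."""
    total_downtime = scheduled_hours + unscheduled_hours
    if not 0 <= total_downtime <= HOURS_PER_YEAR:
        raise ValueError("total downtime must be between 0 and 8760 hours")
    return (1 - total_downtime / HOURS_PER_YEAR) * 100


# Example: 4 hours of planned maintenance plus 4.76 hours of outages.
print(f"{availability_percent(4, 4.76):.1f}%")
```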
One way to determine your scheduled downtime is to pick a time during the week when you think an outage would have the least impact. Next, estimate how long it would take you to perform an update. Finally, consider how long you could go without making a big update, whether that's a week, two weeks or longer.
You could also express this as a number of updates per year multiplied by the average time per update.
scheduled_downtime = number_of_updates * hours_per_update
To determine your unscheduled downtime, you might just double your scheduled downtime or multiply it by some other factor.
unscheduled_downtime = scheduled_downtime * unscheduled_downtime_factor
To give an example, let's say you expect to have little to no activity on a typical Sunday morning between 9am and 12pm, so you make that your maintenance window. You reserve that timeslot every Sunday in a given year for a total of 156 hours (3 hours x 52 weeks), but you only expect to use half of that on average. Your total scheduled downtime is 78 hours per year.
Then, to be on the safe side, you plan for an equal amount of unscheduled downtime. That brings your total downtime to 156 hours, which gives you an availability of 98.2%.
(8760 - 156) / 8760 * 100 = 98.2
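The same estimate can be scripted so you can try different maintenance windows. This is only a sketch: `estimate_availability` and its parameters are hypothetical names, and it applies the (1 - total_downtime / 8760) * 100 formula to a 52-week year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years
WEEKS_PER_YEAR = 52


def estimate_availability(window_hours: float, expected_usage: float,
                          unscheduled_factor: float) -> tuple[float, float]:
    """Return (total downtime in hours, availability in percent) for one year.

    window_hours: length of the weekly maintenance window
    expected_usage: fraction of the reserved window you expect to use (0 to 1)
    unscheduled_factor: unscheduled downtime as a multiple of scheduled downtime
    """
    reserved = window_hours * WEEKS_PER_YEAR
    scheduled = reserved * expected_usage
    unscheduled = scheduled * unscheduled_factor
    total = scheduled + unscheduled
    return total, (1 - total / HOURS_PER_YEAR) * 100


# A 3-hour Sunday window, half used, with equal unscheduled downtime.
total, availability = estimate_availability(3, 0.5, 1.0)
print(f"{total:.0f} hours of downtime, {availability:.1f}% availability")
```

Changing the window length or the unscheduled factor shows quickly how sensitive the final number is to each assumption.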
Consider another scenario where you need a minimum availability of 99.9%. How much downtime could you have each week if you expect equal amounts of scheduled and unscheduled downtime?
To meet this requirement, you would be limited to 8.76 hours of downtime each year, or about 10 minutes per week. If half of that goes to scheduled downtime, then you only have 5 minutes each week to perform your update. If your update requires more than 5 minutes, then you might want to do it every other week and have 10 minutes, or once a month and have about 22 minutes.
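Working backwards from a target is easy to get wrong by hand, so here is a small sketch of the calculation. The name `scheduled_minutes_per_update` and the default 50/50 split between scheduled and unscheduled downtime are assumptions taken from this example, not fixed rules.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def scheduled_minutes_per_update(target_percent: float, updates_per_year: int,
                                 scheduled_share: float = 0.5) -> float:
    """Return the minutes available for each update under a target availability.

    scheduled_share is the fraction of the downtime budget reserved for
    planned maintenance; the rest is left for unscheduled outages.
    """
    budget_minutes = HOURS_PER_YEAR * 60 * (1 - target_percent / 100)
    return budget_minutes * scheduled_share / updates_per_year


# Compare update cadences for a 99.9% target.
for label, updates in [("weekly", 52), ("every other week", 26), ("monthly", 12)]:
    minutes = scheduled_minutes_per_update(99.9, updates)
    print(f"{label}: {minutes:.1f} minutes per update")
```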
Every system and codebase is unique, so to make sure you have a good estimate of your downtime, try asking yourself these questions:
- How much time do I need to update and reboot my servers?
- How long would it take for me to recover from an unresponsive web server?
- How can I alter my deployment process to minimize/eliminate downtime?
If you've run the numbers and you're not satisfied with your availability, consider changing your processes so that you don't need as much downtime.
To give yourself a little extra wiggle room, consider adding a buffer on top of whatever number you came up with for your availability. One way to do this is to simply round down to the next nearest 9 or 5.
If your availability is at 99.8%, consider dropping that to 99.5%. If you're at 99.95%, consider dropping that to 99.9%.
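One way to automate this rounding is to keep a ladder of common targets and pick the highest one strictly below your estimate. The ladder in `TIERS` is an assumed list of "9 or 5" values chosen for illustration; adjust it to whatever targets make sense for you.

```python
# Common availability targets ending in 9 or 5 (an illustrative list).
TIERS = (90.0, 95.0, 99.0, 99.5, 99.9, 99.95, 99.99, 99.995, 99.999)


def buffered_availability(estimate_percent: float) -> float:
    """Return the highest tier strictly below the estimated availability."""
    lower_tiers = [tier for tier in TIERS if tier < estimate_percent]
    if not lower_tiers:
        raise ValueError("estimate is at or below the lowest tier")
    return max(lower_tiers)


print(buffered_availability(99.8))   # 99.5
print(buffered_availability(99.95))  # 99.9
```

Using a strict comparison means an estimate that already sits exactly on a tier still drops one step, which is what gives you the buffer.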
Once you know what you can reasonably offer for availability, you should consider whether it is even necessary to promise your customers anything.
Try asking yourself:
- Can my customers continue to conduct business without my service?
- Will my customers lose a lot of money if my service is offline for more than a couple of hours per week?
- Can I afford to credit or refund my customer for longer outages?
If you do promise a certain amount of availability to your customers, you may do so through a Service Level Agreement, or SLA. In doing this, you may also include a penalty for not meeting this promise.
Hosting companies typically offer their customers an SLA. For example, Amazon Web Services promises its compute customers "Monthly Uptime Percentage of at least 99.99%" (https://aws.amazon.com/compute/sla/). If availability drops below 99.99% in a given month, Amazon gives its customers a 10% service credit, and if availability drops below 95% the customer is given a 100% credit.
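If you decide to attach credits to your own SLA, the lookup could be sketched as below. `EXAMPLE_TIERS` holds only the two thresholds quoted above for illustration; it is not a complete or current copy of any provider's credit schedule, and the function name is hypothetical.

```python
# (uptime threshold in percent, credit in percent): the credit applies when
# monthly uptime falls below the threshold. Illustrative values only.
EXAMPLE_TIERS = [(99.99, 10), (95.0, 100)]


def service_credit_percent(monthly_uptime_percent: float,
                           tiers: list[tuple[float, int]]) -> int:
    """Return the largest credit whose uptime threshold was missed."""
    credit = 0
    for threshold, tier_credit in tiers:
        if monthly_uptime_percent < threshold:
            credit = max(credit, tier_credit)
    return credit


for uptime in (99.995, 99.5, 94.0):
    print(f"{uptime}% uptime: {service_credit_percent(uptime, EXAMPLE_TIERS)}% credit")
```

Keeping the tiers in a plain list makes it easy to model a few different penalty schedules and see what each would cost you in a bad month.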
If you're just getting started with your first customer, you may not want to promise anything, especially if the customer doesn't ask. But even if you don't promise anything, it's good to have a number to aim for.