Understanding Maximum Tolerable Outage

Before we go any further, let’s agree on some terms:

  • Downtime means the time during a regular working period when an application is not actively productive. For example, an ecommerce application is considered “down” if end users are not able to complete a transaction (buy goods and/or services).
  • Uptime means the time during a regular working period when an application is actively productive.
  • Maximum Tolerable Outage (MTO) means the maximum amount of time the business can tolerate the application being down.

When I talk to prospects, most decision makers like to talk about uptime. Often I hear people say they want “100% uptime” or “five nines” (99.999%). For us, however, it is a lot more powerful to understand a prospect’s Maximum Tolerable Outage (MTO), and I will explain why.

What’s Your Maximum Tolerable Outage?
If the power fails in a hospital, it’s easy to imagine that some systems (such as life support) need it back on in a matter of seconds, and people could literally die if it isn’t! If you asked the person responsible for the life support systems “how much uptime does this system need to achieve?” they would probably say “it has to be 100% up”. This is the problem with talking about uptime: given the choice, people want their systems and applications up all the time.

If, however, you asked the same person “If you did lose power, how quickly do you need it back up?” they would give you a much more detailed answer, and they would also be able to explain why. When you are talking with prospects, understanding the MTO of an application is a key part of the sales cycle.

To make sure you know the MTO you need to ask:

  1. What does the application do (internal email, company website etc)?
  2. If the application was down, how quickly do you need it back up and running?
  3. What is the impact to you if the application is down (loss of revenue, loss of reputation etc.)?
  4. What would happen if the application was down for longer than X (where X is the answer to question 2)?

On the surface you may think just asking question 2 is enough, but I often find that a prospect’s initial reaction is not that well grounded. A little digging can help confirm that downtime causes actual business pain, and so that there is a well-grounded business case for investment in a hosting platform.
And don’t worry, hospitals have lots of generators and power lines to make sure they always have power!

When to Bring in an Architect

Sometimes I read an article that resonates with me as a Solutions Architect. A while ago I read the following article and it really clarified my thinking:

http://johncritchley.com/solution-architecture/architecture-on-purpose-the-3-stages-of-architecture-development

As an architect it is my job to have a broad range of knowledge across a range of technologies and concepts. The trade-off with having this broad knowledge is that I will never know as much about a single technology as a specialist in that area. For example, it is my responsibility to understand the difference between storage technologies (DAS, NAS, SAN etc.), disk configurations (different RAID types, disk performance etc.), and storage replication options, but I will never have the same depth of knowledge as an experienced Storage Engineer who has worked on EMC arrays for a few years. Often the Architect is mistaken for just a “jack of all trades” and so is not always seen as an integral part of the conceptual design. He or she may end up being told by management:

“We need to stand up a new social blog. Our Developers said we will use WordPress, and it needs to be up 99.999% of the time. Go talk to EMC because we’ll need a SAN, and we have budgeted £50,000 over 3 years to get this all done”

What I don’t like about this statement is that the solution has conceptually already been designed, but who has made these decisions? In reality decisions have probably been made by people who didn’t take an informed view of the options. To give some examples:
- The Developers decided to use WordPress. Was that the best tool for the job or was it the easiest for them?
- Someone decided that a SAN was needed. Has someone compared vendors and storage technologies to ensure EMC is the best vendor and SAN is the right technology?
- Management or the Board has decided the uptime goal for the project, but is the software and hardware up to the task?
- Finance has decided the budget. Was that based on a complete business case or a finger-in-the-air guesstimate?

The benefit of getting the Architect involved at this conceptual stage is knowing that, whilst the finer details need to be worked out, the overall design is well grounded both conceptually and technically. The danger of us not being involved early on? The jigsaw pieces don’t fit together properly, a whole bunch of workarounds and compromises are made, and ultimately it ends up being a hell of a lot more work (and therefore cost) than it should have been.

The NIST Definition of Cloud Computing

On a day-to-day basis I have a lot of discussions around “cloud”, and what people define as “cloud computing”. Handily, the National Institute of Standards and Technology (NIST) has already done this for us, and their definition is below. This definition can also be found at the link provided at the end of the article.

The NIST Definition of Cloud Computing
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.

Essential Characteristics:
- On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service’s provider.
- Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
- Resource pooling. The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.
- Rapid elasticity. Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
- Measured Service. Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported providing transparency for both the provider and consumer of the utilized service.

Service Models:
- Cloud Software as a Service (SaaS). The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based email). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
- Cloud Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
- Cloud Infrastructure as a Service (IaaS). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).

Deployment Models:
- Private cloud. The cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise.
- Community cloud. The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on premise or off premise.
- Public cloud. The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
- Hybrid cloud. The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).

I do largely agree with the definition; however, I also agree with their note that “Cloud computing is still an evolving paradigm. Its definitions, use cases, underlying technologies, issues, risks, and benefits will be refined in a spirited debate by the public and private sectors. These definitions, attributes, and characteristics will evolve and change over time.”

http://www.nist.gov/itl/cloud/upload/cloud-def-v15.pdf

Designing with Downtime, not Uptime, in Mind

Understanding how much downtime you can afford for your application is a fundamental necessity when looking at the hosting of your applications. Defining tolerable downtime objectives rather than concentrating on hitting uptime targets may seem like a strange approach; however, I have always believed that this is the correct approach during solution design.

To clarify, in this article downtime is defined as the time during a regular working period when an application is not actively productive. For example, an ecommerce application is considered “down” if end users are not able to complete a transaction and buy goods and services.

When hosting an application, most senior decision makers like to set key performance indicators around uptime. Often I hear people talk about “five nines” (99.999%) when trying to define how resilient an application’s infrastructure should be. For the Solutions Architect, however, uptime cannot be used directly as a way to design the right platform. The problem with uptime is that it is an output, not an input. To put it another way, if I want to know how much uptime my car will achieve this year I have to wait till the end of the year to work it out. I cannot tell you in January what my uptime will be, even with a crystal ball!

To prove the point, the way to calculate your uptime or “Nines” rating is as follows:

% Uptime = ((Time Period - Downtime) / Time Period) x 100
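As a rough illustration, the calculation can be sketched in a few lines of Python. The function names are my own, and the time period of a full 24x365 year is an assumption for the example; a business that only counts a regular working period would use a smaller figure.

```python
# Minimal sketch of the uptime formula. Names and figures are illustrative.

HOURS_PER_YEAR = 365 * 24  # assumed time period: a full, non-leap year


def uptime_percent(period_hours: float, downtime_hours: float) -> float:
    """% Uptime = ((Time Period - Downtime) / Time Period) x 100."""
    return (period_hours - downtime_hours) / period_hours * 100


def allowed_downtime_minutes(target_percent: float, period_hours: float) -> float:
    """Rearrange the formula: how much downtime does an uptime target leave?"""
    return period_hours * 60 * (1 - target_percent / 100)


# Uptime can only be worked out once the downtime for the period is known.
print(round(uptime_percent(HOURS_PER_YEAR, downtime_hours=4), 3))

# A "five nines" target leaves only a few minutes of downtime per year.
print(round(allowed_downtime_minutes(99.999, HOURS_PER_YEAR), 2))
```

Note that both functions need a downtime figure or a target as input, which is the point: uptime falls out of the calculation, it does not go in.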

The Solutions Architect must concentrate on the downtime input to the above calculation, as this is within the business’s power to control. To avoid accruing downtime, any business will need to understand (and plan for) the following:

  • Maximum Tolerable Outage (MTO) – The maximum amount of time the business can tolerate the application being down. In my opinion this is the single most important metric when looking at the hosting of your applications.
  • Risk – What is the likelihood of something failing, and the impact if it does?
  • Incident Detection Capability – How long does it take you to identify that an incident has occurred?
  • Troubleshooting and Invocation – How quickly can you understand the incident and invoke a recovery plan?
  • Recovery Time Objective (RTO) – How long will it take for an application to be running again?

I will be covering the above points in other articles, but for now the diagram below illustrates how the MTO and RTO fit into a business’s “road to recovery”:

Road to Recovery Diagram
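One way to read the road to recovery is as a simple time budget. The sketch below assumes (my assumption, not something taken from the diagram) that detection, troubleshooting and invocation, and recovery happen one after another, and that their total must fit inside the MTO. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical time budget for a single incident, in minutes."""

    detection_minutes: float        # time to notice the incident
    troubleshooting_minutes: float  # time to understand it and invoke a plan
    rto_minutes: float              # time to get the application running again

    def total_outage_minutes(self) -> float:
        # Assumes the three stages run strictly in sequence.
        return (
            self.detection_minutes
            + self.troubleshooting_minutes
            + self.rto_minutes
        )

    def fits_within(self, mto_minutes: float) -> bool:
        # The plan is acceptable only if the whole outage fits inside the MTO.
        return self.total_outage_minutes() <= mto_minutes


plan = RecoveryPlan(detection_minutes=10, troubleshooting_minutes=20, rto_minutes=60)
print(plan.total_outage_minutes())
print(plan.fits_within(mto_minutes=120))
print(plan.fits_within(mto_minutes=60))
```

Seen this way, an RTO that looks comfortable on its own can still break the MTO once detection and troubleshooting time are added.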

The Challenges of Long Term Data Archiving

In my day job I am constantly battling with prospects’ need to store large amounts of data for long periods of time. With growing compliance needs (SOX springs to mind) and a general perception that storage is almost free (thank you SATA 1TB+ drives), the need to store data for long periods of time is becoming much more common. Regardless of why data needs to be archived, keeping it safe and accessible can be very tricky. I have compiled a list below of some of the considerations any organisation should take into account when designing its data archiving policies:

Storage Medium
Firstly you need to think about your storage medium (i.e. what the data will be stored on). Conventional wisdom says “write it to tape”, but it’s important to understand how long storage mediums will last before committing. Typically you will have the following choices:
- Tape
- Disk (in a storage array)
- Disk (portable USB)
- CD
- DVD

Each of the mediums above has a finite shelf life. Tapes demagnetise over time (my crates of old ZX Spectrum games are a great example), and individual hard drives can fail without warning and cannot be trusted after 2 years or less. Keeping data on spinning disks in a larger array, such as an EMC Data Domain, will incur a higher cost.

Choose What to Archive
Long-term data archiving can be an expensive operational cost, so it is important to keep data sets to a minimum. A hosted application may run over a number of Operating Systems, so identify where the important data resides and farm it off to the long-term archive. Non-critical data will still need to be backed up to help with Disaster Recovery, but this only needs to be kept for a week or two.

Storage Format
Locking a tape away for up to 10 years is perfectly feasible, but when the time comes to get data off it, can you guarantee you will have a device that can read it? To put it another way, if you found a MiniDisc or Zip disk in a cupboard, what would you have to put it in?

It may seem crazy to think of CDs or DVDs being historical formats, but it is very difficult to predict what will stand the test of time. Picking mainstream formats is one approach, but also be mindful of what your vendor’s product timelines look like.

Data Format
Have you ever been in the situation where you cannot open a file? No matter what program you use it’s just not recognised, and it takes a good few Google searches to find the one bizarre application that uses it. When storing data for long periods of time, common file formats should be chosen, and if there are any doubts then the relevant software should also be archived.

Revisiting old archives
Archives should be revisited regularly (once every 12-24 months) and data integrity checked. During these checks data can be rotated on to new media. A process of monitoring storage and data formats can also be kicked off, to evaluate if they are becoming end of life.
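A data integrity check can be as simple as comparing checksums recorded when the archive was written against checksums recomputed at each revisit. The sketch below is one possible approach using Python’s standard hashlib module; the function names, the choice of SHA-256, and the manifest layout are assumptions for illustration, not a description of any particular archiving product.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large archive files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: Path) -> dict[str, str]:
    """Record a checksum for every file at the time the archive is written."""
    return {
        p.relative_to(archive_dir).as_posix(): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(archive_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing or no longer match their checksum."""
    failures = []
    for name, expected in manifest.items():
        path = archive_dir / name
        if not path.is_file() or sha256_of(path) != expected:
            failures.append(name)
    return failures


# Small demonstration using a temporary directory as a stand-in archive.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp)
    (archive / "report.txt").write_text("quarterly figures")
    manifest = build_manifest(archive)
    clean_check = verify_manifest(archive, manifest)

    (archive / "report.txt").write_text("corrupted")  # simulate media damage
    damaged_check = verify_manifest(archive, manifest)

print(clean_check)
print(damaged_check)
```

The manifest itself should be stored separately from the media it describes, otherwise damage to the media can take the checksums with it.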

Data Redundancy
Critical data should never reside on a single piece of media, hosted by a single company. Long-term archives should be made up of 2 or more copies of the same data sets to help mitigate the risks of floods, fires, and (dare I say) riots. At the same time, copies of critical data should not all be held by the same vendor. In these torrid financial times, what if the company you choose goes under halfway through the archiving period? It is easy to agree what will happen in contracts, but if the organisation you have contracted with no longer exists, what then?

Organising Data
As an archive grows to multiple media sets, and potentially many TBs of data, knowing which tape/drive/file has the data you want becomes more critical. In some situations you may be required to have locked bins shipped back to your data centre, and a mistake over which bin to call back could be costly. Ensure you are diligent with your documentation, and label tapes properly to avoid confusion.

Online (Spinning) or Offline (at Rest)
When building an archiving policy, use your Recovery Time Objective as a guide to deciding whether an archive should be online (readily accessible on an archiving server or an array such as Network Attached Storage), or offline (on tape or other removable media).

Holding archives online will incur a greater storage cost per GB; however, they will be a lot easier to access when you need them the most. Large volumes of data may be impractical to keep online (re-emphasising the need to archive critical data only).

Data written offline will incur a much lower storage cost per GB, and is a much more practical way of archiving large data volumes. In my experience most restore requests are for data that is less than 72 hours old, so daily tape rotations may lead to lengthy recovery times.
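As a minimal sketch of using the RTO as a guide, the rule below keeps an archive online only when offline media could not be retrieved in time. The function name and the idea of a single offline retrieval time figure are assumptions for illustration; a real policy would also weigh cost per GB and data volume.

```python
def choose_archive_tier(rto_hours: float, offline_retrieval_hours: float) -> str:
    """Pick a tier from the RTO alone (a deliberately simplified rule).

    offline_retrieval_hours is a hypothetical estimate of how long it takes
    to recall, load, and restore from tape or other removable media.
    """
    if offline_retrieval_hours <= rto_hours:
        return "offline"  # cheaper per GB and still recoverable within the RTO
    return "online"       # offline media would be too slow to meet the RTO


print(choose_archive_tier(rto_hours=4, offline_retrieval_hours=24))
print(choose_archive_tier(rto_hours=48, offline_retrieval_hours=24))
```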

Long Contracts
I regularly hear prospects asking for long-term archiving services (5 years or more), but then not being prepared to sign a contract of the same length. If you want a service provider to hold your data for many years, resign yourself to signing a contract that commits you both.

Encryption
Security frameworks often ask for data to be encrypted, including when written to archive media. If encryption is required, investigate the best way to encrypt the data (strength of encryption, known vulnerabilities etc.), at what level to encrypt it (application level, media level etc.), and how you plan on managing the keys. Getting hold of a 5 year old tape that you cannot decrypt because you lost the key will be a real pain, but writing data and the encryption key to the same media set may not be sensible.

Welcome to HeadintotheCloud.com

Hello everyone, and welcome to HeadintotheCloud.com!

HeadintotheCloud.com is a tech blog that aims to identify the challenges of moving applications and data into the cloud. The blog is written by me, and I work as a Solutions Architect for one of the world’s largest hosting companies.

I hope the site and its articles are useful to you, and please feel free to leave a comment with your ideas/feedback.