Responsibility of the Architect in Cloud DR Strategies

Any architect in the industry has seen their business – or one of their customers – sold on introducing a “cloud” service as the cure-all for what ails business IT. Watching the sales and marketing teams can feel like watching a magic hand-wave: outage times? Wave your hand and they no longer matter. Multi-data-center deployment? The cloud does it all. Infrastructure redundancy? The cloud has that too, and even makes French fries in small sizes!

[This post is the first of 3 parts. Part 1 (today) addresses the business context that drives cloud DR, Part 2 will discuss assessing the platform, and Part 3 concludes with a case study showing how we might proceed in a simple customer example.]

Each provider has a Disaster Recovery whitepaper, potentially published availability information, and purpose-built technologies to add “instant maturity” through the latest widget or acronym (at a nominal cost, of course). As architects, it can be easy to “buy in” to the platform! We have a solution that needs a certain level of availability, and a provider is committing to that SLA in published documentation, with no negotiation and no application modification required!

Too often, we find out later that our responsibility as architects cannot “end at the door” to the provider, and that there are provisos and implementation considerations we discover only after the fact, often after the provider falls down.

As with all of our architect responsibilities, the business context should drive our analysis.
Assessing cloud services for solution fit starts where any project analysis should: the business context. Does your set of stakeholders (and associated viewpoints) include the right people to give authoritative answers about the criticality of the application to the enterprise and the business value of availability (or the opposite, the damage of downtime)? What level of function is needed for the application or solution to be considered “up” in the first place? Do those values and answers change from one key stakeholder to another?

The architect needs to explore and understand the quality attributes for availability and responsiveness and, to a reasonable degree, how those two attributes intersect with usability. The constraints implied by cost, and potentially by maintainability needs (a lack of skills in supporting teams, for example), must also be drawn out to balance the investment and complexity that platform disaster recovery features can introduce.

Our customers’ (or our organizations’) applications drive the architect’s responsibilities in how they move to and operate in the cloud.
As we move from the business context, where the quality attributes and constraints are drawn out of the business intent, to the technical, we need to trace these requirements (the cost constraint in particular) through the decisions we make in identifying a provider and in choosing which of that provider's features to integrate into the solution.

Any managed service is defined by a contract with schedules and descriptions of what is “in the box” and what is excluded from “the box” for any given service component. These contracts often include very specific definitions of what the service levels are, how they are measured, and what penalties apply when the cloud provider misses its guarantee of service.
1) What are the Service Levels actually committed to? It is seldom as simple as everything being guaranteed to be available at a certain level.

Is availability tied to a certain function and not others?
Are any of the components between the customer/organization and the function (whether a VM, a platform service, or something else) guaranteed at a lower level than the function itself?

2) How are the Service Levels measured and reported?

What tools are in place to reliably measure the function of the service level?
Do you, as the customer or consumer of the service, have the ability to verify the service level, either through your own instrumentation or through the application itself?
Are the definitions of the reporting unduly to the service provider’s advantage? If someone offers X% availability but only calculates it over long periods of time, you will rarely, if ever, qualify for any recovery after an outage.

3) How does the SLA recovery compare with the cost of an outage in the first place? At what point do the losses from an outage of the application overtake the credit or defined recovery from an SLA breach?
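
To make the three questions above a little more concrete, here is a minimal Python sketch of the arithmetic behind them. Everything in it is an assumption for illustration: the function names are ours, the component availabilities, fee, credit percentage, and hourly loss are hypothetical figures, and the serial-chain model (every component must be up, with independent failures) is a simplification. Real contracts define their own calculations, so treat this as a way to frame the conversation rather than as any provider's formula.

```python
from math import prod


def composite_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes independent failures and a purely serial dependency.
    """
    return prod(components)


def downtime_budget_minutes(availability, window_days):
    """Minutes of downtime allowed in a measurement window before a breach."""
    return (1.0 - availability) * window_days * 24 * 60


def net_outage_cost(outage_hours, loss_per_hour, monthly_fee, credit_fraction):
    """Business loss from an outage minus the service credit received."""
    return outage_hours * loss_per_hour - monthly_fee * credit_fraction


# Hypothetical figures only.
# Question 1: components in front of the function may carry lower guarantees.
path = [0.9995, 0.999, 0.9999]  # e.g. network edge, VM, platform service
end_to_end = composite_availability(path)
print(f"End-to-end availability: {end_to_end:.4%}")

# Question 2: the same percentage allows far more downtime in a longer window.
monthly_budget = downtime_budget_minutes(0.999, 30)
yearly_budget = downtime_budget_minutes(0.999, 365)
print(f"Allowed downtime per 30 days:  {monthly_budget:.1f} minutes")
print(f"Allowed downtime per 365 days: {yearly_budget:.1f} minutes")

# Question 3: compare the credit with what the outage actually costs.
shortfall = net_outage_cost(
    outage_hours=4, loss_per_hour=10_000, monthly_fee=5_000, credit_fraction=0.10
)
print(f"Loss not covered by the SLA credit: {shortfall:,.0f}")
```

With numbers like these, the end-to-end figure comes out lower than any single component's guarantee, a single long outage can fit inside a yearly measurement window without triggering a breach, and the credit covers only a small fraction of the loss. Substitute your own stakeholders' figures before drawing conclusions.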

Remember that Availability is not the only service level to be concerned with! Data recovery, support time frames, responsiveness to “critical” rated incidents, and other concerns can be critical to making the right choice of service provider. A myopic focus on Availability alone can cause us to miss the broader context of how this provider will support our customer or our organization!

To be continued in part two, assessing the cloud platform.

Wayne Anderson (@NoCo_Architect) is an Infrastructure Managed Services Architect with Avanade, a company that helps customers realize results in a digital world through business technology solutions and managed services that combine insight, innovation and expertise focused on Microsoft® technologies. He has completed more than 30 Microsoft certifications in his career alongside credentials from CompTIA and other industry vendors. Mr. Anderson’s past roles include management of global certification with Avanade, as well as focus in information security and architecture.