The Architect in Cloud DR Strategies

The Architect in Cloud DR Strategies - Application

To explore the responsibilities of the architect deploying to the cloud for DR and availability in more detail, let’s introduce a notional “customer” with one of the simplest application scenarios. Many of us deal with far more complicated scenarios every day, but as with many of our strategic roles, we start with (and master) the simple in order to deliver the more complex successfully!

[This post is the third of 3 parts. Part 1 addresses the business context that drives cloud DR, Part 2 discusses assessing the platform, and Part 3 (today) concludes with a case study of how we might proceed in a customer example.]

Contoso Widget Systems is a recent spin-off of Contoso, Ltd. It has a simple “flat” web application that accepts service requests through an online request form and stores them in a database that other applications access, including a CRM system that handles service request assignment. Without this application functioning, an entire division of the business can only operate for a limited period of time before the service trucks stop having new places to go and customers’ expectations go unmet, possibly even triggering contract penalties for late service on past sales.

In our notional “Contoso” example, an architect would work with the technical leadership to understand the application’s historical cost baseline as a starting point, and with key executives in both the service and finance organizations to drive out key requirements and constraints. Some of the key notes that should come out of those discussions could be:

The past capability in the datacenter was an administration load of ~4 servers.
There is not much extra capacity in the administration team to run many more servers than that.
The new Widget firm needs to stay asset light and keep as little capital on the books as possible, while keeping operational expenses under the old total of running things on-site.
The application will idle the entire service organization if an outage lasts more than 24 hours.
Each hour of downtime within that first day would cost a couple thousand USD in losses and extra service time, but a few minutes of unexpected downtime would be acceptable if recognized and remediated quickly.
Each hour after that could cost $25,000 or more in business impact.
When the application was part of Contoso, Ltd, the service department supported it with a “3 nines” (99.9%) commitment to the business. Now that they are on their own, the Widget support department would prefer a better commitment from the service provider, if that is possible for the same money.
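
To make the two cost tiers in the notes above concrete, here is a minimal sketch of how we might estimate the business impact of an outage. The figures are assumptions for illustration only: "a couple thousand USD" is taken as $2,000 per hour for the first 24 hours, and $25,000 per hour is used as the floor for every hour after that. The function and parameter names are hypothetical.

```python
def downtime_cost(hours_down, early_rate=2_000.0, late_rate=25_000.0,
                  threshold_hours=24.0):
    """Estimate outage cost in USD using a two-tier hourly rate.

    early_rate      -- assumed cost per hour during the first day
    late_rate       -- assumed minimum cost per hour once the service
                       organization is idled
    threshold_hours -- point at which the higher rate begins
    """
    if hours_down < 0:
        raise ValueError("hours_down must be non-negative")
    # Hours billed at the lower, first-day rate.
    early_hours = min(hours_down, threshold_hours)
    # Hours beyond the threshold, billed at the higher rate.
    late_hours = max(hours_down - threshold_hours, 0.0)
    return early_hours * early_rate + late_hours * late_rate


if __name__ == "__main__":
    for hours in (1, 4, 24, 30):
        print(f"{hours:>3} h outage: about ${downtime_cost(hours):,.0f}")
```

Under these assumed rates, the step at the 24-hour mark dominates the total, which is why the recovery time objective for this application matters more than shaving minutes off short outages.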

As Contoso’s notional architect, we would likely take these points and look at how different service providers map to the expressed requirements and constraints. For the purpose of our discussion, let’s consider Microsoft’s Windows Azure and Amazon Web Services. (There are many other possible providers out there; one of my personal favorites when the larger offerings do not fit is a recently acquired company called Tier3 that is now CenturyLink Cloud.)

For Microsoft’s Windows Azure platform, we would probably review the pricing pages associated with each service (SLA and support details are at the bottom of each page).

Virtual Machines Pricing Details http://www.windowsazure.com/en-us/pricing/details/virtual-machines/
Virtual Network Pricing Details http://www.windowsazure.com/en-us/pricing/details/virtual-network/
Storage Pricing Details http://www.windowsazure.com/en-us/pricing/details/storage/
Recovery Manager Pricing Details http://www.windowsazure.com/en-us/pricing/details/recovery-manager/
Support Plans http://www.windowsazure.com/en-us/support/plans/

For Amazon Web Services, we would need to consider similar information.

EC2 Compute Resources (SLA document includes details for EBS storage) http://aws.amazon.com/ec2/sla/
RDS Managed Database Service http://aws.amazon.com/rds/sla/
We might also look at some of the availability and load management services:
- Elastic Load Balancing http://aws.amazon.com/elasticloadbalancing/
- Auto Scaling http://aws.amazon.com/autoscaling/
- CloudFormation http://aws.amazon.com/cloudformation/

Key points that we should come away with here as architects include understanding that in Azure, to get to 99.95% availability, we need 2 or more VMs in an availability set. Amazon Web Services uses a 2-tier service credit model for its service level definition, where the 30% service fee credit applies ONLY when monthly availability falls below the second threshold, which corresponds to a larger outage. We should also consider whether managed database services on either platform could simplify management, and whether those services carry commitments that meet the needs of Contoso Widget Systems.
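
It can help to translate these availability percentages into minutes before talking to the business. The sketch below does that for a 30-day month and also models a tiered service credit. The tier thresholds and the 10% lower-tier credit are placeholders I have assumed for illustration; only the 30% figure comes from the discussion above, and the provider's current SLA document should be checked for the real values. All names are hypothetical.

```python
def allowed_downtime_minutes(availability_pct, days=30):
    """Minutes of downtime per period that still meet the commitment."""
    total_minutes = days * 24 * 60
    return (100.0 - availability_pct) / 100.0 * total_minutes


def service_credit_pct(measured_availability_pct,
                       tiers=((99.0, 30.0), (99.95, 10.0))):
    """Return the credit (percent of monthly fee) for a measured availability.

    tiers -- (threshold_pct, credit_pct) pairs; ASSUMED values for
             illustration, not quoted from any provider's SLA.
    """
    # Check the lowest threshold first so the largest credit wins.
    for threshold, credit in sorted(tiers):
        if measured_availability_pct < threshold:
            return credit
    return 0.0


if __name__ == "__main__":
    for pct in (99.9, 99.95):
        print(f"{pct}% allows about "
              f"{allowed_downtime_minutes(pct):.1f} minutes per 30 days")
    for measured in (99.99, 99.5, 98.5):
        print(f"measured {measured}% -> "
              f"{service_credit_pct(measured):.0f}% credit")
```

A commitment of "3 nines" works out to roughly 43 minutes per 30-day month, and 99.95% to roughly 22 minutes, which lines up with the earlier note that a few minutes of unexpected downtime is tolerable.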

REMEMBER: Service Level Agreements (SLA) deal primarily with systems. Also look at the Operational Level Agreements (OLA) internal to your customer/organization as well as at the cloud provider! Can you reach someone from your organization’s business continuity function in the middle of the night? How? Will you be able to get support quickly from your cloud vendor? What standard(s) do you have to meet in a DR event? What if that DR event is due to the action of YOUR team? Does that change the provider’s availability and support commitments?

In the end, there are hundreds of different service configurations that would allow us to put two or more front-end servers in front of the database that stores the information, and then potentially provide secure connectivity to the upstream applications that do the assignment.

We would likely have several different views for this specific architecture that correspond to the viewpoints that address our stakeholders:

Financial Executive (fewer than half of CIOs report to CEOs; at many companies a CFO or similar will ultimately review projects for approval)
Service Executive / Management Team
IT Management/Leadership

To which we would add the requisite implementer views of our recommendations:

Infrastructure physical design
Application diagram(s)
Recommendations for application update(s)
Input into run / operational processes

Finding the right combination of provider and features, and then describing what that combination is, how it relates to the functional attributes (and constraints), and what business impact our recommendation has for the organization, is where our value as architects becomes apparent. Successful delivery means the required business function delivered in a maintainable system that meets requirements (including those captured as quality attributes) while staying within constraints (including cost).

[The author has larger real-world examples that can be sanitized with some effort to examine this subject more closely, including the decisions made and their impacts. If you are interested, please reach out, like, or comment on this post, because sanitizing them for discussion is a non-trivial effort and will only be undertaken if there is community interest.]

Wayne Anderson (@NoCo_Architect) is an Infrastructure Managed Services Architect with Avanade, a company that helps customers realize results in a digital world through business technology solutions and managed services that combine insight, innovation and expertise focused on Microsoft® technologies. He has completed more than 30 Microsoft certifications in his career alongside credentials from CompTIA and other industry vendors. Mr. Anderson’s past roles include management of global certification with Avanade, as well as a focus on information security and architecture.