The Wayback Machine - https://web.archive.org/web/20121027023324/http://cloudcomputing.sys-con.com:80/node/2416841

Cloud Expo: Blog Post

Lessons Learned from the Amazon Web Services Outage

The only surprising thing about this AWS outage was that anyone was surprised by it

On Monday, Amazon Web Services, the leading provider of cloud services, suffered an outage, and as a result a long list of well-known and popular websites went dark. According to Amazon’s Service Health Dashboard, the outage began as degraded performance of a small number of Elastic Block Store (EBS) storage volumes in the US-EAST-1 Region, then grew to include problems with the Relational Database Service and Elastic Beanstalk as well.

WEBSITE DOWN: AWS outage takes down Reddit and other popular sites

The only surprising thing about this AWS outage was that anyone was surprised by it. It wasn’t the first time AWS had a major outage or problems with this data center. If you remember, back in June a line of powerful thunderstorms knocked the power out at a major Amazon hosting center. The backup generator failed, then the software failed, and, well, you know the drill. A corollary of Murphy’s Law is that if multiple things can go wrong, they will all go wrong at once.

In both of these instances (and in all Amazon Web Services outages, in fact) some customers were knocked “off the air” while others continued running without a hiccup. You would think that companies would eventually learn to anticipate the inevitable AWS outages and take active steps to prepare for them. There are best practices and solutions for reducing vulnerability to an outage, but they’re rarely implemented. That’s because people don’t think that anything could happen to Amazon. Obviously, things happen.

Instances like this are a learning opportunity if we take the time to think about why they happened and what could have been done to prevent them. Here are six lessons that I think we can learn from the Amazon Web Services outages.

Lesson 1 — Clouds are made of components that can fail. When people think of the cloud, they picture some amorphous and untouchable blob up in the sky. And while that’s a nice bit of marketing, it is not a useful model for operational planning. Be mindful of your cloud provider’s architecture and how it is built to manage the failure of a component or a zone blackout. Then anticipate that failures can happen at any point in the cloud infrastructure.

Lesson 2 — The stress of failure will trigger a cascade of other failures. After reading a description of the outage, you get the sense that it was just one thing after another. What started as a small issue affecting one Northern Virginia data center quickly spread, causing a chain reaction and outage that disrupted much of the Internet for several hours. Remember Murphy and his law?

Lesson 3 — Spikes matter. When a cloud fails, hundreds of customers are impacted. As they all try to recover at once, they stress the cloud provider’s infrastructure with a peak load that is guaranteed to cause even more problems. These transition spikes feed on themselves: every time you reboot, it takes longer and longer. If you have ten servers doing that, that’s bad. If you spike a thousand servers, that’s really bad. Something that would have taken five minutes to fix will now take five hours once you get into that kind of transition syndrome.
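
One common way to keep your own systems from adding to a recovery spike is to retry with exponential backoff plus random jitter, so that a thousand servers do not all reconnect at the same instant. The Python sketch below is only an illustration of that idea; the function names, the parameters, and the default values are assumptions made for this example and are not taken from AWS or from the outage report.

```python
import random
import time


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return `attempts` retry delays in seconds using full-jitter backoff.

    Each delay is drawn uniformly from 0 up to min(cap, base * 2**n),
    so the allowed wait grows with every retry but never exceeds `cap`.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


def call_with_retries(operation, attempts=5, base=1.0, cap=60.0,
                      sleep=time.sleep, rng=None):
    """Call operation(); on failure wait a jittered delay, then try again.

    `operation` is a hypothetical callable standing in for any request
    to a cloud service. Real code should catch narrower exception types.
    """
    delays = backoff_delays(attempts - 1, base, cap, rng)
    for n in range(attempts):
        try:
            return operation()
        except Exception:
            if n == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(delays[n])
```

Because every client picks a different random delay, the retries spread out over time instead of arriving as one synchronized wave.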

Lesson 4 — Cloud providers provide the tools to manage failure, but it is up to you to put your own failover plans in place. AWS, for example, is divided into geographic regions, each made up of several availability zones. If a component in the Virginia region goes down and takes everything there with it, then (in theory) you should be able to move all your data to another region. That other region might be hosted, unaffected, in Ireland, and then you are up and running again. This is one of the big differences between the cloud and more traditional approaches to IT. It is up to the application (and by extension, the application’s designer) to manage its interaction with the cloud environment, up to and including failover. Most cloud providers offer tools and frameworks to support failover, but you are responsible for building that best practice into your system operations and into your applications.
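
To make the idea of application-managed failover concrete, here is a minimal Python sketch in which the application itself chooses the first healthy endpoint from a priority list. It is a simplified illustration, not a description of any AWS tool: the endpoint URLs are hypothetical placeholders, and `pick_endpoint` and `http_health_check` are names invented for this example.

```python
import urllib.request


def http_health_check(url, timeout=2.0):
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection errors, timeouts, and HTTP errors all count as unhealthy.
        return False


def pick_endpoint(endpoints, is_healthy=http_health_check):
    """Return the first (name, url) pair whose health check passes.

    `endpoints` is a list of (name, url) tuples in priority order,
    for example the primary region first and the standby region second.
    """
    for name, url in endpoints:
        try:
            if is_healthy(url):
                return name, url
        except Exception:
            continue  # a crashing health check is treated as a failure
    raise RuntimeError("no healthy endpoint available")


# Hypothetical endpoints: a primary in Virginia and a standby in Ireland.
ENDPOINTS = [
    ("us-east-1", "https://us-east.example.com/health"),
    ("eu-west-1", "https://eu-west.example.com/health"),
]
```

A real deployment also has to replicate data to the standby region ahead of time; routing traffic is only one part of the plan.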

Lesson 5 — You need to put your failover plans through a full-blown load test. It’s not enough to have a strategy in place for failover. You have to test it under real-world conditions. Even the best-laid failover plans, once designed and implemented, might have hiccups when a real outage occurs. A full-blown cloud load test can help you see how long the failover process will take to kick in and what other dependencies might need to be sorted out. Obviously this isn’t easy. If it were, Reddit, Foursquare, Airbnb and others wouldn’t have been impacted by the AWS outage.
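
One number worth capturing during such a test is how long the service stays unreachable after the primary is taken down. The Python sketch below shows one simple way to measure that by polling a probe until it succeeds. It is an assumption-laden illustration: `measure_recovery` and its parameters are hypothetical names, and `probe` stands in for whatever check fits your application.

```python
import time


def measure_recovery(probe, timeout=300.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True and report the elapsed seconds.

    Returns None if the service has not recovered within `timeout`.
    `clock` and `sleep` are injectable so the function can be tested
    without waiting in real time.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Start the measurement at the moment you trigger the simulated outage, and keep the load test running so the result reflects failover under realistic traffic.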

Lesson 6 — Conduct fire drills. A load test will confirm that your failover plan works as you expect, and it will also give your team some real experience in executing the plan. Remember the fire drills you used to do in school? Fire drills train students, teachers, and others to know exactly what they’re supposed to do and where they’re supposed to go in the event of an emergency. All the bugs in the process are worked out during the fire drill, and the more everybody does the drills, the more comfortable they are with what they need to do. And if a real emergency happens, everybody knows how to leave the building calmly. You want to do the same thing with your failover plan, and load testing can help you get there. Fire drills save lives and load tests save cloud apps.

Is your failure worth more than $28?

Amazon offers reimbursement to its customers based on the amount of downtime the customer experiences. The last time Amazon Web Services went down on us, we got a $28 reimbursement. So my final lesson learned (I guess this makes for seven lessons) is this: The cost of downtime for your organization (in lost revenue, poor customer experience, and so on) is far, far greater than what you are paying your cloud provider. $28 is not going to save your day. You have to make sure that you have a failover solution that’s ready and working. Don’t wait for Amazon to solve this problem for you, because to Amazon it’s only a $28 problem.
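
A quick back-of-the-envelope calculation makes the gap visible. In the Python sketch below, only the $28 credit comes from our own experience; the outage length and the revenue per hour are made-up figures for illustration, and `downtime_cost` is a hypothetical helper.

```python
def downtime_cost(hours_down, revenue_per_hour, sla_credit=28.0):
    """Compare lost revenue during an outage with the provider's credit.

    All inputs are estimates you supply; lost revenue here ignores
    harder-to-measure costs such as poor customer experience.
    """
    lost_revenue = hours_down * revenue_per_hour
    return {
        "lost_revenue": lost_revenue,
        "sla_credit": sla_credit,
        "net_loss": lost_revenue - sla_credit,
    }


# Hypothetical example: a five-hour outage at $1,000 of revenue per hour.
example = downtime_cost(hours_down=5, revenue_per_hour=1000.0)
```

With those assumed inputs, the lost revenue is $5,000 against a $28 credit, which leaves a net loss of $4,972.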

The biggest lesson learned from these AWS outages is that you need to configure properly and you need to train your people. These types of events will always happen, and when they do, you need to be trained ahead of time. Load testing itself is a good way to validate and train. That way when a real emergency occurs, your team can react in a calm, collected manner to a situation they’ve experienced dozens of times before.

More Stories By Sven Hammar

Sven Hammar is Co-Founder and CEO of Apica. In 2005, he had the vision of starting a new SaaS company focused on application testing and performance. Today, that concept is Apica, the third IT company he has helped found in his career.

Before Apica, he co-founded and launched Celo Communication, a security company built around PKI (e-ID) solutions. He served as CEO for three years and helped grow the company from five people to 85 people in two years. Right before co-founding Apica, he served as the Vice President of Marketing, Bank and Finance at the security company Gemplus (GEMP).

Sven received his master of science in industrial economics from the Institute of Technology (LiTH) at Linköping University. When not working, you can find Sven golfing, working out, or with family and friends.
