
Lessons Learned from the Amazon Web Services Outage
The only surprising thing about this AWS outage was that anyone was surprised by it
By: Sven Hammar
Oct. 26, 2012 11:00 AM
On Monday, Amazon Web Services, the leading provider of cloud services, suffered an outage, and as a result, a long list of well-known and popular websites went dark. According to Amazon’s Service Health Dashboard, the outage started out as degraded performance of a small number of Elastic Block Store (EBS) storage volumes in the US-EAST-1 Region, then evolved to include problems with the Relational Database Service and Elastic Beanstalk as well.

The only surprising thing about this AWS outage was that anyone was surprised by it. It wasn’t the first time AWS had a major outage or problems with this data center. If you remember, back in June a line of powerful thunderstorms knocked the power out at a major Amazon hosting center. The backup generator failed, then the software failed, and, well, you know the drill. A corollary of Murphy’s Law is that if multiple things can go wrong, they will all go wrong at once.

In both of these instances (and in all Amazon Web Services outages, in fact) some customers were knocked “off the air” while others continued running without a hiccup. You would think that eventually companies would learn to anticipate the inevitable AWS outages and take active steps to prepare for them. There are best practices and solutions for reducing vulnerability to an outage, but they’re rarely implemented. That’s because people don’t think that anything could happen to Amazon. Obviously, things happen.

Instances like this are a learning opportunity if we take the time to think about why they happened and what could have been done to prevent them. Here are six lessons that I think we can learn from the Amazon Web Services outages.

Lesson 1: Clouds are made of components that can fail. When people think of the cloud, they picture some amorphous and untouchable blob up in the sky. And while that’s a nice bit of marketing, it is not a useful model for operational planning.
Be mindful of your cloud provider’s architecture and how it is built to manage the failure of a component or a zone blackout. Then anticipate that failures can happen at any point in the cloud infrastructure.

Lesson 2: The stress of failure will trigger a cascade of other failures. After reading a description of the outage, you get the sense that it was just one thing after another. What started as a small issue affecting one Northern Virginia data center quickly spread, causing a chain reaction and an outage that disrupted much of the Internet for several hours. Remember Murphy and his law?

Lesson 3: Spikes matter. When a cloud fails, hundreds of customers are impacted. As they try to recover, they stress the cloud provider’s infrastructure with a peak load that is guaranteed to cause even more problems. If you get these transition spikes, they get worse and worse. Every time you reboot, it takes longer and longer. If you have ten servers doing that, that’s bad. If you spike a thousand servers, that’s really bad. Something that would have taken five minutes to fix will now take five hours once you get into that kind of transition syndrome.

Lesson 4: Cloud providers provide the tools to manage failure, but it is up to you to put your own failover plans in place. AWS, for example, is divided into regions, each made up of several availability zones. If a component in the Virginia region goes down and takes the whole region with it, then (in theory) you should be able to move all your data to another region. That other region might be hosted, unaffected, in Ireland, and then you are up and running again. This is one of the big differences between the cloud and more traditional approaches to IT. It is up to the application (and by extension, the application’s designer) to manage its interaction with the cloud environment, up to and including failover.
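The recovery spikes from Lesson 3 and the application-level failover from Lesson 4 can be combined in one small illustration. The following is only a sketch under stated assumptions, not how any particular site or AWS tool does it: `call_with_failover`, `backoff_delay`, and `fetch_status` are hypothetical names, the `operation` callable stands in for whatever request your application makes, and a `ConnectionError` is assumed to signal that a region is unreachable. Retries wait a randomized, exponentially growing delay so that many recovering clients do not all hit the provider at the same instant.

```python
import random
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when no region returned a result."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_failover(
    operation: Callable[[str], T],
    regions: Sequence[str],
    attempts_per_region: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each region in order; back off with jitter between retries."""
    last_error: Optional[Exception] = None
    for region in regions:
        for attempt in range(attempts_per_region):
            try:
                return operation(region)
            except ConnectionError as exc:  # assumed failure signal
                last_error = exc
                # Wait only between retries in the same region; move on
                # to the next region immediately after the last attempt.
                if attempt < attempts_per_region - 1:
                    sleep(backoff_delay(attempt))
    raise AllRegionsFailed(f"no region answered: {list(regions)}") from last_error


if __name__ == "__main__":
    def fetch_status(region: str) -> str:
        # Hypothetical stand-in for a real request; pretend Virginia is down.
        if region == "us-east-1":
            raise ConnectionError("region unavailable")
        return f"ok from {region}"

    # The sleep is stubbed out so the demo returns immediately.
    print(call_with_failover(fetch_status, ["us-east-1", "eu-west-1"],
                             sleep=lambda seconds: None))
    # prints "ok from eu-west-1"
```

The jitter is the part that matters for spikes: without it, a thousand servers that failed together would also retry together.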
Most cloud providers offer tools and frameworks to support failover, but you are responsible for building that best practice into your system operations and your applications.

Lesson 5: You need to put your failover plans through a full-blown load test. It’s not enough to have a strategy in place for failover. You have to test it under real-world conditions. Even the best-laid failover plans, once designed and implemented, might have hiccups when a real outage occurs. A full-blown cloud load test can help you see how long the failover process will take to kick in and what other dependencies might need to be sorted out. Obviously this isn’t easy. If it were, Reddit, Foursquare, Airbnb and others wouldn’t have been impacted by the AWS outage.

Lesson 6: Conduct fire drills. A load test will confirm that your failover plan works as you expect, and it will also give your team some real experience in executing the plan. Remember the fire drills you used to do in school? Fire drills help train students, teachers, and others to know exactly what they’re supposed to do and where they’re supposed to go in the event of an emergency. All the bugs in the process are worked out during the fire drill, and the more everybody does the drills, the more comfortable they are with what they need to do. And if a real emergency happens, everybody knows how to leave the building calmly. You want to do the same thing with your failover plan, and load testing can help you get there. Fire drills save lives and load tests save cloud apps.

Is your failure worth more than $28? Amazon offers reimbursement to its customers based on the amount of downtime the customer experiences. The last time Amazon Web Services went down on us, we got a $28 reimbursement. So my final lesson learned (I guess this makes for seven lessons) is this: The cost of downtime for your organization (in lost revenue, poor customer experience, etc.) is far, far greater than just what you are paying your cloud provider. $28 is not going to save your day. You have to make sure that you have a failover solution that’s ready and working. Don’t wait for Amazon to solve this problem for you, because to Amazon it’s only a $28 problem.

The biggest lesson learned from these AWS outages is that you need to configure properly and you need to train your people. These types of events will always happen, and when they do, you need to be trained ahead of time. Load testing itself is a good way to validate and train. That way when a real emergency occurs, your team can react in a calm, collected manner to a situation they’ve experienced dozens of times before.
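One number a fire drill can produce is how long the failover actually takes to kick in. As a rough sketch, assuming you have some health check you can call repeatedly, the helper below polls it and reports the elapsed time. The names `measure_recovery_seconds` and `is_healthy` are hypothetical, and the timeout and polling interval are placeholder values to adjust for your own drill.

```python
import time
from typing import Callable


def measure_recovery_seconds(
    is_healthy: Callable[[], bool],
    timeout: float = 300.0,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll is_healthy() until it returns True and return the elapsed seconds.

    Raises TimeoutError if the service has not recovered within `timeout`.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError(f"no recovery within {timeout} seconds")
        sleep(poll_interval)
```

In a drill you would take the primary down on purpose, call this with a check against your public endpoint, and log the result each time so the team can see whether recovery is getting faster.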