The Wayback Machine - https://web.archive.org/web/20160417143744/http://cloudcomputing.sys-con.com/node/3758657

5th State of DevOps Report By @RealGeneKim | @DevOpsSummit #DevOps

Behind the Scenes of the 5th State of DevOps Report

"I only got four hours of sleep last night. I woke up after an anxiety dream about deadlocks in the database."

Anyone who has run an online service probably knows this feeling. And this is what Jez Humble wrote on our Slack channel, 24 hours before the launch of the 5th annual State of DevOps Survey. As I have mentioned many times, I've learned more doing this project than any project in my professional career. This has been a four-year collaboration with Jez Humble and Dr. Nicole Forsgren, as well as Nigel Kersten and Alanna Brown from Puppet Labs.

This year Jez, Nicole, and I created DevOps Research and Assessment (DORA). Our goal is to take what we've learned from analyzing responses from more than 20,000 respondents over the last four years and use it to help organizations assess and improve both their practices and their performance.

If you're interested in survey design or survey analysis, you can read more in the 2014 State of DevOps Report: Statistics Class Edition.

In this post, I want to give you a behind-the-scenes look at the 24 hours leading up to the survey launch, which ran on a survey tool we built ourselves, as well as some of our lessons learned. In previous years, we used a fantastic tool called SurveyGizmo to execute the survey. Puppet Labs had used SurveyGizmo for all their internal and external surveys, so it made sense to use something that was familiar.

However, this year, we decided to use a survey engine that Jez Humble wrote. The DORA team has been using this tool for our customers, and Jez persuaded the rest of us that we should use it for the 2016 State of DevOps research. Nigel Kersten, a former Google SRE and now CIO at Puppet Labs, and I both thought this was a preposterous idea because we had never tested the survey engine at scale, but Jez can be very convincing.

I wanted to share with you what those 24 hours before our launch were like, focusing on how we shored up our production telemetry and made some last-minute contingency plans, as well as some of our top lessons learned. I realize that compared to the stories one hears about at Velocity or DevOps Enterprise Summit, our launch is small potatoes, but for us, the stakes were high, with over six months of preparation depending on how the next few days went.

(And to be perfectly honest, one of my biggest fears was the reputational fallout of screwing this up. I could just imagine the headlines: "co-authors of The Phoenix Project and Continuous Delivery totally screw up, doing a Phoenix Project to themselves.")

The TL;DR version:

  • Even last-minute preparations can pay off
  • Production monitoring can compensate for many shortcuts necessitated by real-world constraints
  • Hosted Graphite and New Relic APM are amazing

Take the Survey! >>>

The 24 Hours Before Launch
The launch of the State of DevOps survey was Tuesday, March 22, 2016. This is when tens of thousands of emails would go out, announcing that the survey was live, from IT Revolution, Puppet Labs, and fellow sponsors Atlassian, Automic, CA Technologies, HP Enterprise, Splunk, and ThoughtWorks.

One minor complication: that week, everyone was scattered around the globe. Jez was working full-time at the amazing 18F organization inside the US Federal Government, but was in London doing a speaking engagement, having just been in India. As the sole engineer, he was doing all the last-minute work during his evenings.

Nicole, our resident researcher and protector of the sanctity of the survey instrument, was in the middle of a weeklong trip to India for her work at Chef, with unreliable hotel internet access.

And me? I was under the gun finishing up the developmental editing work for the upcoming DevOps Handbook (Yes, it's coming!), to stay ahead of the copy editors. I was also down with pneumonia.

So, 24 hours before launch, the entire team was scattered around the globe, each with lots of daytime obligations. Alanna and I had just finished reviewing the survey instrument, and we had a list of issues that we needed to fix.

It was at this moment, pondering the implications of potential code changes to the survey engine, that I realized how many things I had personally wanted to do for this project were still undone, such as writing a testing harness in Ruby and Nokogiri so that we could do load testing. But with only 24 hours to launch, there are only so many corrective actions you can take.

The mental checklist that I felt like we had to get through was along the following lines, all based on what could go wrong and how we could best mitigate those risks:

  • What production metrics were we tracking?
  • Approximately how many concurrent sessions should we be able to handle?
  • How exactly will we know if the survey engine falls over? (Besides Twitter, that is...)
  • What are the exact steps we need to take to fail over to a backup service?

We had the entire project team assembled in Slack, and we started going through these questions.

Last-Minute Production Metrics
On the subject of performance, Jez walked us through his rationale for confidence in his app's ability to handle the expected load. "Even if we had 50K respondents go through in the next month, that's still only 74 respondents per hour, or 1 completed survey per hour, or 8 survey pages per hour. That is nobody's idea of a high load."
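
As a rough check of that arithmetic, here is a minimal Python sketch. It assumes a 28-day month with evenly spread traffic (my assumption, chosen because it reproduces the 74-per-hour figure) and 8 survey pages per respondent, as in the quote. The variable names are illustrative only.

```python
# Back-of-the-envelope load estimate for the survey launch.
# Assumptions (not from the original post): a 28-day month and
# traffic spread evenly across every hour of that month.
RESPONDENTS = 50_000
DAYS = 28
PAGES_PER_SURVEY = 8  # survey pages per respondent, per the quote

hours = DAYS * 24
respondents_per_hour = RESPONDENTS / hours
respondents_per_minute = respondents_per_hour / 60
pages_per_minute = respondents_per_minute * PAGES_PER_SURVEY

print(f"{respondents_per_hour:.0f} respondents per hour")
print(f"{respondents_per_minute:.1f} completed surveys per minute")
print(f"{pages_per_minute:.1f} page loads per minute")
```

Under these assumptions the load works out to roughly one completed survey and about ten page loads per minute, which suggests the last two figures in the quote may be per-minute rather than per-hour rates. Either way, the conclusion stands: this is not a high load.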

Jez further opined that the app runs on the ever-sturdy and well-understood LAMP stack (Linux/Apache/MySQL/PHP), and "there's only one database connection per page load, there are no sessions, and each database call is fast, so we should be clear of any of the classic performance risks."

Good enough for me.

To track how the application was performing, we decided to post the following metrics to a Slack channel every 30 minutes:

  • Number of surveys started and completed
  • Average survey completion time

On top of this, we needed some earlier indicators of problems so that we could take any corrective actions before something catastrophic occurred, and we discussed how to shore up our production metrics. Earlier in the year, I had worked on a fun book tracking project with Tom Limoncelli, and we used a great service called Hosted Graphite.

Hosted Graphite is incredibly simple to use - it's basically Graphite and Grafana as a service. Client libraries are available for Ruby, PHP, Python, and other languages. Within a couple of minutes of creating a new Hosted Graphite account, you can see your metrics displayed. (Bonus: there's a free 14-day trial, and it only takes a minute to sign up. It's a fantastic service.)

Once you have an account set up, you write one line of code to send a metric, like this:

HostedGraphite.send_metric("surveys.completed", 1)

Within an hour, we had the following metrics displayed in a nice dashboard.

  • The number of milliseconds to render each web page (which includes any database queries)
  • Number of people completing each survey: registration page, page 1, page 2, ... page 8

(Quite frankly, I was blown away that any of the metric events got displayed at all. Hosted Graphite lets you send via TCP or UDP. Jez was so paranoid about blocking network calls that he insisted on using UDP. To my amazement, the metric events were actually received. I always assumed UDP packets were doomed to get dropped somewhere, especially since I think the Hosted Graphite servers are in Germany. Although looking at the graphs now, I'm pretty sure many of our UDP events aren't actually making it to Hosted Graphite servers - but some data is better than no data, right?)
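
To illustrate why UDP appeals to someone worried about blocking calls, here is a minimal fire-and-forget sketch in Python. It is not the survey engine's code. The line format (an API key prefix, the metric name, and a value) is my assumption based on Graphite's plaintext protocol, so check the Hosted Graphite documentation before relying on it. The host, port, and key are placeholders supplied by the caller.

```python
import socket


def format_metric(api_key, name, value):
    """Build one plaintext metric line: '<api key>.<name> <value>' plus a newline.

    The exact format is an assumption; consult the service's documentation.
    """
    return f"{api_key}.{name} {value}\n".encode("ascii")


def send_metric_udp(host, port, payload):
    """Send one datagram without waiting for any acknowledgement.

    Returns False instead of raising, so a metrics failure can never
    break the page that is being served.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
        return True
    except OSError:
        return False  # losing a data point beats failing the request
    finally:
        sock.close()


# Build (but do not send) an example payload with a placeholder key.
print(format_metric("YOUR-API-KEY", "surveys.completed", 1))
```

The trade-off is the one described above: the sender never learns whether the datagram arrived, so some data points may silently go missing.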

Hosted Graphite also has an Alerting module that is in Beta, so we set an alert to post to our Slack channel if any web page took more than 10 ms to render.
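
The idea behind that alert, measuring render time in milliseconds and flagging anything over a threshold, can be sketched as follows. This is an illustration only: our alert was configured inside Hosted Graphite, not written in code, and the names `timed` and `slow_pages` are hypothetical.

```python
import time
from contextlib import contextmanager

ALERT_THRESHOLD_MS = 10.0  # the 10 ms threshold mentioned above


@contextmanager
def timed(results, name):
    """Record the wall-clock duration of the enclosed block, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[name] = (time.perf_counter() - start) * 1000.0


def slow_pages(results, threshold_ms=ALERT_THRESHOLD_MS):
    """Return the names of pages whose render time exceeded the threshold."""
    return sorted(name for name, ms in results.items() if ms > threshold_ms)


timings = {}
with timed(timings, "page1"):
    time.sleep(0.02)  # stand-in for rendering a page, including its database query

print(slow_pages(timings))
```

In practice the recorded value would be sent as a metric, with the threshold check left to the monitoring service.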

Jez also installed New Relic APM, which has saved the butts of an entire generation of programmers who need to figure out "Holy crap, why are all our database calls taking so long?" New Relic would allow us to have lots of telemetry if something started going wrong with our database calls.

(Incidentally, the fact that we were working in Slack not only made it possible for a team scattered around the world to work effectively, it also made it easy to recover accurate timelines after memories fade. I referred to the channel history constantly while writing this article, and all the screenshots came from our Slack channel, as well. Seriously, if you're not using something like Slack, Flowdock, or HipChat, you need to try it! Trust me, once you do, you'll never go back!)

Planning for Failure
As a fallback, Alanna also replicated the entire survey in SurveyGizmo. If something went wrong, Jez would redirect all traffic to our SurveyGizmo site. Although as I write this, I now realize that the only person who knew how to do this and had the relevant login credentials was Jez. Given that he was going to be on a transatlantic flight from London to San Francisco three days after launch, this would have left us unable to fail over. Oops! We should have walked through exactly what that procedure was and documented it.

(Huh! Apparently, Jez had walked Nigel Kersten through this procedure, and provided all the necessary login credentials - I missed that Slack message from 3 a.m.  Having all these procedures in one Google Doc would have been great, so that you don't need to have read all the Slack messages.)

We also discovered that Jez was being extremely parsimonious (that's a fancy word for "cheap"), having put us on a relatively small VM (256MB RAM + 1 CPU) on Gandi, his favorite hosting platform. After some persuasion, he finally bumped us up to 2 CPUs and 2GB of RAM (which is still pretty puny - did I tell you that Jez was cheap?)

Seriously, not putting us on a bigger VM is just silly. On that project that Tom Limoncelli and I were working on, Tom had initially created an "f1-micro" VM in Google Compute Engine which has 512MB of RAM. It actually worked great for months, until I needed to install some Python libraries that needed to compile OpenSSL. For hours, I pounded my head against the wall trying to figure out why the "pip install" command was failing, when I finally realized the load average was pegged at 9+. Out of memory. I probably spent four hours on that and similar problems, which could have been fixed by spending $7 more per month to run on a 1.5 GB memory "g1-small" instance. In my opinion, our letting Jez run on a measly 2GB of RAM was actually an unnecessary risk.

We also started hourly database backups to S3 from a cronjob to make sure we didn't lose data, and verified that we could actually restore the database.
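
For readers who want a starting point, here is a hedged sketch of what an hourly backup job might assemble. It is not our actual cronjob: the database name, bucket, file paths, and the use of the `mysqldump` and `aws` command-line tools are all assumptions, and the function only builds the shell commands rather than running them.

```python
import shlex
from datetime import datetime, timezone


def backup_commands(database, bucket, now=None):
    """Return the shell commands for one timestamped dump and upload.

    Hypothetical layout: dumps go to /tmp and then to s3://<bucket>/backups/.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    dump_file = f"/tmp/{database}-{stamp}.sql.gz"
    s3_url = f"s3://{bucket}/backups/{database}-{stamp}.sql.gz"
    dump = (f"mysqldump --single-transaction {shlex.quote(database)} "
            f"| gzip > {shlex.quote(dump_file)}")
    upload = f"aws s3 cp {shlex.quote(dump_file)} {shlex.quote(s3_url)}"
    return [dump, upload]


for command in backup_commands("survey", "example-backup-bucket"):
    print(command)
```

A backup is only useful if it restores, so the other half of the job is periodically loading a dump into a scratch database and checking that the data is there, as we did before launch.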

Launch Day
It's 7 a.m. on Tuesday - one hour before the email campaigns launch. Jez is working on closing a potential problem involving links that could accidentally disclose user data, but he is leery of making significant changes that could introduce a fatal error. (More on this later.)

Alanna has been making changes to the survey throughout the night, and Nicole is almost finished making her changes, too. We just discovered that in some cases, only Jez can make code changes, and only Nicole and Alanna can make some survey changes - this means that correcting certain errors actually has to involve two people. A very amusing "non-devopsy" situation!

By the way, the fact that Alanna is making changes to the survey, only hours after Jez showed her how, is so cool. I'm always blown away by how fearless people like Alanna are - unlike scaredy-cats like me. People like that are such a tremendous asset. And because Nicole was having intermittent network issues, thank goodness Alanna was able to make these changes on her behalf!

(And in contrast to Alanna's fearlessness, in a very ironic moment, Jez was afraid to make any significant code changes on the day of launch. If you've heard enough of Jez's talks, you'll know that he espouses fearlessly making production changes, so this was quite amusing to us all. But his fear was for a good reason - he had only gotten 4 hours of sleep because of dreams of database locking problems, and had other obligations throughout the day of launch. Frankly, I was petrified of us introducing a conditional error that would send everyone to a blank page.)

Looking back at the Slack logs, one of my favorite moments can be seen in this screenshot - Nicole is making changes from her hotel in India, and her network connection keeps cutting out, which is resulting in very strange session behavior that is freaking Jez out. And Nicole also needs help from Jez and Alanna to confirm that the changes she's pushed were actually pushed.

[Screenshot: Slack messages]

But by 7:30 a.m., all our changes are done and we're finally ready. We all have the Grafana dashboard up from Hosted Graphite, and amazingly, the screenshot below is what we see. The bottom graph shows people marching through the survey pages. The top graph shows the time required to render the survey pages (each of which requires a database call): it doubled to 6 ms, but then stabilized at 4 ms, with the load average hovering at 0.5.

[Screenshot: Hosted Graphite dashboard]

After all the stress of the previous day, we're getting real survey traffic from the email campaigns, and the site is running and staying up!

At 8:24 a.m., Jez exclaims, "Someone got to page 6 of the survey!" We all cheer.

A minute later, Jez observes, "This is like watching a tortoise race," as we're all watching and waiting for people to finish taking the survey (which we eventually learn is taking people 22 minutes, on average).

And I suppose that is one of the lessons of this whole exercise. As Jez said, "We instrumented the crap out of the code so I could sleep at night."

[Screenshot: Hosted Graphite]

The Need for Situational Awareness
Yeah, John Allspaw hates the term "situational awareness," so I try to avoid using it. But one of the assets we had was Nigel constantly scanning social media and all the Slack channels he's on, telling us about things going wrong.

We learned about SSL certificate problems and even a perception that we were leaking user data. And here's a reality - no amount of production telemetry takes the place of real people who care, observing how people in real communities are perceiving and reacting to your work. Plus, Nigel is such a well-respected person that he was the ideal spokesperson to help correct some misperceptions that surfaced.

Also in hindsight, designating someone who can keep a clear head and isn't buried in the weeds to do external communications is vital - when we were talking about how to communicate that we weren't actually leaking survey data, Nigel shortcutted the entire discussion with the following phrase: "I think it's a pretty simple message."  And we all trusted that he could do what needed to be done.

Conclusion
Of course, there was a lot more work leading up to launch than is depicted here. The process of creating a survey instrument with this level of rigour and depth takes a lot of passion and commitment from everyone involved.

We sincerely thank all of you for supporting our efforts by taking the survey and sharing it with your peers and friends!

Take the 2016 State of DevOps Survey now! >>>

More Stories By Gene Kim

Gene is a multiple-award-winning CTO, researcher and author. He was founder and CTO of Tripwire for 13 years. He is a huge fan of IT operations and how it can enable developers to maximize throughput of features from “code complete” to “in production,” without causing chaos and disruption to the IT environment. He is passionate about IT operations, security and compliance, and how IT organizations successfully transform from “good to great.”

IT Revolution assembles technology leaders and practitioners through publishing, events, and research. Our goal is to elevate the state of technology work, quantify the economic and human costs associated with suboptimal IT performance, and to improve the lives of one million IT professionals by 2017.
