Nora Jones has worked in chaos engineering since her stint at Jet.com in 2015 and is the founder of Jeli.io. Through blogging about her experiences, she met Casey Rosenthal, who led the chaos engineering team at Netflix and has just co-founded a company called Verica. Both have extensive experience with chaos engineering and a keen understanding that it is more about people than tools.
What Is Chaos Engineering?
The first question that comes up is an obvious one: What is chaos engineering? Casey explains that it is continuous experimentation on a distributed system to find systemic weaknesses. It's more than the idea of "breaking stuff in production." Breaking stuff on its own provides no business value.
What is more valuable is a discipline that proactively improves distributed system availability, rather than one that merely reacts to breakages. Almost all of our established availability practices—observability, alerting and the like—are reactive. Netflix went further by looking at how to prevent systemic failures before they happen, and that work was absorbed into its chaos engineering practice. The team gives engineers a procedural way to generate systemic insight and business value, and lets developers put context around known unknowns.
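The experimentation Casey describes can be sketched as a steady-state hypothesis check: measure a baseline metric, inject a fault, and see whether the steady state survives. This is a minimal toy illustration, not any team's actual tooling; `DemoSystem`, its success rates, and the tolerance are all hypothetical.

```python
import random

random.seed(0)  # deterministic for this illustration

class DemoSystem:
    """Toy stand-in for a service with a redundant dependency (hypothetical)."""
    def __init__(self, dependency_up=True):
        self.dependency_up = dependency_up

    def handle_request(self):
        # With the dependency down, a fallback path serves ~90% of requests.
        threshold = 0.999 if self.dependency_up else 0.9
        return random.random() < threshold

def steady_state_metric(system, requests=1000):
    """Steady-state metric: fraction of requests that succeed."""
    return sum(system.handle_request() for _ in range(requests)) / requests

def run_experiment(tolerance=0.05):
    baseline = steady_state_metric(DemoSystem(dependency_up=True))
    degraded = steady_state_metric(DemoSystem(dependency_up=False))  # inject the fault
    # The hypothesis holds only if the steady state survives the injected fault.
    return (baseline - degraded) <= tolerance

print("hypothesis holds:", run_experiment())
```

Here the experiment disproves the hypothesis: the fallback degrades the steady state beyond tolerance, which is exactly the kind of systemic weakness the practice is meant to surface.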
How Can Chaos Engineering Be Used?
To Nora, chaos engineering can be used in many ways. At Jet.com, she applied it using Chaos Monkey to find single points of failure at the node level. At Netflix, the team used a tool called ChAP to run experiments on how certain disruptions affect the user experience.
For example, what happens if something goes wrong with Netflix’s bookmark service, which tracks where people are in a show or movie? These experiments help figure out which systems are critical to the viewer’s experience and ensure those systems are resilient to failures. In one experiment, the Netflix team thought the bookmark service would not be critical to the customer; however, it turns out failing it caused customers to become very confused as to where they were in the stream, and users hammered the system trying to rewind and fast-forward to the right place.
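An experiment like the bookmark one is often run as a canary comparison: route a small slice of users into an experiment group with the fault injected, keep a same-sized control group untouched, and compare a user-experience metric between them. The sketch below assumes hypothetical numbers and a made-up `playback_success` metric; it is not Netflix's actual system.

```python
import random

random.seed(42)  # deterministic for this illustration

def playback_success(bookmark_up):
    """Hypothetical per-user outcome: did playback start where expected?
    When bookmarks fail, some confused users retry instead of watching."""
    return random.random() < (0.99 if bookmark_up else 0.80)

def canary_comparison(users=2000, fraction=0.05, max_divergence=0.03):
    # Only a small fraction of users is exposed, limiting the blast radius.
    n = int(users * fraction)
    control = sum(playback_success(True) for _ in range(n)) / n
    experiment = sum(playback_success(False) for _ in range(n)) / n  # fault injected
    # If the groups diverge beyond tolerance, the service is critical after all.
    safe = (control - experiment) <= max_divergence
    return control, experiment, safe

control, experiment, safe = canary_comparison()
print(f"control={control:.2f} experiment={experiment:.2f} safe={safe}")
```

In this sketch the experiment group's metric collapses relative to the control, mirroring the surprise in the bookmark story: a service assumed non-critical turns out to matter a great deal.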
Ultimately, chaos engineering is about understanding the boundaries of your safety margin in a way no other practice can show you. Not even observability can provide those insights. Chaos engineering lets you know whether you're standing on solid ground or on the edge of a cliff. It changes the way engineers talk about their service, and it keeps developers from waving off work with claims like "Oh, we aren't critical, so it can't be us."
Documenting the Journey of Chaos
Even preparing for chaos engineering provides value to an organization. Nora talks about how she was exploring whether a system was critical or not, and what that means. She realized that the information she was gathering would be valuable to many other teams. So, she built a platform where other developers could see it and learn how to apply it to their own situations.
Casey added that it’s great when a team can do enough research that they don’t even need to run a chaos experiment—they already know what will fail and what its impact will be.
In other words, the process of designing a chaos experiment is often just as valuable as running the experiment itself. It’s also valuable to first analyze your incidents over time before just jumping into chaos engineering. Your incidents can act as a map of the biggest potential critical parts of your system. Focus on learning, not just experimentation.
Chaos and Threat Modeling
Some of these aspects of chaos engineering are similar to threat modeling. Is there a relationship between these two practices? Casey believes there is some similarity. They both study complex systems that no single person can understand all aspects of.
Threat modeling requires engineers to write tests for the properties of the system they know about. They can’t write tests for the parts of the system they inevitably don’t know. With chaos engineering, you can explore those unknown properties and flesh them out.
And the two practices can go hand in hand, Nora added. When you model threats, you write down how you assume your system works. Running chaos experiments can allow you to validate that model. In this sense, it’s a lot like applying the scientific method to the availability of a software system.
Stories of Chaos
Both Casey and Nora had some interesting stories about these ideas in practice.
Casey brought up an example that his co-founder had to deal with. They were doing security chaos engineering and formed a hypothesis about how secure their SIM cards were. They found that they could prevent a specific attack only half the time. A security component was logging this exposure, but it did so in a way that was very hard to find. This is a great example of an experiment that can open your organization’s eyes to security flaws you had no idea existed.
Nora shared a story from Jet.com, where teams were very open to running these experiments. Nora kicked off an initial set of services to run Chaos Monkey on, but the experiments took longer than expected, holding up deployments and delaying releases for a lot of services. It was frustrating, but it was also a learning experience. The teams did not give up on chaos engineering; instead, they learned to improve the process and automate away the bottlenecks. It taught the Jet.com teams that chaos engineering is more about culture and process than tooling.
Metrics That Can Help Organize the Chaos
When applying chaos engineering, what metrics can we use to measure its success? Specifically, what business-level metrics can we apply?
Well, one simple way is to have people self-report whether an experiment was effective. Casey noted that you can take these types of qualitative responses and turn them into quantitative metrics. You can take that a step further and ask people around the team if they learned something. Even further, you can ask people using the system if they noticed changed behavior. In each of these steps, it’s progressively more difficult to identify the value that chaos engineering provides.
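Turning self-reports into a number can be as simple as scoring survey answers on a small scale and normalizing. This sketch assumes a made-up survey question and scoring scheme; the response strings and weights are purely illustrative.

```python
from collections import Counter

# Hypothetical post-experiment survey responses: "Did you learn something?"
responses = [
    "learned a lot", "learned a lot", "learned something",
    "learned something", "learned something", "nothing new",
]

# Illustrative scoring scheme for the qualitative answers.
SCORES = {"learned a lot": 2, "learned something": 1, "nothing new": 0}

def learning_score(responses):
    """Convert qualitative self-reports into a 0-1 quantitative metric."""
    total = sum(SCORES[r] for r in responses)
    return total / (2 * len(responses))  # normalize by the maximum possible score

print(Counter(responses))
print(f"learning score: {learning_score(responses):.2f}")
```

Tracked over successive experiments, a score like this gives leadership a trend line for a benefit that is otherwise invisible.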
The limited visibility of chaos engineering's benefits can make it hard to decide when to prioritize it. Usually, some negative service impact, like an outage or another big event, is needed to kick off a set of experiments. One of Nora's favorite formulas for giving chaos engineering visibility is this: performance improvement = insight generation + error reduction.
Often, organizations only focus on error reduction. But making insights visible and writing them down is key to ensuring high performance.
This session was summarized by Mark Henke. Mark has spent over 10 years architecting systems that talk to other systems, doing DevOps before it was cool, and matching software to its business function. Every developer is a leader of something on their team, and he wants to help them see that.