All Day DevOps - Blog

Swim Don’t Sink: Why Training Matters to an SRE Practice

Written by Sylvia Fronczak | Jul 1, 2020 8:06:00 PM

Today we’re going to talk about why training matters and why the learning method of “sink or swim” doesn’t cut it when you are ramping engineers up on your team.

Our speaker on the subject is Jennifer Petoff, Head of SRE Education at Google and co-editor of the book Site Reliability Engineering: How Google Runs Production Systems.

Why Is Training Important?

With SRE, there’s so much to learn. Many people believe that unleashing a firehose of information is an effective approach. However, we retain only a small percentage of the material presented to us in a lecture format. Because of this, the most important element of training is building confidence and fighting imposter syndrome. Training can also give your organizational culture a lift.

There are a variety of training options available on a continuum of effort:

  1. Sink or swim
  2. Self-study curriculum
  3. Buddy system
  4. Ad hoc classes
  5. Systematic training programs

First, note that avoiding a “sink or swim” approach is important if you value inclusivity. “Sink or swim” breeds stress, frustration, attrition, and imposter syndrome.

Higher-touch options signal leadership commitment to development, help ensure that everyone is speaking with one voice, and can reinforce desired behaviors to support a shift in organizational culture.

So what should you actually teach people?

The answer to this question is based on a few different factors:

  • maturity
  • familiarity
  • experience

Maturity refers to how far along on the SRE journey your organization is. Familiarity covers how familiar the individual is with your organization (are they new or have they worked here a long time?). Experience covers their background in SRE.

How to Build a Training Program for a Less Experienced Team

Here are some basic steps to get started if you are just starting out on your SRE journey as an organization.

Step 1: Address any skill gaps. Does your team share the tools, defect tracking systems, and other processes and knowledge it needs?

Step 2: Know your team and tailor the message. For example, people who have been in the organization for a long time may resist adopting SRE principles, practices, and culture. They may think, “What’s in it for me?” On the other hand, people new to the organization are more likely to go with the flow. If you have members of your team who have practiced SRE elsewhere, these are your catalysts. Let them share their stories.

Tailor your message for the people who make up your organization.

How to Build a Training Program for a High Maturity Case

What about organizations that already have an established SRE practice?

In this case, Step 1 involves assessing the team mix. Here, we have newbies, internal transfers, “old-timers,” and industry veterans. Once you assess this mix, you’ll want to take a look at what they all need.

Newbies, for example, need to learn your infrastructure, systems, and ways of working, while internal transfers likely need to focus more on learning SRE principles and practices rather than your specific systems.

Now let’s consider what we add into our training program. You should look at both the what and the how. The what (training content) will be influenced by your mix of people, as we talked about above.

When looking at the how, consider how much effort you want to invest. The level of investment depends on (a) the size of your organization and (b) how fast you are growing. For a small company, start with shadowing and mentoring. As the size of your organization increases, look at ongoing education options. Larger companies will get more benefit from investing in a structured training program.

What Can We Learn from SRE Principles and Apply to Training Operations?

Let’s get meta for a moment and talk about how SRE principles can be applied to running the training program itself. Consider the service reliability hierarchy, a framework highlighted in the original SRE Book that covers the elements that go into making a service reliable, from most foundational to most advanced. We can then develop a training hierarchy in order to apply key SRE principles to training program operations.

When developing your training, is more effort always better? No.

As with SRE practices, you should do just enough to meet the needs of your students. Keep them happy—but not too happy—and consider what tradeoffs you’re making when creating your training program.

You should also monitor your trainings. Get feedback from your students and iterate. 

For example, Google took feedback from their students that indicated training was passive and less engaging than they would have liked. So the team moved away from a lecture and made the training much more hands-on. They developed a training program that allowed the students to troubleshoot a problem. This provided them with immediate observable feedback on the effectiveness of the training. The monitoring reflected these improvements as well! 
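To make the "monitor and iterate" idea concrete, here is a minimal Python sketch of how a team might track post-training survey scores against a target, much like checking a service against an SLO. Everything in it is an assumption for illustration: the 1–5 survey scale, the two thresholds, the `review_training` helper, and the sample scores are all hypothetical and do not come from the talk or from Google's program.

```python
from statistics import mean

# Hypothetical thresholds on a 1-5 survey scale; neither number comes from the talk.
TARGET_SCORE = 4.0      # assumed "good enough" satisfaction target
OVERINVEST_SCORE = 4.8  # assumed level above which extra effort may not pay off


def review_training(scores):
    """Classify one training session from its list of 1-5 survey scores."""
    if not scores:
        return "no-data"  # nothing to monitor yet: collect feedback first
    average = mean(scores)
    if average < TARGET_SCORE:
        return "iterate"  # students are not happy enough: rework the session
    if average >= OVERINVEST_SCORE:
        return "review-investment"  # "too happy": check the tradeoffs being made
    return "on-target"


# Made-up survey results for a lecture version and a hands-on version.
sessions = {
    "lecture": [3, 3, 4, 2, 4],
    "hands-on": [4, 5, 4, 4, 5],
}

for name, scores in sessions.items():
    print(f"{name}: mean={mean(scores):.1f} -> {review_training(scores)}")
```

With these made-up numbers, the lecture session falls below the target and is flagged for another iteration, while the hands-on session lands on target. The upper threshold is one possible way to encode "happy, but not too happy"; a real program would pick its own signals and targets.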

Key Takeaways

To sum it up, let’s look at some key takeaways.

  • Training is an investment → in your organization and your people.
  • Evaluate the costs and benefits → to make sure you make the right level of investment.
  • Decide where to invest → this depends on the what and how of your organizational circumstances.
  • Walk the talk → apply SRE / DevOps principles to the training program itself for a consistent and reliable experience.

Want to learn more? Read Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program

This post was written by Sylvia Fronczak. Sylvia is a software developer who has worked in various industries with various software methodologies. She’s currently focused on design practices that the whole team can own, understand, and evolve over time.

Photo by Serena Repice Lentini