Ben Treynor Sloss keeps the trains running at Google. He has a long history in software engineering and is now the leader of Google’s SRE team. Vidhya Narayanan is the director of engineering in the SRE department at Google. She, like Ben, hadn’t had operational objectives prior to joining the SRE team at Google. They both have a deep appreciation for the way Google has successfully set up and scaled SRE within the company.
What Is SRE?
In short, site reliability engineering (SRE) is all about supplying the services needed to support running software. This service management might be handled by the developers, by a separate operations team, or in a hybrid model where members of the operations team are the developers who are automating service management. This last, hybrid model is what constitutes SRE.
The benefits of SRE are the same as those of most automation efforts. Automating repetitive tasks frees up people’s time so they can focus on work that cannot be automated.
How to Start Doing SRE
In order to implement SRE, you first need to make the intent to do SRE explicit. Set goals and hire the best software engineers and developers to take on the tasks involved in SRE.
Then, ask “How do we measure success for our customers?” Customers, in this case, are primarily development teams working on business software. But you can also think of the business as a whole as another customer in the sense that the running applications are critical to the business. Then, there’s the user customer, who wants the system to be always available. SRE lives at the crossroads between these stakeholders.
How Is SRE Different From Operations?
SRE differs from operations in that it is an engineering role: the “E” in SRE. Traditional operations teams are more about configuring and managing operational systems, whereas SRE teams take an engineering approach to the same types of problems.
SRE is similar to traditional operations in its objectives. It seeks to standardize the tools used to run applications, to minimize downtime, and to control costs. But the key difference is in how these objectives are achieved.
In a traditional environment where operations staff simply provisions servers or VMs for application users, the developers are left to sort out the details of running the application. This can lead to a disconnect between the objectives of the two functional groups.
SREs, on the other hand, are developers or engineers who use tools such as infrastructure as code to standardize the runtime environment. With tools like these, they help application developers focus on the product rather than on how to run the product.
SLOs and Microservices: Is It Enough?
Thinking about service-level objectives (SLOs) and how they apply to SRE, is it enough to have an SLO? It is not! You have to validate the system to ensure it actually meets the objective.
Although a system, let’s say a microservices system with a pub/sub, may be designed to meet the SLO, it might not actually hold up to its objective under load. Google does planned tests that tax the system to see how it actually holds up and recovers from the stress.
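To make the gap between a designed objective and a measured one concrete, here is a minimal sketch of checking load-test results against an availability SLO. The function name `meets_slo`, the 99.9% target, and the request counts are all assumptions made for illustration; they are not figures from the session or a description of Google's tooling.

```python
def meets_slo(outcomes, target):
    """Compare a measured success ratio against an SLO target.

    outcomes: list of booleans, True meaning the request succeeded.
    target:   required success ratio, e.g. 0.999 for "three nines".
    Returns a (measured_ratio, ok) tuple.
    """
    if not outcomes:
        raise ValueError("no requests recorded")
    measured = sum(outcomes) / len(outcomes)
    return measured, measured >= target


# Hypothetical load-test result: 10,000 requests, 12 of which failed.
outcomes = [True] * 9988 + [False] * 12
measured, ok = meets_slo(outcomes, target=0.999)
print(f"measured={measured:.4f} meets_slo={ok}")
```

In this made-up run the system looks healthy at a glance, yet 12 failures in 10,000 requests is enough to miss a 99.9% objective, which is the kind of result only an actual test will surface.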
Of course, it’s just as important to set a realistic SLO in the first place. Stuff happens, and the objective has to leave room not only for failures but also for things like experimentation.
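One way to see what "realistic" means is to turn an availability target into the amount of downtime it permits. The sketch below does that arithmetic; the function name `allowed_downtime_minutes`, the 30-day window, and the example targets are assumptions chosen for illustration, not values given in the session.

```python
def allowed_downtime_minutes(target, window_days=30):
    """Minutes of downtime an availability target permits over a window.

    target:      availability objective as a ratio, e.g. 0.999.
    window_days: length of the measurement window (assumed 30 days here).
    """
    if not 0 < target <= 1:
        raise ValueError("target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1 - target) * total_minutes


# Compare a few illustrative targets over a 30-day window.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability allows about {minutes:.1f} minutes of downtime")
```

Over 30 days, 99% leaves roughly 432 minutes, 99.9% about 43 minutes, and 99.99% only about 4 minutes. The tighter the target, the less slack remains for incidents, planned stress tests, or experiments.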
A Day in the Life of an SRE
Now that we understand SRE from an operational perspective, let’s take a closer look at the SRE role.
An SRE’s day-to-day workload is fast paced and involves many decision points. For example, who do you need to reach out to when there is a problem? And how can application developers break up a monolith?
SREs need to be involved in the architecture and design of applications and systems. They need to work closely with change management. Organizations using SRE and DevOps need to look at the broader picture and pull together standards to keep from duplicating efforts. These and other challenges take an array of skills and certain personality traits to handle well.
What Skill Sets Are Important for SRE Engineers?
So, now that you have a window into the day in the life of an SRE, you might wonder what skills an SRE needs. Google has a standard set of code rubrics that apply to application engineers and SREs. However, there are additional criteria for SREs. Here are a few:
That last point is particularly important at Google, as the advantage of being an SRE is you get to see through the organization’s boundaries and look at the system holistically.
Key SRE Points
SRE means keeping the trains running, but it isn’t traditional operations management—it’s engineering applied to operations. SREs need to be skilled engineers, and the best SREs will have some software development experience. The work is fast paced and requires people with good decision-making skills. An SRE’s goal is to make everyone happy, from the business to the application developers to the users. SREs do it all!
This session was summarized by Phil Vuollet. Phil leads software engineers on the path to high levels of productivity. He writes about topics relevant to technology and business, occasionally gives talks on the same topics, and is a family man who enjoys playing soccer and board games with his children.