How long does it take for your organization to deploy a new release to production? For some organizations, it takes a week to deploy to production, while other organizations only need two to three hours. No matter where you are on the scale, it's possible to reduce the time required for a new deployment to be promoted to production.
Many organizations try to reduce time to deployment by implementing CI/CD pipelines, GitOps, and internal platforms or by improving the developer experience. At first, everyone is happy with the new approach. Developers can deliver to production much faster.
However, with the GitOps approach or an internal platform, developers discover they are now responsible for monitoring their application and setting appropriate alerts. Wasn't this the task of an Ops team? In other words, as a consequence of adopting one of those approaches, the responsibilities of the application development teams shifted: they are now responsible for all visibility aspects of their applications.
Application developers didn't expect this sudden shift in responsibilities. The Ops team is getting phone calls and Slack messages from developers complaining that they have to instrument visibility. On top of that, services are failing, and the development teams have to deal with many outages.
So, how can we solve this?
Do You Even SRE?
We can fix the above problems with site reliability engineering (SRE). SRE uses software to manage IT operations and automate IT tasks that are often executed manually by the Ops team. An SRE team consists of people with a wide variety of skills, such as infrastructure engineers, developers, and Ops people. With such a rich skill set, the team can draw on expertise from different domains to improve system resilience and availability.
Most teams struggle with the following four areas:
- Dependencies: Do you know what you impact and what impacts you?
- Observability: Anything weird going on?
- Health: Are your app and its infrastructure healthy, sick, or dead?
- Traceability: Focus on the customer perspective.
An SRE team can help manage and clarify these four areas. When we look at the dependencies area, teams often know what's going on in their proximity. But when they look two or three product teams away, they have no idea what impacts them or how they impact others. For instance, team A probably doesn't know the day-to-day tasks team D performs or the specifics of team D’s commitment to team E. An SRE team can keep track of all these small details.
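One way to picture the dependency problem is as a graph walk. The following is a minimal sketch, not a description of any particular tool: the team names and the `DEPENDS_ON` map are hypothetical, and a real setup would load this data from a service catalog or similar source.

```python
from collections import deque

# Hypothetical dependency map: each team lists the teams it depends on directly.
DEPENDS_ON = {
    "team-a": ["team-b"],
    "team-b": ["team-c"],
    "team-c": ["team-d"],
    "team-d": ["team-e"],
    "team-e": [],
}


def upstream(team, graph):
    """Return every team that `team` depends on, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(team, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def downstream(team, graph):
    """Return every team that is impacted when `team` has a problem."""
    return {other for other in graph if team in upstream(other, graph)}
```

With this map, `upstream("team-a", DEPENDS_ON)` includes team D and team E even though team A only talks to team B directly, and `downstream("team-d", DEPENDS_ON)` lists the teams that an outage in team D would reach.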
Moreover, the SRE team manages observability. If something weird is going on, they can see where the problem occurs across all domains and take appropriate action. A development team typically doesn't have access to global visibility metrics and shouldn't need to be concerned with them.
Furthermore, an SRE team collects detailed health metrics about applications. For example, an application team can only see a green or red light on their dashboard indicating their application status, whereas the SRE team has access to much more detailed metrics.
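As an illustration of how detailed metrics can feed a simple status, here is a small sketch that maps a few measurements to "healthy", "sick", or "dead". The `HealthSample` fields and the default thresholds are assumptions chosen for the example, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """Hypothetical snapshot of one application's health metrics."""
    responding: bool        # did the app answer its health check at all?
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile response time in milliseconds


def classify(sample, max_error_rate=0.05, max_latency_ms=500.0):
    """Map detailed metrics to a coarse status (thresholds are illustrative)."""
    if not sample.responding:
        return "dead"
    if sample.error_rate > max_error_rate or sample.p95_latency_ms > max_latency_ms:
        return "sick"
    return "healthy"
```

The application team might only see the resulting status, while the SRE team keeps the underlying numbers to explain why an application moved from "healthy" to "sick".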
And lastly, the SRE team implements traceability to give application developers more context to debug problems. Traceability is pivotal to shortening the time it takes to resolve bugs.
So, why do we want to address those four key areas?
Why Implement a Site Reliability Engineering Team?
SRE teams help you understand your systems’ health and improve reliability metrics: they work to reduce the mean time to repair (MTTR) and to increase the mean time between failures (MTBF).
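To make these two metrics concrete, here is a minimal sketch of how they could be computed from a list of incidents. It assumes one common definition (MTTR as average repair time, MTBF as uptime in an observation window divided by the number of failures); the incident timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def total_downtime(incidents):
    """Sum the duration of all (start, end) incident pairs."""
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents):
    """Mean time to repair: average incident duration. Lower is better."""
    if not incidents:
        raise ValueError("MTTR is undefined without incidents")
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, window_start, window_end):
    """Mean time between failures: uptime per failure. Higher is better."""
    if not incidents:
        raise ValueError("MTBF is undefined without incidents")
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


# Hypothetical data: two outages in a ten-day observation window.
incidents = [
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 11, 0)),    # 60 minutes
    (datetime(2024, 1, 6, 14, 0), datetime(2024, 1, 6, 14, 30)),   # 30 minutes
]
window = (datetime(2024, 1, 1), datetime(2024, 1, 11))

print(mttr(incidents))            # 0:45:00
print(mtbf(incidents, *window))   # 4 days, 23:15:00
```

Other definitions exist (for example, measuring MTBF from the start of one failure to the start of the next), so the exact numbers depend on which convention a team agrees on.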
By implementing SRE practices, Jeremy Castle and Andy Hinegardner at State Farm reduced the recovery time from one or two hours to 54 minutes. In a second incident, they reduced the recovery time further to nine minutes. They made this progress by measuring key health metrics for their systems and applications and using those measurements to make informed decisions.
Conclusion
When you think you're finished improving your DevOps approach, you might find out you've missed some key areas. While implementing CI/CD pipelines, GitOps, and an internal platform might look promising, your application developers could complain about outages and a lack of visibility.
Castle and Hinegardner found that they were missing four key areas (dependencies, observability, health, and traceability), which an SRE team can address. After creating an SRE team, Castle and Hinegardner saw recovery times drop sharply.
In short, the Ops in DevOps is a verb, not a noun. There’s always room for improvement. Ops isn’t a one-and-done thing. You always need to look for ways to improve your DevOps setup. Maybe site reliability engineering is what will take your organization to the next level.