All Day DevOps - Blog

Observability Made Easy

Written by Erik Dietrich | Nov 6, 2019 8:01:55 PM

When Christina Yakomin (@SREChristina) started her journey toward synthetic monitoring, she owned a platform for containerized applications and all of the underlying infrastructure. But she didn't own the applications themselves that were deployed to that infrastructure. This consisted of some application servers, cache servers, and web servers.

When she came onto the team, they had robust monitoring in place.

Screenshot from Christina Yakomin’s “Observability Made Easy” presentation.

The Problem: Defining Healthy

But, in spite of that, it was actually pretty difficult to define what "healthy" was going to mean. For the platform, they decided to broadly consider that anything below a 500 is healthy, and that anything faster than three seconds would be a healthy response time.

They defined alerts for all of this, and everything was great… at least until she was on call for the first time. She was awakened by many calls, including false alarms.

So what gave?

Monitoring Percentages

In her case, it was that a small number of apps represented a large portion of the total traffic. So, anything happening to those apps disproportionately skewed the aggregate metrics and sent her a false alarm.

What to do? Well, next up was to monitor the percentage of healthy services.

Screenshot from Christina Yakomin’s “Observability Made Easy” presentation

But the false alarms continued. Why?

Well, because a small number of bad services were throwing off the service level.

Synthetic Monitoring

Next up, they decided to look into synthetic monitoring: artificially generated traffic to an application, mimicking the patterns of a typical user. This has a few advantages:

  1. Traffic is controlled.
  2. It complements real-traffic monitoring.
  3. It mimics real user patterns.

Here was their first iteration of synthetic monitoring:

Screenshot from Christina Yakomin’s “Observability Made Easy” presentation.

Providing this REST API for the synthetic monitoring allowed her to generate this interesting visualization of traffic through the system.

Screenshot from Christina Yakomin’s “Observability Made Easy” presentation.

The thickness of the line represents the traffic volume flowing from point A to point B, through the system. This is a good example where the load balancer is working and the lines are even. The darkness of the line indicates latency, per the legend on the side of the screen.

So, at a glance, you can see where requests are going, how evenly, and how long it's taking. Alerts can be more sensitive while generating fewer false alarms.

However, there was a flaw. The API and the monitoring tool were in the same container. So, if anything happened to the container, it also happened to the monitoring capability.

Isolation

So she needed a layer of isolation. She needed to separate the API from the monitoring.

Here's the result of that.

Screenshot from Christina Yakomin’s “Observability Made Easy” presentation.

On the left are CloudWatch rules, which, on a chron schedule invoke the state machine and goes to two lambda functions. The first one talks to the secrets manager to grab credentials, and uses them to talk to the auth provider, which supplies a token. This is passed along to the second lambda function, which makes the REST request via a Python script. It makes the request and stores the results in the CloudWatch log output.

Why is this better? Well:

  1. It's isolated, so it can be available independent of the platform's availability.
  2. It's serverless, so it's cheap and even more highly available.
  3. It's global, which allows them to source their monitoring from anywhere that runs AWS Lambda, allowing for more accuracy in estimating client experience for global clientele.

So why not go with a vendor solution?

Well, vendor solutions are great, but she already had some of the components for this in house. And the products that she evaluated had some shortcomings, such as a lack of auth integration, expensiveness, and lockdown by a security team. With nothing ideal, and the simplicity of the Python approach, it seemed like a good idea to do it this way.

So if you find yourself in a situation like Christina's, where a vendor solution just won't work, perhaps hers can serve as the inspiration for creating what you need.

 

This post was written by Erik Dietrich. Erik is a veteran of the software world and has occupied just about every position in it: developer, architect, manager, CIO, and, eventually, independent management and strategy consultant. This breadth of experience has allowed him to speak to all industry personas and to write several books and countless blog posts on dozens of sites.

Photo by Zachary Spears