SLO Alerting Strategies - NADOG DevOps Webcast

The February 3, 2021 North American DevOps Group DOGCAST event featured Yuri Grinshteyn, Site Reliability Engineer @ Google

The full video and transcript from that event are below.

Yuri has also published an article on this subject:

“How to Alert on SLOs”

Yuri Grinshteyn

Sep 28, 2020 · 8 min read

I’ve spent quite a bit of time looking at defining and configuring SLOs in Service Monitoring. And lately, I’ve been getting lots of questions about what happens next. Once the SLO is configured, folks want to know how to use alerting to be notified about potential, imminent, and in-progress SLO violations. Service Monitoring provides SLO error budget burn alerts to accomplish just that, but using these alerts is not always intuitive. I set out to try these out for myself and document what I found along the way. Let’s see what happens!

Continue to the full article by Yuri Grinshteyn

DOGCAST – February 3, 2021


Jim Shilts (08:30):
We’re gonna jump into our feature talk. Again, if you have questions during the talk, feel free to type them into the chat. You can take yourself off mute if you’ve got an opening. And I believe our speaker’s going to pause a couple of times to ask for questions as well. We’ll have some Q and A at the end. So we’d like to introduce everyone to our speaker, Yuri Grinshteyn. He’s a site reliability engineer at Google, and he’s going to talk to us about “SLO Alerting Strategies.”

Yuri Grinshteyn (09:09):
Hey, Jim, thank you so much for having me. Can you confirm that you can see my screen and that you can hear me?

Jim Shilts (09:13):
We are very successful. We were sharing screens, seeing videos, all that good stuff.

Yuri Grinshteyn (09:19):
All right. We’re off to a strong start. Let’s try to keep it going. So thank you, everyone. I really appreciate you joining me today. By way of introduction, my name is Yuri Grinshteyn. I’m an SRE at Google. I’m part of what’s called a customer reliability engineering team. So unlike Google SREs, who work on Google systems, we work with Google Cloud customers to help them achieve the appropriate level of reliability for their systems. If you are interested in more of what I have to say, you can find me on Medium and on YouTube as well. So today we’re going to be talking about alerting on SLOs, right? But where I thought I would start is with some fundamentals, right? So I’m part of a site reliability engineering team, and where I thought I would start is by talking about reliability: how do we actually define reliability?

