The February 3, 2021 North American DevOps Group DOGCAST event featured Yuri Grinshteyn, Site Reliability Engineer @ Google
The full video and transcript from that event are below.
Yuri has also published an article on this subject:
“How to Alert on SLOs”
I’ve spent quite a bit of time looking at defining and configuring SLOs in Service Monitoring. And lately, I’ve been getting lots of questions about what happens next — once the SLO is configured, folks want to know how to use alerting to be notified about potential, imminent, and in-progress SLO violations. Service Monitoring provides SLO error budget burn alerts to accomplish just that, but using these alerts is not always intuitive. I set out to try these out for myself and document what I found along the way. Let’s see what happens!
Continue to the full article by Yuri Grinshteyn
DOGCAST – February 3, 2021
Jim Shilts (08:30):
We’re gonna jump into our feature talk. Again, if you have questions during the talk, feel free to type them into the chat, or you can take yourself off mute if you’ve got an opening. I believe our speaker is going to pause a couple of times to take questions as well, and we’ll have some Q&A at the end. So we’d like to introduce everyone to our speaker, Yuri Grinshteyn. He’s a site reliability engineer at Google, and he’s going to talk to us about “SLO Alerting Strategies.”
Yuri Grinshteyn (09:09):
Hey, Jim, thank you so much for having me. Can you confirm that you can see my screen and that you can hear me?
Jim Shilts (09:13):
We are very successful. We’re sharing screens, seeing videos, all that good stuff.
Yuri Grinshteyn (09:19):
All right, we’re off to a strong start. Let’s try to keep it going. So thank you everyone, I really appreciate you joining me today. By way of introduction, my name is Yuri Grinshteyn. I’m an SRE at Google. I’m part of what’s called the Customer Reliability Engineering team, so unlike Google SREs who work on Google systems, we work with Google Cloud customers to help them achieve the appropriate level of reliability for their systems. If you’re interested in more of what I have to say, you can find me on Medium and on YouTube as well. So today we’re going to be talking about alerting on SLOs, right? But where I thought I would start is with some fundamentals. I’m part of a site reliability engineering team, so where I thought I would start is by talking about reliability: how do we actually define reliability?
Well, reliability certainly includes, you know, preventing outages and crashes like these that make their way into the papers and can have an immediate impact on your business, right? But the cost of unreliable systems actually extends far beyond just the big outages. There's one study that shows that a team of 10 engineers will spend about $300,000 a year worth of their time on things like troubleshooting and incident management. Similarly, IDC says that investing in reliability can improve or protect north of a million dollars of revenue each year. So the point I'm trying to make here is that reliability is a feature of your system. It's a key feature of your service, just like all of its other features, and just like all the other features, it should be prioritized and worked on, not just taken for granted. But the next obvious question is, well, how much reliability does your service need?
Yuri Grinshteyn (11:00):
How reliable should it be? So what we'll do today from here is try to answer that question: how reliable should your service be? How do you actually quantify and measure reliability? I'll be talking about things called service level objectives, or SLOs, and error budgets. And from there, I'll actually get into how you use them: how do you generate alerts off of those? And as Jim said, you can put questions in the chat; we'll try to keep an eye on those, and I'll answer any questions you might have at the end as well. So let's get into it. I'm going to start by talking about error budgets, which are essentially the animating principle of site reliability engineering, or SRE.
Yuri Grinshteyn (11:42):
Right? So this is basically how we measure reliability. There's a kind of simple way to do this, which is: how much of a given time window was spent when things were good? This essentially goes back to server uptime, right? This is pretty easy. We're all familiar with it, it's easy to understand, and it's easy to measure for things that are continuous and binary, like a server: a server is either up or down. But when we get into distributed systems and more complex systems, we need definitions that are a lot more flexible. We need to be able to handle things like: is your service partially up? Is the quality of your service partially degraded? If you have a load-balanced service and some fraction of the instances behind the load balancer are down, how do you measure that?
Yuri Grinshteyn (12:35):
Right. But you still have the basic requirement, which is that you need to define reliability and you need to measure it. So these are the things we're going to use to measure the reliability of our service and set targets for it. The service level indicator, or SLI, is the actual measurement. It's the metric that we're going to measure and compare against a threshold to tell us, for any moment in time, whether the service level of our service is acceptable or not. The service level objective, or SLO, is the goal for the aggregate of our indicator over a window of time. It's our mathematical bound, essentially, with a top value of a hundred percent, over a given window of time, which we call our compliance interval: what is my target for availability, or reliability rather?
Yuri Grinshteyn (13:27):
And then the amount by which I exceed my target is what's called my remaining error budget. The SLA, the service level agreement, is mostly relevant to the business itself. It essentially defines a contract between you and your customers that says what you're going to do if you fail to meet the objective; there could be financial penalties, things like that. But the consequences are defined when both parties in the agreement agree to the SLA. All right. So now that we understand what these things are, the next step is to actually set that target: what is our reliability goal? I'm sure you've heard the terms three nines, four nines, when we talk about the availability or reliability of a service, but it's really about how much downtime we're willing to tolerate.
Yuri Grinshteyn (14:13):
Right. So for example, saying a service is three and a half nines available means its availability, or reliability, is 99.95% over a given time window. That means we can basically take a 22-minute outage over the course of a 30-day window and still be within compliance. But in a complex distributed system, it's rarely about being hard down; that number assumes a hundred percent outage. This also gives you a way to calculate and measure your reliability when your service is partially degraded: you can serve a 1% error rate from your service for 36 hours and still be within your SLO, your service level objective. So this is how we use SLOs to define targets for our reliability. Now, you may be thinking, okay, this is all good and well, but my service needs to be available all the time.
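The downtime arithmetic Yuri is doing here is easy to reproduce. A minimal Python sketch (the function names are mine, for illustration only):

```python
def allowed_downtime_minutes(slo: float, window_days: float = 30) -> float:
    """Minutes of total outage tolerable over the window at a given SLO."""
    return (1 - slo) * window_days * 24 * 60

def degraded_hours(slo: float, error_rate: float, window_days: float = 30) -> float:
    """Hours a partial error rate can be sustained before the budget is gone."""
    return allowed_downtime_minutes(slo, window_days) / error_rate / 60

# 99.95% ("three and a half nines") over 30 days: about 22 minutes hard down
print(round(allowed_downtime_minutes(0.9995), 1))  # 21.6
# ...or a 1% error rate sustained for 36 hours
print(round(degraded_hours(0.9995, 0.01), 1))      # 36.0
```

The second function is just the first one rescaled: a 1% error rate burns budget at one hundredth the rate of a full outage, so the tolerable duration is 100 times longer.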
Yuri Grinshteyn (15:04):
So where's a hundred percent? Well, one of the key realizations that the founder of SRE here at Google, Benjamin Treynor Sloss, had is that a hundred percent is really the wrong reliability target for just about anything. And this is where error budgets come from. There are two main reasons why a hundred percent is always the wrong target. The first one is that there's an optimal level of reliability for your system, where you balance the cost of reliability against the happiness of your users. Most users are happy enough at a specific inflection point; taking your reliability beyond that inflection point comes at an increasingly high cost, but it doesn't actually improve the happiness of your users that much. The second major reason is simply cost. The rough rule of thumb that we use when we approximate the cost of reliability is that to increase the reliability of your service by what we call one nine...
Yuri Grinshteyn (16:07):
So to go from 90 to 99%, from 99 to 99.9, from 99.9 to 99.99, you're looking at a cost increase of about tenfold. That's going to include things like on-call staffing, improvements in monitoring and instrumentation, improvements in alerting to shrink the time to detect, more people and more time to work on things like automation, self-healing systems, resilient systems, and incident response, and more time and engineering to work on things like rollbacks and other effective mitigation strategies that actually allow you to quickly detect and resolve a specific issue within the very small window that you have as these reliability levels go up. Ultimately, what we want to do is pick a level of reliability that's going to meet the needs of our users: are people going to notice when our reliability degrades beyond a certain point? And really, this is a product management decision, right?
Yuri Grinshteyn (17:09):
So the amount of unavailability that you're willing to tolerate, the difference between a hundred percent and the level that you set for your system, is your error budget. Then you need to measure that. Once you know how much of your budget you have left at any given time, you're able to choose how you actually spend that error budget: you spend it on things like rolling out new features, doing experiments, doing upgrades, all those kinds of things. This now gives you an objective way to measure reliability, and whether you're actually meeting the goals of a particular service. And when you start measuring and enforcing error budgets, you'll find that everyone who is a stakeholder in the reliability of a particular service actually has to balance velocity with reliability. Everyone starts working together, because everyone is working off a shared pool of unavailability.
Yuri Grinshteyn (17:58):
And it also gives the folks who are actually responsible for protecting and defending that error budget, whether that's SRE, DevOps, the on-call, whoever that may be, a way to know what to focus on at any given time. You can ask: how much error budget do we have available? If it's a lot, we can work on velocity; if it's not a lot, we can work on reliability. And ultimately this is about pursuing maximum velocity within your reliability targets. We want to make all changes basically standard processes, as long as we have error budget available. This actually increases the speed at which you're able to make changes, because the only decision you really need to make when you want to make a change is: hey, do we have error budget available? If we do, and we know what the potential impact of the change to the error budget is, we can go ahead and do the thing we want to do. If not, we have to wait until we regain some error budget and the service is healthy for a period.
Yuri Grinshteyn (18:57):
All right. So far we've covered two main points that I hope you've absorbed. The first one is that the reliability of your service, of your system, is a key feature of it, and it requires a thoughtful approach. It requires work, design decisions, architectural decisions, just like all the other features of the system. And because it's a feature, you should measure how successful you are at delivering that feature, and you should set a target for how successful you need to be. Then, when you are not successful, for any reason, you should have alerts that let you know: hey, you're having an issue that could potentially impact your error budget or your SLO. And that's where alerting comes in, so that's what we're going to talk about next. I'm going to talk about alerting, which essentially is what you do when there's a problem.
Yuri Grinshteyn (19:46):
Right. Monitoring and alerting obviously go hand in hand. You need monitoring; monitoring in the general sense is: what metrics are you collecting about your system? Part of monitoring is going to be monitoring the service level of your system: how do you decide whether the service is healthy enough? How do you aggregate that over time? How do you convert that into an error budget, and all those kinds of things? And then you need to have high-quality alerts that fire when your error budget is threatened. One of the key things I want to emphasize here is that you should alert on symptoms, meaning you should alert on the things that your users actually experience. You don't want to alert on infrastructure things; you don't want to alert on what we call causes.
Yuri Grinshteyn (20:36):
So in this very simple example, the cause of a particular error may be that you're having an issue with the database (excuse me, sorry, there's a sneeze coming on and I'm trying to fight it). But you don't actually want to alert on the fact that you're having an issue with the database server, because there could be things between the database server and the users that make this completely invisible to the users. You may have redundancy in your backend, you may have caching. But if there's a problem that's manifesting itself to the users, that's the thing you want to alert on. You don't want to overload your on-callers, your responders, with too many alerts, too many pages. That results in alert fatigue and ignored alerts, and then when a really important alert comes in, they'll actually miss it.
Yuri Grinshteyn (21:23):
If you want to read more about this, the Site Reliability Engineering book that we published a few years ago has a chapter on this: chapter six, called "Monitoring Distributed Systems," basically has a whole section on alerting on symptoms versus causes. The idea is that you want to alert specifically on things that actually impact your users, on the things that degrade user happiness, rather than on various things in your infrastructure, which may or may not be manifesting themselves as actual user-facing issues. So how do you actually do that? Well, once again, we turn to the literature. This time we turn to the next book that we released, The Site Reliability Workbook, chapter five. Basically everything I'm going to be talking about from here is a summary of that chapter, which was written by a bunch of folks. One of them, Stephen, who's cited as the main author, is actually someone on my team, so I'm honored to be working on the same team as some of these folks. This is where we talk about how you actually take your SLOs, your service level objectives, and use them to trigger alerts to let you know that something is actually wrong. So, is there a question?
Yuri Grinshteyn (22:42):
Okay. As covered in that chapter, there are four considerations that go into your alerting strategy. The first one is precision. Precision is essentially the measure of what fraction of the alerts that are triggered are triggered by a real or significant event. If I have an alert, is it because something is really going on or not? What is your rate of false positives is another way to think about it. Recall is the opposite side of that coin: what fraction of events result in an alert going off? In other words, what fraction of all of the events that you want to be alerted on do in fact get an alert? Detection time is basically the time between when the event actually starts and when you receive an alert. And then reset time is the opposite of detection time: once the event is over and your SLI is recovering, how long does it take before the alert stops firing?
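The first two of these four measures are simple ratios. A small Python sketch, with counts that are invented purely for illustration:

```python
def precision(significant_alerts: int, total_alerts: int) -> float:
    """Fraction of fired alerts that corresponded to a real, significant event."""
    return significant_alerts / total_alerts

def recall(alerted_events: int, total_events: int) -> float:
    """Fraction of significant events that actually produced an alert."""
    return alerted_events / total_events

# Hypothetical month: 20 alerts fired, 15 of them for real incidents,
# out of 18 real incidents that occurred.
print(precision(15, 20))         # 0.75
print(round(recall(15, 18), 2))  # 0.83
```

Tightening a threshold to raise one of these numbers typically lowers the other, which is exactly the trade-off described next.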
Yuri Grinshteyn (23:43):
And as you think about the different thresholds that you set and the different things that you alert on, you have to balance all of these. So for example, as you improve recall, you may realize that your precision degrades, because your alerts become too noisy or too sensitive. As you improve your detection time, you may realize that it degrades your reset time. So some of these things become trade-offs against one another. And ultimately, there's no right value, no number I can give you for any one of these that tells you "this is the right value for you." It really depends on what you're trying to get at. What you're looking for is the right balance of how sensitive and how precise your alerts are, while minimizing operational load, which is the number of alerts you actually receive and respond to.
Yuri Grinshteyn (24:34):
Okay. And what we found is that the best way to balance these considerations is by alerting on error budget burn rate. This is where a little bit of math comes in, so let's get into it. Burn rate is a measure of how fast, relative to your SLO, your service consumes error budget. So for example, the line in blue here: you have a 30-day SLO, and if your service consumes all of that error budget at the end of that 30-day window, that's a burn rate of one. You've burned all of your available error budget in the time that you set aside to burn it. A service that consumes all of the error budget halfway through the evaluation period has a burn rate of two. A service that consumes all of the available error budget in three days has a burn rate of 10, because you basically burned it in one tenth of the time that you had, right?

Yuri Grinshteyn (25:35):
So these numbers are your error budget burn rate. We calculate the burn rate by essentially looking at three main inputs. We need to know the SLO itself; in this case it's the 99.9% from the example. We need to know the evaluation period: over what window do we measure that SLO? And we need to know how much of our error budget is consumed, and the window over which that burn happens, which is essentially your alerting window. When you use these inputs, you get this equation: our burn rate is the amount of error budget consumed by our service, multiplied by the evaluation period, divided by the alerting window. Okay, so this is how we calculate our error budget burn rate. Now, that's all good and well, but the next question you can ask is: well, that's fine, but what does it actually mean?
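The burn rate examples above are easy to verify against the formula as stated (error budget consumed, times the evaluation period, divided by the alerting window). A minimal Python sketch:

```python
HOURS_PER_DAY = 24

def burn_rate(budget_consumed: float, eval_period_hours: float,
              alerting_window_hours: float) -> float:
    """Rate of error budget consumption relative to the SLO evaluation period."""
    return budget_consumed * eval_period_hours / alerting_window_hours

# All of the budget, spent over the full 30-day window: burn rate 1
print(burn_rate(1.0, 30 * HOURS_PER_DAY, 30 * HOURS_PER_DAY))  # 1.0
# All of the budget, gone halfway through: burn rate 2
print(burn_rate(1.0, 30 * HOURS_PER_DAY, 15 * HOURS_PER_DAY))  # 2.0
# All of the budget, gone in 3 days: burn rate 10
print(burn_rate(1.0, 30 * HOURS_PER_DAY, 3 * HOURS_PER_DAY))   # 10.0
```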
Yuri Grinshteyn (26:31):
How much error budget should my service burn before I fire an alert? What should I set all of these things to? If you're just starting out with this, there's a basic recommendation, again in chapter five of the SRE workbook, that gives you really three main alerts: two alerts for fast error budget burn, and one for slow error budget burn. That would be 2% error budget consumption in one hour, and 5% error budget consumption in six hours; those two would be your paging alerts, the ones that actually generate pages. And then another alert, where you've consumed 10% of your error budget over three days, as a good baseline for ticket alerts, essentially your non-paging alerts. Through the combination of these, you can catch high error rates pretty quickly, and you can also catch lower error rates that manifest themselves over time, without overwhelming your on-callers with alerts.
Yuri Grinshteyn (27:31):
So this is a good place to start to balance the four considerations I talked about: precision, recall, detection time, and reset time. So what does that mean? Going back to error budget burn rate, let's walk through a specific example. Let's say our service has an availability SLO of 95%, and we measure that over a rolling 28-day window. And we want to create an alert that lets us know that we've consumed 2% of our error budget in the last hour. So we go back to our equation and calculate the burn rate that we would hit in that time, which is 13.44. That is the error budget burn rate threshold that we would need to exceed in order to trigger a fast error budget burn alert for this specific service.
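Plugging the numbers from this example, and the workbook-style baselines above, into the same formula gives the concrete alert thresholds. A quick Python sketch (the 30-day figures assume a 30-day evaluation window):

```python
def burn_rate_threshold(budget_consumed: float, eval_period_hours: float,
                        alerting_window_hours: float) -> float:
    """Burn rate at which the given budget fraction is consumed in the window."""
    return budget_consumed * eval_period_hours / alerting_window_hours

# 2% of the budget in 1 hour, 28-day rolling window: the 13.44 from the talk
print(round(burn_rate_threshold(0.02, 28 * 24, 1), 2))   # 13.44
# The baseline recommendations, for a 30-day window:
print(round(burn_rate_threshold(0.02, 30 * 24, 1), 2))   # 14.4  (fast burn, page)
print(round(burn_rate_threshold(0.05, 30 * 24, 6), 2))   # 6.0   (fast burn, page)
print(round(burn_rate_threshold(0.10, 30 * 24, 72), 2))  # 1.0   (slow burn, ticket)
```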
Yuri Grinshteyn (28:22):
For a specific place where we actually implement this, we can look at how alerting setup works, for example, in Google Cloud. Google Cloud has SLO monitoring, and this is where the error budget burn rate becomes important. To configure the alert, we'll need to specify the lookback window, which is the alerting window; in this case it's 60 minutes. We also need to specify the burn rate threshold that we're going to hit during that window, which we just calculated to be 13.44. And now when our service has an issue, we'll see the error budget decrease, we'll see the burn rate go up, and we'll get an alert that looks like this: it shows us what the burn rate for a given SLO was over a particular incident. Note that our alerting here is specifically letting us know that we are burning error budget at a rate higher than we want to.
Yuri Grinshteyn (29:10):
Right. There's nothing here that's telling us why, what's actually going on. And so this is the difference between monitoring, which tells us that there is a problem, and observability, which would tell us why, what the problem is. There's a separate talk, which a person better equipped than I could give you, on the difference between monitoring and observability. But I very much want you to avoid this trap of alerting on root causes, where you want an alert for every possible permutation of conditions so it can quickly tell you what the problem is. That will be very noisy; you'll get alert fatigue, and you'll suffer from operational overload. Burn rate alerts, on the other hand, are the kinds of alerts that actually let you know: hey, our users are having a problem, our users are unhappy, and we need to know about it.
Yuri Grinshteyn (29:54):
Okay. So that basically summarizes what I wanted to tell you today. The things that I talked about were, again: reliability is a key feature of your service. You should measure it, and you should set targets for it. When your service is not meeting those targets, you can create alerts based on those SLOs, using error budget burn thresholds, to let you know that something is actually wrong. And then you can supplement those alerts with dashboards, with observability data, to help you figure out what is actually happening and why that problem may be taking place. So I'd love to take any questions that folks have at this point.
All right. Thank you, Yuri. I don't see any questions in the chat. You can still put your questions into the chat if you'd like, or this is the time when you can take yourself off of mute and ask your question. The question is from James. Hey, James.

So we actually do monitor a lot of things that are more on the infrastructure side, because those are the easy ones you kind of understand. But what are some of the things that you would recommend we monitor for? Are we looking at synthetics? Just some examples.
Yuri Grinshteyn (31:13):
Yeah, it very much depends on what your service is, what it does, and how folks consume it. The canonical textbook example, as you were talking about, is an HTTP request/response service, in which case there are two of what we call the golden signals that are absolutely the first two things you should monitor: your error rate and your latency. If you have an HTTP request/response service, then it's: what fraction of those requests are met with 500s? That's your error rate. And then: what's your 99th percentile latency for the responses? And as you monitor those, you set targets for them. For error rate, you could say that your SLO is going to be, you know, 99% of my requests have to be responded to with a 200 over a 30-day window. For latency...
Yuri Grinshteyn (32:07):
You could say 95% of my requests have to be served within 500 milliseconds over that same 30-day window. And now you have two SLOs that will probably get you, you know, 95% of the way there. Nice. All right, thanks. So for other kinds of services, for example a data pipeline, you can approach it in a very different way. There's a chapter in these books on monitoring data pipelines, but you're going to be looking for different indicators: things like data freshness, data correctness, throughput, and possibly what we call data lag, which is how long it takes from when a data element enters the pipeline until it leaves. So there's a different set of primary indicators you would look at for a data pipeline service, but for your basic request/response service, availability and latency are absolutely the first two things.
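For a request/response service, the two SLIs Yuri describes can be computed directly from request records. A toy Python sketch (the sample traffic and the 500 ms threshold are made up for illustration):

```python
# Each record: (http_status, latency_ms). Invented sample traffic.
requests = [(200, 120), (200, 340), (500, 95), (200, 80), (200, 610),
            (200, 150), (503, 2000), (200, 90), (200, 200), (200, 110)]

# Availability SLI: fraction of requests not answered with a 5xx
good = sum(1 for status, _ in requests if status < 500)
availability_sli = good / len(requests)
print(availability_sli)  # 0.8

# Latency SLI: fraction of requests served within the 500 ms target
fast = sum(1 for _, latency_ms in requests if latency_ms <= 500)
latency_sli = fast / len(requests)
print(latency_sli)  # 0.8
```

In practice these ratios would come from your monitoring system's aggregated counters rather than raw request logs, but the definition is the same: good events divided by total events.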
Jim Shilts (32:59):
That's great. We're interested in actually both, so that makes sense. Thanks. Yeah, of course. Thank you, James. What other questions do we have for Yuri? Oh, there's a question in the chat from Song Yu: what does error budget replace? I'm curious, based on history, what did business and tool owners use before this got developed? What didn't work in the past that caused the creation of this new terminology?
Yuri Grinshteyn (33:28):
I've stopped sharing so I can actually see the questions now. What did error budget replace? Yeah, good question. Error budgets basically replace your basic "let me know if my error rate goes up in the last five minutes" alert, or "if my latency goes up in the last five minutes." You would still be using the same signals, but you would just alert on the very recent condition: what's going on within the last five minutes, or the last ten minutes, or the last half hour, whatever the case may be. The thing that doesn't work about that, and this goes back to why we at Google run things the way that we do, is that it's impossible to scale services to, you know, quote-unquote Google scale without drastically reducing the number of alerts that you get, such that the operational load doesn't become completely prohibitive.
Yuri Grinshteyn (34:29):
Right. You know, I hate to resort to clichés, but at this point we have ten services, each of which has a billion active users. We couldn't do that if we didn't do things in a different way. And what we realized is that we had to spend a lot of our time investing in automation and self-healing systems and things like that. The only way we could do that, without massively growing our operational staff to literally millions of people, was to invest a lot of engineering time in automation. And so what that meant was that we had to treat operations like a software engineering problem, and we had to really manage how much operational load these folks were going to be taking on. That's what resulted in us starting to use error budgets: saying, we know a hundred percent is not the right target, which means we don't need to know about every little glitch, every little flicker.
Yuri Grinshteyn (35:24):
We just need to know when things are degrading. And we do that by asking: what are the actual reliability requirements for a given service, beyond which users will really be unhappy? That sets our SLO. And how quickly can the health of the service degrade before we really need to know about it? That's what gives us the error budget and the error budget burn rate. So error budgets replace your more traditional "let me know if latency goes up over the last five minutes" alerts, and they're a way to basically allow us to scale traffic, scale users, scale the complexity of our systems in multiple dimensions: the number of services, the complexity of what each of these services does, and the traffic to each of these services, without linearly scaling the operational staff, or really the operational load, that comes with them.
Jim Shilts (36:22):
Thanks for the question. So what other questions do we have? Again, you can take yourself off of mute and ask your question, or you can type your question into the chat. Okay, go ahead.

How do you recover from a situation where, for the last several occurrences, you've had a bad error budget, and you finally got a resolution to it? How do you gain back the trust that those services are going to be more reliable? How do you regain your reputation, that the services are back to normal running order?
Yuri Grinshteyn (37:07):
Yeah. Ideally, you're using alerting that lets you know that you're burning your error budget long before you're actually out of SLO. The reason we use these error budget burn alerts is that they're basically a way for you to say: let me know when I'm heading in the wrong direction, but long before I'm actually out of SLO. So you're still within SLO; your service is still meeting its reliability target, things are just not doing as well as they should be. And so you get this alert, you shift into incident response mode, and you mitigate the issue through whatever means are available to you. Maybe it's a rollback, maybe it's traffic routing, maybe it's emergency capacity, whatever the case may be. And then you get to work on addressing the root cause and writing the retrospective and all that sort of thing.
Yuri Grinshteyn (38:01):
But because you've hopefully caught the incident before you were actually out of SLO, and you were able to mitigate the incident quickly, or relatively quickly, you still have some error budget remaining for your evaluation period. The way we actually put this into practice, or one of the ways we put this into practice, is that we have a weekly meeting, what we call a production meeting, where both the folks that are responsible for feature development and those who are responsible for operations, whether it's SRE or DevOps or whatever you want to call this team, get together. In that meeting, we review what happened over the last week: when did our pager go off, what was the root cause, and what action items did we identify that need to be implemented in order to prevent that particular issue, or that particular class of issue, from recurring?
Yuri Grinshteyn (38:48):
And this is the biggest part of it, honestly: folding that work right back into the engineering sprint, or however you plan your engineering work. So it's not just about saying, okay, the pager went off, there was a problem, we mitigated it or fixed it, let's move on. It's always about taking what you learned from these incidents and applying it moving forward, so that over time you're actually making the service more resilient. You're improving its reliability, or you're maintaining the desired level of reliability while decreasing operational load. And so the way you regain trust is by essentially shifting the work of the entire team, both the operations team, whatever that means in your context, and the development team, or the one DevOps team there is for the service, from feature work to reliability, to make sure that you get that error budget back, that your service recovers its error budget because you don't have more incidents. So it's really important to have tight alignment between the folks who are carrying the pager and the folks who are building the features and doing deploys, so that the things that come out of incidents, the things you learn when the pager goes off, actually get acted upon. Does that answer your question? I know that was a bit long-winded. Thank you very much. Of course. I think at the same time someone else had a question as well.
Yeah, here is James again. So, less tactical, more about process and communication. One of our challenges, coming out of an R&D department, is: hey, I deliver features, let's just get them out there running. And then they come out and say, hey, can you do three nines, because somebody wants it. And we're like, well, let's have a bigger conversation. What are some of the important things that should be brought up in these conversations? And this literally happened today, so, perfect timing.
Yuri Grinshteyn (40:56):
Yeah, I'm so glad. Sorry, maybe my camera's a little wonky. So how do you actually have that conversation? Again, because I work with cloud customers to help them with reliability issues, this is the kind of thing I deal with every day. I work with customers and they say, hey, we're building on, in my case, Google Cloud, but it really doesn't matter, and our service needs to be five nines available, right? And my very first question is always going to be: why? How did you arrive at that number? Where does that number come from? And from there, typically it's a matter of helping the customer, whether the customer is internal or external, understand the implications of such a tight SLO. The chart I showed before, which basically shows you your unavailability window within a given time interval at certain availability levels, is often quite helpful here, and it's in the SRE book, so you don't have to go far to find it. At the higher levels, a service that's five nines can only afford something like five minutes of downtime a year, right?
Yuri Grinshteyn (42:11):
There are not many services that can detect an issue and restore the service within five minutes. And so it's a matter of asking: what are the practical considerations that go into achieving these high levels of reliability? It basically means that your system needs to be, most of the time, globally distributed, elastic, if not infinitely elastic then certainly practically infinitely elastic, and self-healing in a lot of ways, which is going to require a lot of infrastructure and automation work. And if you're up for that, and that's the realistic target, the real target for your service, then you'll have a good project on your hands. But most of the time, what I've found useful is discussing the immediate implications of a higher availability target: how much partial or full downtime does that actually give us, and what do we have in place that lets us meet those targets?
Yuri Grinshteyn (43:13):
So if you want to build a four-nines service, that means you can take roughly 50 minutes of outage a year. Can you detect, triage, and mitigate an incident within that kind of window today? And if not, what do you have to do to get there? What do you have to do to reduce your time to detect, time to mitigate, time to roll back? Do you have all the mechanisms in place today? So that's generally how I approach these conversations: first understanding where the requirement comes from, and second, getting into the implications of what it really means to implement something that will actually hit those targets, and then what that will cost.
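[Editor's note: the downtime arithmetic behind these "nines" figures is simple enough to check directly; this small sketch just restates the unavailability table from the SRE book that Yuri refers to.]

```python
# Allowed downtime for a given availability target: the unavailability
# budget is simply (1 - target) times the length of the period.

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def allowed_downtime_minutes(availability: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Minutes the SLO permits the service to be down over the period."""
    return (1.0 - availability) * period_seconds / 60

for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    print(f"{label} ({target}): ~{allowed_downtime_minutes(target):.0f} min/year")
```

Five nines works out to roughly five minutes per year, and four nines to roughly 53 minutes per year, which is why the detect-triage-mitigate loop has to be so fast at those levels.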
No, that's fair enough. That's kind of where we're going. They just want to align directly with the service providers and say, hey, your service provider is providing this. And I'm like, well, one of the services is providing that, and then here are the other services involved, and they're less on top of it, and here are the other things that can go wrong. And your service providers are going to pay pennies on the dollar compared to what you may be legally responsible for. So: we're not engineered to support that; are you willing to pay for that? I just want to know if there are any other nuggets in there for having that conversation with business units that are really walking through this exercise for the first time. There's been a lot of best effort, and now people have high expectations, as they should, but we want to make sure we're building in the right format and structure for the type of product it is. Not all products are equal.
Yuri Grinshteyn (44:52):
Absolutely, yeah. I would say the rough rule of thumb is that it takes 10x the cost to get one additional nine of reliability. That's usually enough to make people think twice about just how much they're willing to pay for it.
Yeah, I like that. I've got to steal that one.
Yuri Grinshteyn (45:08):
You don't have to steal it. It's free for the taking.
Yeah, thank you, James. What other questions do we have for Yuri? We still have about 10 minutes reserved here for additional questions, so if you've got one, share it.
Yuri Grinshteyn (45:28):
Have you had scenarios where customers run into issues and the problem is not caught in monitoring? Shouldn't that factor into your error budget? Is there a way to account for that? Yeah, that's a great question. So has that ever happened? Yes, absolutely. Is it common? No. If you build your monitoring with a focus on user experience, and you actually measure the things that your customers are seeing, it should be very rare that there's a problem impacting users that you're not aware of. Now, there could be issues that don't manifest themselves as quote-unquote errors. For example, you can imagine a banking app that shows somebody the wrong balance. It's really going to be difficult to build monitoring that, for every balance request, validates the payload to say: what did we send the user, and does it actually match what we thought we needed to send?
Yuri Grinshteyn (46:30):
Because you'd essentially be duplicating that business logic in your monitoring, you're probably just going to check: did we respond with a 200 when they requested their balance? So there could be times when your application logic is faulty and it's not manifesting itself as an error in your monitoring. So yes, that could happen. Will you then go and manually adjust your available error budget for the current window, whether it's a month or a week or however you measure it, to reflect those kinds of things? Honestly, that's what we call toil; it requires a lot of manual effort, and it's probably not sustainable over time. So my recommendation is to spend that time upfront, when you're designing the monitoring, to really identify the right things to measure.
Yuri Grinshteyn (47:23):
And if you're interested in how we do this, we've actually made public the materials we use. Let's see if I can find the link and put it in the chat here. Yeah, here we go. This was actually built by my team; let's see if I can send it to everyone. This is called the Art of SLOs workshop. There's a presentation that takes you through a longer version of what we talked about today, but it also helps you facilitate actually designing and identifying your indicators, the targets on them, and things like that. As you go through this process, it's really important to focus on: first, what is your service, and how do you define a service? Second, how do you define critical user journeys for that service? From there, what are the indicators that tell you your users are successfully completing those journeys? Those are your service level indicators, your SLIs. And then finally, what are the targets to set on those indicators? There you're building your SLOs. If you do a conscientious job through those four steps, your monitoring should be pretty good, and it'll be very rare that you run into a scenario where customers have issues that aren't caught by your monitoring. I hope that helps.
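[Editor's note: the four steps Yuri lists, service, critical user journeys, SLIs, SLO targets, can be captured as a small data structure. This is a hypothetical sketch; the service and journey names below are invented for illustration and are not from the Art of SLOs materials.]

```python
# Sketch: the service -> critical user journey -> SLI -> SLO chain as
# data. All names here are invented for illustration.

from dataclasses import dataclass

@dataclass
class SLO:
    service: str            # step 1: the service being measured
    user_journey: str       # step 2: the critical user journey
    sli_description: str    # step 3: how success is measured (the SLI)
    target: float           # step 4: the target on that SLI (the SLO)

    def is_met(self, good_events: int, total_events: int) -> bool:
        """An SLI is the proportion of good events; the SLO is met
        when that proportion reaches the target."""
        if total_events == 0:
            return True  # no traffic: nothing violated the target
        return good_events / total_events >= self.target

checkout_slo = SLO(
    service="storefront",
    user_journey="customer completes checkout",
    sli_description="checkout requests answered 2xx within 500 ms",
    target=0.999,
)

print(checkout_slo.is_met(9995, 10000))  # True: 99.95% >= 99.9%
```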
Jim Shilts (48:55):
Thank you for that question. I don't see any more in the chat. Anybody else have a question? Feel free to come off of mute and ask Yuri while we have him here. I don't see any more there, and no more in the chat, so we're going to end the main part of the online event here. Yuri, thank you very much for sharing with everybody.
Yuri Grinshteyn (49:22):
I mean, it's an honor to be here.
Jim Shilts (49:25):
Yeah, thank you. We appreciate you, and we appreciate all the questions. For everyone watching on YouTube, we're going to stay on and socialize a little bit, and obviously you don't get to do that with us; we're not going to record that part of it. So, Yuri, say goodbye to everyone on YouTube.