Self-Service Monitoring through versioned Infrastructure and Configuration as Code

Carlos Munoz Robles

Global e2e Monitoring Lead at Allianz Technology

In this talk, we show the audience how we combined our experience as a MaaS (Monitoring as a Service) provider to create an easy-to-use, easy-to-update and easy-to-rollback Infrastructure as Code self-service monitoring solution. This addresses updating, misconfiguration and diversification issues by using a fully automated CI/CD pipeline that covers both system-wide and customer-specific settings.

We further discuss our self-service system, which lets us reduce maintenance effort and avoid manual steps through automated deployment of large numbers of instances, and we describe the idea of keeping all configuration in version control while deliberately ignoring changes users make through the UI.

Watch the full video:

“Self Service Monitoring through Versioned Infrastructure and Configuration as Code”

Slides:

Transcript:

“Self Service Monitoring through Versioned Infrastructure and Configuration as Code”

Carlos Munoz Robles (08:56):
For about a year now I have been leading the global end-to-end monitoring initiative, and when I took over this monitoring role within Allianz, this is what I inherited. I would like to talk, first of all, about the legacy. This is how our monitoring landscape looked: as you can see, it is quite fragmented and granular. We had roughly 50 different monitoring tools and a huge overlap between them, and even with these 50 different monitoring tools we didn't see any improvement, or at least a decrease, in the number of outages that were happening on a regular basis. One thing we realized is that we didn't have correlation between technologies, between business applications, infrastructure and network. So whenever we were suffering an outage, we were missing this correlation and traceability between all the layers. In addition, some of these monitoring tools were managed by IBM, by the data center provider or by another third party, and we didn't have full control over them.

Carlos Munoz Robles (10:09):
With these bullet points, you can imagine that we were mostly reactive instead of following a proactive approach. And these are the main drivers, our motivation. We are Allianz Technology, the service provider within the Allianz group, and what we need to tackle in the end is the reliability and the stability of our business applications and shared services, and of course to provide a single pane of glass. It doesn't make any sense that whenever an operational entity like Allianz Germany or Allianz North America asks for visibility of their whole ecosystem, you hand them 50 different tools to get this full correlation. The idea is also to take advantage of the AIOps capabilities that you can get from any APM (application performance monitoring) solution in order to improve root cause analysis and the mean time to repair, which is really important. And you will see that the CMDB is going to play an important role within this initiative: it is going to be our single source of truth to enrich the errors that are coming from our underlying layers.

Carlos Munoz Robles (11:17):
So the common disadvantages of the legacy that I took over: we were missing a single pane of glass, most of the work was done manually, and we didn't have any traceability on the configuration side. All the changes, all the settings were made through the user interface by different people, and of course this caused problems in the end. There was feature overlap, as I said at the beginning, between different tools; we had three different APM solutions, AppDynamics, New Relic and Dynatrace. And the automation was poor for a lot of the tools; we have the Prometheus ecosystem and we have some automation in place for the others. One thing we also realized is that there is no monitoring tool in the world that is going to give you the full visibility, this end-to-end monitoring.

Carlos Munoz Robles (12:14):
We ran a tender with Instana, AppDynamics, New Relic and Dynatrace, and none of them was able to give us the full visibility. Whenever you want to go deeper into the infrastructure layer and get the speed of the fan or the temperature of the CPU, you need a specific monitoring tool, and it is pretty much the same for the network layer with its full packet inspection capabilities. In order to fix this issue as soon as possible, this is how our first iteration looks. On the bottom you can see some of the monitoring tools that I shared in the first slide, at least the ones for which I was able to find a logo on Google. What we are doing as of now is to ship all the critical events to an event management solution, a German one; you can think of Kafka or any other event management solution.

Carlos Munoz Robles (13:07):
Here we are gathering all the critical events from our underlying layers. We then go to our CMDB and ask for the configuration items that have been affected by such critical events, and we have ITOM in place that is also doing the service discovery and mapping between all our assets. Later on, our enrichment layer propagates the critical errors into Dynatrace, which at the time was the winner of the tender for several reasons. Our idea is to keep Dynatrace as the single pane of glass; otherwise we would have the same issue of handing four or five different user interfaces to our end customers. This is already up and running and working quite well; it took us two or three months to build this solution. Later on we had some time to define the desired state, and after a few brainstorming sessions our team decided the next solution needs to be maintainable.

Carlos Munoz Robles (14:06):
And when I say maintainable: we have 7,000 developers within the group, so one idea is to foster this DevOps culture and give the responsibility for the settings and the configuration to the customer side. It's up to the teams: in DevOps, monitoring is one of the pillars, so they should be responsible for their own monitoring as well. Also, the solution should be human readable; let's go for YAML files instead of complex, huge XML files. As you can imagine, with 67 operational entities worldwide and 7,000 developers, our support queue gets busy from time to time, so whenever we have an issue we should be able to reproduce it almost immediately. And the future solution should be traceable and versioned, which gives us the chance to roll back to a previous configuration

Carlos Munoz Robles (15:00):
in case we have an issue, and to learn from our mistakes. And last but not least, automate as much as possible to avoid the monkey job that no one wants to do. So the first challenge was to onboard these 7,000 developers. Of course we would have liked to handle it the manual way, and we did that for a few weeks, but we were spending 80% of our time creating these [inaudible] within Dynatrace, the filters and so on, for each one of the technologies, from mainframe to AWS to Azure to Kubernetes. So we built a self-service portal. Now it's up to the customer, and I will show you how the workflow looks, because we have ServiceNow, which is also an important building block within Allianz. There we created an onboarding workflow, and any developer can go to ServiceNow with their Allianz certificate and order our global end-to-end monitoring service.

Carlos Munoz Robles (16:05):
They have a few fields, you know: the country, the name of the team, user management and the cost center, which is a kind of internal credit card within Allianz to cross-charge the services and capabilities we are providing. The payload from the ServiceNow form is mapped into our Ansible inventory, so we have an Ansible inventory with all the data coming from ServiceNow, and we have everything in GitHub. We created a webhook that triggers a Jenkins job automatically every time we get a request, and this Jenkins job runs an Ansible playbook that triggers the API calls into Dynatrace to create what the customer requested. It looks simple, but you cannot imagine the time you are going to save, and in a minute you will see the figures we have nowadays.
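As a rough sketch of that last step, an Ansible task in the playbook could call the Dynatrace configuration API with data coming from the inventory. Everything here is illustrative: the tenant URL, token variable, endpoint and payload are assumptions and would need to be checked against your own Dynatrace environment and API version.

```yaml
# Hypothetical playbook task: create an isolated segment (management zone)
# for a newly onboarded team. The endpoint and payload are assumptions based
# on the Dynatrace configuration API, not taken verbatim from the talk.
- name: Create management zone for the new team in Dynatrace
  ansible.builtin.uri:
    url: "https://{{ dynatrace_tenant }}/api/config/v1/managementZones"
    method: POST
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"  # token kept encrypted, never in plain text
    body_format: json
    body:
      name: "{{ operational_entity }}-{{ team_name }}"       # e.g. "italy-hr-portal"
      rules: []                                              # real rules would scope hosts/services to the team
    status_code: 201
```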

Carlos Munoz Robles (17:10):
I think we avoid 80% or so of the support tickets we were getting on a daily basis. But now, let's say that we provide a kind of isolated segment for our customers to handle their own monitoring; in my team, no one wants to handle the further configuration, like: I need a maintenance window, I need to onboard my AWS account, I want to track the response time of my databases, and such things. So what we did was to try this paradigm shift and provide visibility of such settings in a self-service fashion as well. Whenever we onboard a new customer, we share a GitHub repo with them, and you will see a demo afterwards. Whenever they need a new maintenance window, a new alerting profile, or a new alerting notification to Slack, to Microsoft Teams or whatever,

Carlos Munoz Robles (18:05):
they have the freedom to go there and create a new YAML file. That automatically triggers a Jenkins job, on our side as well, that makes the calls to Dynatrace to create this new configuration, and this is now managed without any human interaction from our team. I want to share first how our inventory looks. As I said before, we have 67 operational entities, like countries, worldwide, and we have a level for each OE; here we have an example for Italy, another one for Germany. On top of that we create another level for the development teams, and as you can see it holds the cost center and the user administration. This is important because from time to time you need to cross-collaborate on projects, and you can see my name in different teams; this gives you visibility of the assets coming from one place and the assets that you are using within your isolated segment.
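The exact layout of that inventory is not shown in the talk; the following is a hypothetical sketch of how an operational-entity and team hierarchy with cost center and user administration might be expressed as Ansible group variables (all names and values are made up):

```yaml
# Hypothetical Ansible inventory data (group_vars style) for two operational entities.
# Structure and field names are illustrative, not copied from the Allianz setup.
operational_entities:
  italy:
    teams:
      hr-portal:
        cost_center: "IT-12345"          # used internally for cross-charging
        admins:                          # user administration per team
          - carlos.munoz@example.com
          - jane.doe@example.com
  germany:
    teams:
      claims-app:
        cost_center: "IT-67890"
        admins:
          - max.mustermann@example.com
```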

Carlos Munoz Robles (19:13):
And this is how the whole workflow looks. Whenever we get the request in ServiceNow, the payload goes through our microservice, which updates the Ansible inventory; the Jenkins job triggers the Ansible playbook and we get the customer onboarded within a few minutes. After six months, this is what we got, without any human interaction: you just need to look at our dashboards to realize that we get load coming from maybe another country we didn't even expect. We have more than 300 applications fully monitored, and this figure increased just yesterday. Today we have 9,000 servers or hosts, and when I say hosts or servers I mean bare metal, Kubernetes nodes and containers, coming from 18 different data centers, our on-prem data centers and the ones coming from AWS and Azure, and more than 2,000 users.

Carlos Munoz Robles (20:07):
And I think this is really important whenever you want to have this cross-collaboration between teams. Now we are really happy; we were handling this manually at the beginning, but thanks to the self-service portal the customers onboard by themselves. Of course they get an email with instructions to install the OneAgent on all the different OSes and platforms we have in Allianz, and we support them whenever there is an issue in our platform. But now let's talk about the Dynatrace configuration and how we give visibility of it, in self-service, to our customers. Why do we need to talk about this? Because Dynatrace is an agent-based solution, and the customers, for instance, would like to switch the agent on or off for budget purposes, because this is a vendor solution.

Carlos Munoz Robles (21:14):
We are offering a pay-per-use approach: maybe you want to shut down the monitoring over the weekend or at night. They would also like to manage their own tags, or switch to the infrastructure-only flavor just to get the basic infrastructure metrics, or go full stack, or do no monitoring at all. We also have data privacy issues, where some things cannot be shared with everyone because of different legal relationships, for instance with China and with the USA; we have secrets that cannot be shared with all the operational entities. So what we did from the beginning is to make everything invisible by default and give this visibility on demand. On top of that, you have the things that are not managed by the OneAgent, like blackbox monitoring, the synthetic monitors.
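As an illustration of the kind of per-team settings that end up in the customer's repo, a hypothetical host-monitoring YAML could look like the sketch below. Field names, values and the file layout are assumptions for the sake of the example, not the actual Allianz schema:

```yaml
# Hypothetical per-team monitoring settings, owned by the customer in Git.
# A Jenkins job turns changes to this file into Dynatrace API calls.
team: hr-portal
monitoring_enabled: true          # switch the OneAgent on/off, e.g. for budget reasons
monitoring_mode: infrastructure   # "infrastructure" (cheaper) or "fullstack"
host_tags:
  - environment: production
  - costcenter: IT-12345
maintenance_windows:
  - name: weekend-shutdown
    schedule: "Sat 00:00 - Sun 23:59"   # free-form here; the playbook maps it to the API format
```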

Carlos Munoz Robles (22:09):
You can target an HTTP endpoint or a browser click path. These are things that also need to be managed through our configuration as code approach. There is also some functionality that is restricted to, let's say, the cluster admins of the Dynatrace instance; an example is that to onboard an AWS account you need to be a tenant admin, and we are trying to expose this too, so the customer can onboard their own AWS accounts without any human interaction. So the techniques we applied are configuration as code, GitOps and automation, in order to build this monitoring as a service. Now let's dive a little deeper. If you have any further questions, please don't hesitate to ask, or we can wait until the end; I don't want to talk for half an hour straight. Anyway, configuration as code: I think this is one of the sentences that defines the approach best, treat your configuration as if it was code, just like your application code.

Carlos Munoz Robles (23:20):
And this is what we are doing. The advantages are pretty much the same as the desired state we defined at the beginning: human readable, comprehensible and transparent, versioned through Git, and easy to automate. And we have different audiences. As the service manager, I am able to look at the configuration of all my customers, since I have access to all the GitHub repos; I can check it immediately and create my own business KPIs based on the changes triggered in those repos. Then we have the service owners; this is the visibility they get. We will go deeper into this topic: this is how the CaC repo looks, and they can see at a glance what they are doing in their own isolated segment. And for the cluster administrators this is also really helpful.

Carlos Munoz Robles (24:13):
They can onboard their Kubernetes cluster, let's say, in a self-service fashion, and the developers and operators can modify any of the parameters on demand. As I said before, everything is linked to a webhook that triggers the build; they get the feedback in the README file and also by email, and everything is handled in a minute or so. We have a Jenkins cluster, we can talk about this later, running on OpenShift right now, but our idea is to migrate this huge Jenkins cluster to EKS in the future. We have had around 100 jobs running at the same time without any issue. What else? Yeah, this is after roughly a year working on this project.

Carlos Munoz Robles (25:05):
This is a summary of what all our developers are requesting on a regular basis. What do they need to monitor? When we talk about hosts, this is what I said before: virtual machines, containers and even mainframes. We have some legacy, and I think the mainframe will remain forever and ever, and if you want full visibility you need to bring the mainframe into your end-to-end monitoring ecosystem. For the infrastructure it was the basics at the beginning, servers like EC2 instances, load balancers and such things, but now from time to time we are getting complex requests, and I really like them because we are facing new challenges: services, databases, Citrix NetScaler. On the application side, the basics: availability, response time and such things.

Carlos Munoz Robles (25:58):
This is basically what we are getting on a daily basis. Now let's talk about the per-customer configuration repo. This is, in the end, how we split the configuration as code. From the beginning we give our customers the chance to create alerting profiles and notifications, so it's up to them to create a new notification in Slack or Microsoft Teams, or send an SMS or voice-call alert in case they have a team on duty. They are also able to onboard their own AWS and Azure accounts, literally in less than a minute, because in the end we are talking about just four or five parameters. As you can imagine, we need credentials from time to time, and even if you just want to ship some notifications to your Slack channel, there is a webhook URL involved.

Carlos Munoz Robles (26:57):
That URL needs to be encrypted, because there is a kind of token in the middle; we provide a microservice to encrypt your credentials, and you will see later how it is used. Also Kubernetes, which is really trendy: we got requests for monitoring around 200 Kubernetes clusters over the last months. We provide the CaC for Kubernetes as well, and here it is getting more complex, and we will see why, because you have the bearer token and you have the certificate that need to be provided to an intermediate layer called the ActiveGate. And again more features, like maintenance windows, and the one I like maybe the most is the lifecycle management. Right now we have 8,000 agents up and running, and of course I don't want to take the responsibility of upgrading each one of them, because we are talking about production assets.
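To make this concrete, an alerting-profile plus notification definition in the customer's repo might look roughly like the following. The schema is invented for illustration; the real field names are whatever the team's Ansible templates expect, and the webhook value would be the encrypted output of the credentials microservice:

```yaml
# Hypothetical customer-owned alerting configuration.
# A commit to this file triggers the Jenkins job that calls the Dynatrace API.
alerting_profiles:
  - name: hr-portal-prod
    severity: ERROR                 # only propagate problems of this severity or higher
    delay_minutes: 15               # wait before alerting to avoid flapping
notifications:
  - name: hr-portal-msteams
    type: msteams
    profile: hr-portal-prod
    webhook: "{{ vault_hr_portal_teams_webhook }}"   # encrypted secret, never committed in clear text
  - name: hr-portal-oncall
    type: sms
    profile: hr-portal-prod
    phone_numbers:
      - "+49-000-0000000"           # placeholder on-call number
```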

Carlos Munoz Robles (27:52):
So we gave the responsibility to the end user, to the development team, to trigger the upgrade of the OneAgents running on their servers. They have a list in a YAML file; they can go there, pick one of the versions we already support, switch the number, and that triggers the API call to update the OneAgent running on that server. This is also really cool, because we don't need to touch these 8,000 servers on a regular basis, let's say every month, to be sure we are running the latest version. And the README file, as we will see later, is mainly used for feedback purposes.
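A hypothetical version of that lifecycle file could look like the snippet below; the version numbers, host names and keys are purely illustrative:

```yaml
# Hypothetical OneAgent lifecycle file owned by the development team.
# Changing "target_version" for a host and committing triggers the upgrade API call.
supported_versions:          # maintained by the monitoring team
  - "1.231.0"
  - "1.233.0"
  - "1.235.0"
hosts:
  - name: prod-app-server-01.example.internal
    target_version: "1.235.0"  # team bumped this from 1.233.0 to trigger an upgrade
  - name: prod-app-server-02.example.internal
    target_version: "1.233.0"
```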

Carlos Munoz Robles (28:48):
Now let's go to one of the examples of how we handle the CaC: onboarding a Kubernetes cluster into Dynatrace, because in the end this approach can be used with any tool. We did it with Jenkins, with SonarQube, with the Elastic Stack and other tools three years ago, and now we are applying the same methodology to Dynatrace. Again, by default you need to be a cluster admin to onboard a Kubernetes cluster, fill out all these fields and maybe switch some of the parameters on or off. So this is what we do: we create a kind of template in a YAML file, and we map the fields of the UI to parameters within our YAML file. As you can see, we have credentials here, and those need to be encrypted; this is the main reason we have a microservice, so you can put your encrypted credentials into a single folder. In the end it is just mapping what you see in the user interface into a YAML file and giving this capability to the end user, instead of them opening a support ticket and pinging the operations team with 200 such requests.
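A sketch of such a template, with made-up field names that simply mirror what the Dynatrace Kubernetes-onboarding UI asks for, might be:

```yaml
# Hypothetical Kubernetes onboarding template filled in by the customer.
# The field names mirror the Dynatrace UI; the real schema is whatever the
# team's Ansible templates expect.
kubernetes_clusters:
  - label: hr-portal-prod-eu          # display name in Dynatrace
    api_endpoint: "https://k8s-prod.example.internal:6443"
    bearer_token: "{{ vault_hr_portal_k8s_token }}"   # encrypted via the credentials microservice
    certificate_check: true            # the cert must also be present on the ActiveGate
    monitor_events: true               # optional features switched on or off per cluster
```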

Carlos Munoz Robles (29:54):
This is just to show you how we encrypt the credentials: we create a file and we put this file in our credentials folder. It's pretty straightforward, and in the end, with the credential integrated into the template, we are able to onboard the Kubernetes cluster in a self-service fashion. We also have some slides where we show how we decrypt all the credentials we have in our CaC folder, and how we pass these credentials to the templates that have been filled out by our customers, but I think we will not have time to go through those slides. So let's move on to our CI/CD pipeline, because I think this is also very important. As we will see later, we have roughly 300 GitHub repositories.
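The talk uses an in-house microservice for the encryption, so the exact format is not public. As a generic illustration, if a team used something like Ansible Vault instead, an encrypted secret sitting in the credentials folder would look roughly like this (the ciphertext below is a placeholder, not a real vault blob):

```yaml
# Hypothetical credentials/hr-portal.yml: only the ciphertext ever reaches Git.
vault_hr_portal_k8s_token: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  62313365396662343061393464336163383764373764613633653634306231386433626436623361
  6134333665353966363534333632666535333761666131620a663537646436643839616531643561
  ...placeholder ciphertext...
```

The idea is that the playbook decrypts this value only at run time, so the token never appears in plain text in the repo.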

Carlos Munoz Robles (31:00):
Whenever the customer triggers an update, let's say they create a new alerting profile, as I said before we have a webhook, and this triggers Jenkins. The Jenkins job clones the repo from the customer, and it also clones our, let's say, service repo where we have all the Ansible playbooks. Then the Ansible playbook triggers the necessary API calls to the Dynatrace API, in this case to onboard the Kubernetes cluster. This is maybe the most difficult example, because Dynatrace needs to verify the SSL connection to the API of the Kubernetes cluster, which means you need to copy the certificate onto the ActiveGates.
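A minimal sketch of that playbook step, reusing the hypothetical fields from the template above and assuming the Dynatrace configuration API's Kubernetes credentials endpoint (every variable name here is invented for the example):

```yaml
# Hypothetical task in the shared Ansible repo: register the customer's
# Kubernetes clusters in Dynatrace. Endpoint and payload are assumptions
# based on the public configuration API, not taken from the talk.
- name: Onboard Kubernetes cluster in Dynatrace
  ansible.builtin.uri:
    url: "https://{{ dynatrace_tenant }}/api/config/v1/kubernetes/credentials"
    method: POST
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
    body_format: json
    body:
      label: "{{ item.label }}"                        # from the customer's YAML template
      endpointUrl: "{{ item.api_endpoint }}"
      authToken: "{{ item.bearer_token }}"             # decrypted at run time
      certificateCheckEnabled: "{{ item.certificate_check }}"
    status_code: 201
  loop: "{{ kubernetes_clusters }}"
  no_log: true    # keep the bearer token out of the Jenkins console output
```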

Carlos Munoz Robles (31:54):
The ActiveGate, for the ones who are not familiar with APM tools, is a kind of proxy: it routes the data coming from your servers to the Dynatrace cluster, compresses that data, and avoids having to make our service internet-facing. In this layer you need to deploy the certificate. For that we also have our infrastructure repo, where we keep our infrastructure as code and where we update all the certificates, and we deploy all our ActiveGates on demand with blue-green deployments in our AWS accounts and also in our on-prem data centers.

Carlos Munoz Robles (32:38):
All of this is handled in one single repo that is also managed through configuration as code; even the ActiveGate setups are handled in YAML files instead of going directly to the UI. And thanks to [inaudible], this cluster is able to scale without an issue, so as of now we haven't had any complaints with regard to the CI/CD pipeline. Now just a quick demo, because I don't want to stay only in PowerPoint. Let me check quickly. This is our production instance; we have several Kubernetes clusters, and we can see here a couple of them, one from Australia, another one from Europe, from the HR portal. And that's the thing, I don't even know all the teams; for instance, I didn't expect to have such an amount of workloads running on these clusters in our instance today.

Carlos Munoz Robles (33:37):
And what I did is, okay, here is the team, the HR portal. I went to GitHub, I filtered by the name of this team, I went into the production environment, and I realized that they went to the Kubernetes folder and onboarded their Kubernetes cluster without any human interaction on our side. I think this is really cool; it is the only way to handle such demand, with 67 different countries that all want to use the different flavors of our end-to-end monitoring solution. In the folder we can see that we have the credentials here; as we saw before, they used our microservice to encrypt the credentials. One thing that is also really important is to provide feedback, because it takes some time: in the README file we provide the feedback from the Jenkins job.

Carlos Munoz Robles (34:46):
The customer can check it in the last commit, because in the end we are triggering the Jenkins job on our side. There we also link all our documentation, and they have the new credential for the Kubernetes cluster. In the end this is what they get: visibility of their own Kubernetes cluster. They fill in, you know, five or six different fields, and within a few minutes they are able to analyze all the nodes and the metrics. I'm not going to give you a commercial about Dynatrace, but at the end you can take all your worker nodes, and in each worker node you can see the load from the processes and from the containers, and it will be creating a baseline that helps us to be proactive, because we will detect any misbehavior in the future.

Carlos Munoz Robles (35:35):
Then, at the end, if we go to the hosts you will see that we have 4,000 or so in production right now, because we have production and pre-production environments. This is the only KPI I can check on my side: come here and say, okay, how many synthetic monitors do we have in place? 218, and this is increasing on a daily basis. As you can see, this is again handled through configuration as code; for synthetic monitors there is no need to open a ticket, they handle their own configuration and settings. Coming back to the presentation: I want to mention that the event management solution is ageing, and what we would like to do in the future is to migrate this event management box to ServiceNow or to Kafka.

Carlos Munoz Robles (36:30):
We need to run a POC. We would also like to use Dynatrace within our CI/CD pipelines; I think it's really important to define quality gates. We have unit tests, we have integration tests, but we usually do not check whether our application is consuming more CPU or more memory than we expect compared to the previous version. So the idea is to define quality gates in Jenkins, in our CI/CD pipeline, and also to define self-healing actions in the future; that is not there yet, but we are already working on it with the CIs from our CMDB. And also to create a huge data lake: we are going to be the owners of all the logs, all the data coming from our business applications. All the big companies are data-driven nowadays, and it would be great to have a data lake for postmortem analysis and machine learning.
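As a sketch of what such a quality gate could express, the YAML below compares resource consumption of a new build against the previous version. This is a made-up, tool-agnostic format just to illustrate the idea; in practice it would be an SLO definition for whatever gating tool (for example Keptn) the pipeline ends up using:

```yaml
# Hypothetical quality-gate definition evaluated by the CI/CD pipeline
# after a test deployment, using metrics pulled from the monitoring tool.
quality_gate:
  service: hr-portal
  compare_with: previous_version      # baseline is the last release that passed
  objectives:
    - metric: cpu_usage_millicores
      pass: "<= +10%"                 # fail the build if CPU grows more than 10%
    - metric: memory_working_set_bytes
      pass: "<= +10%"
    - metric: response_time_p95_ms
      pass: "<= +5%"
  on_fail: block_promotion            # do not promote the build to production
```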

Carlos Munoz Robles (37:28):
And of course, we are now in touch with our security teams, because there is a way to detect the security vulnerabilities of your third-party libraries at runtime, so I think this is also a way to avoid any data leak in the future. Yeah, that is all for today. We still have time, so let's jump directly to the Q&A, but anyway, if you have any further questions or you want to dive into any of these topics afterwards, you can drop me an email or ping me on LinkedIn and we can follow up on the topic.

Jim Shilts (38:12):
We do have one question in the chat here. If anyone else has questions, feel free to drop them in the chat, or you can take yourself off mute to ask as well. This one is related to Dynatrace: is Dynatrace agent-based monitoring, or just an APM tool? And the self-service portal in ServiceNow, where does it push the config to? Is it the Dynatrace agent or something else?

Carlos Munoz Robles (38:41):
I mean, it is an [inaudible] solution. Whenever you are talking about application performance monitoring you need to, let's say, install an agent on the server; you need to get intrusive and deploy a snippet in the frontend to get this distributed tracing end to end. Otherwise, whenever you want to see the service flow or the PurePath from a Java application or your microservice architecture, if you don't have an agent you don't get a proper result. And regarding ServiceNow: in ServiceNow you have ITOM, you have the service discovery, but this is usually basic authentication, and it is just checking a file from your servers, from any web server. So it's okay, it's good enough that you can get some correlation and traceability, but in that case you still need Dynatrace, because in the end the data we are getting from the agents is in real time, and this is the only chance, in this world where containers are going up and down on a regular basis, to enhance and enrich the CMDB [inaudible].

Jim Shilts (40:02):
I do that all the time. There's a comment in here too, from Pascal. It says: great work, great achievements, congrats.

Carlos Munoz Robles (40:16):
I mean, in the end, as I said before, you can copy-paste the solution to any of your tools. We did it with other ones, like Jenkins. I am leaving this topic now, but with Jenkins we were able to deploy 300 or so instances of Jenkins within Allianz following the same approach: self-service portal, the development team goes there, selects Jenkins, click, click, click, everything is fully automated, we deploy a new instance in our OpenShift cluster, they get a GitHub repo to define their jobs and the settings of their own instance, like plugins and such things, and everything in less than five minutes. And even if it's a well-known tool, it takes time; maybe you don't have the skills in your team or the manpower to keep the instance up and running. So I think this is a good approach that you can use for any tool. Yeah.

Jim Shilts (41:13):
Here's another question: Keptn is definitely cool, but do you believe it's ready for production?

Carlos Munoz Robles (41:48):
Yeah, I mean, we are brave, so we will just use it and ask for forgiveness afterwards, and this is where we have fun. I think this is important; otherwise we would get stuck in the old-school way of handling such things. It is also a good way to keep your team fully motivated, because they need to get challenges. In the end, the only way to provide additional value to the company is to create an impact, and it's working quite well. We struggle from time to time, because you need to change the mindset of some people, but they do see the benefits of this approach: fostering this automation across the rest of the teams, this resilient architecture. Yeah, we'll give it a try with Keptn as well.

Jim Shilts (42:51):
All right, another question, and thank you Pascal for that question. This one from AIG: are you using the operator deployment model when you trigger the Dynatrace instrumentation for Kubernetes via Jenkins?

Carlos Munoz Robles (43:02):
We are using the operator, that is the one. There are three different flavors, and the one highly recommended by Dynatrace and also by our architects was the operator. In the end you deploy it with a YAML manifest and you get the visibility immediately. The only tricky part, for the ones already using this: you have, let's say, infrastructure mode and full-stack mode. Infrastructure is cheaper because you only get the basic metrics, but if you want to go further into the application layer it costs more in licenses. So, in order not to switch the whole cluster to full stack and pay a huge amount for licenses, we deploy a container in the pod that is running the [inaudible], for instance, and this reduces the license consumption a lot. We have the operator for the infrastructure mode, and whenever we want to go deeper into the business application layer we do it with a container in the pod.
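For context, an operator-managed, infrastructure-only deployment is typically declared through the Dynatrace Operator's custom resource. The sketch below is only an approximation of the DynaKube schema from memory (names, namespace and URL are placeholders); the exact fields depend on the operator version and should be taken from the Dynatrace Operator documentation:

```yaml
# Hedged sketch of an operator custom resource limited to host/infrastructure
# monitoring; field names are assumptions, check the operator docs.
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: allianz-prod-cluster        # hypothetical
  namespace: dynatrace
spec:
  apiUrl: "https://tenant.example.com/api"   # placeholder tenant URL
  oneAgent:
    hostMonitoring: {}              # infrastructure-only mode; full stack would use a different section
```

Per-pod, application-only instrumentation, the cost-saving trick described above, would then be added only to the workloads that need deep application visibility.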

END

“Self-Service Monitoring through versioned Infrastructure and Configuration as Code”

Contact