On January 21st, 2021, Nicolas Chaillan, Chief Software Officer at the US Air Force, joined the weekly North American DevOps Group (https://nadog.com/events/) webcast to share “How the Department of Defense Moved to DevSecOps at Scale with Kubernetes.”
We have included both the full webcast of the events as well as the transcript of Mr Chaillan’s talk. You can also find slides from this and other USAF talks here: https://software.af.mil/dsop/documents
Jim Shilts: Welcome. Welcome to NADOG, North American DevOps Group, DevOps & Hops, the virtual edition. Unfortunately, I don’t get to buy drinks for everyone tonight, but as soon as we are able to start doing these in person again, of course drinks and food will all be on me again. I hope you brought your own drink, or whatever you’d like, to just kick back and relax with us.
Jim Shilts: Today is the 21st of January 2021. My name is Jim Shilts. I’m the founder of NADOG and I want to personally thank all of you for being here. This kind of interaction, group discussion, sharing like Nick’s doing with us today, this is really, I think, one of the most valuable things that we can do. It’s a bit more challenging to do it virtually, so we’re really striving to hopefully bring that home to you guys here through Zoom.
Jim Shilts: Nick Chaillan is the Chief Software Officer at the US Air Force, and he’s going to talk to us about how the Department of Defense moved to DevSecOps at enterprise scale thanks to Kubernetes, everyone’s favorite tool.
Jim Shilts: Let’s jump to our featured talk, Nicolas Chaillan, Chief Software Officer of the US Air Force. Nick, how are you doing? Thanks for coming today.
Nicolas Chaillan: Thanks. Thanks for having me. I’m excited.
Nicolas Chaillan: All right. Well, let me share a little bit about the journey of the Department of Defense to DevSecOps. Some of you, I already know on the chat, so I guess you have heard me before, so that must get a little bit boring. But regardless, for people that have not heard this before, hopefully that can be entertaining, beside my accent. And, I am American though, so I guess that’s the difference. Just kidding.
Nicolas Chaillan: Anyway, if you want to take a look at the software, the new website, we have a lot of great information there. We have all the architecture documents, the videos, the training, all the great unbiased content coming from industry, coming from the Linux Foundation, the Cloud Native Computing Foundation. All the source code of what we build is there as well.
Nicolas Chaillan: For teams that are trying to adopt DevSecOps at scale in large organizations, this is a great option, so take a look at that. The link is on the slide software, though they have [inaudible 00:08:30] particularly the section called the DSOP documents, that’s where you’re going to find a lot of the links and the videos and a lot of content coming from different speakers.
Nicolas Chaillan: Now, when it comes to DevSecOps, we created the DoD enterprise DevSecOps Initiative when I started back in DoD, back in August 2018, I think… yeah, I forgot, two years. That’s it, 2018. It’s a joint team between A&S, DoD, CIO and all of the services. We have everybody working together to bring to life DevSecOps department, creating enterprise-wide capabilities with Cloud One and Platform One and the goal is really to bring that timeliness, modularity and enabling reuse, so teams can move at a pace [inaudible 00:09:24] while having that baked-in security for their computer engineering work.
Nicolas Chaillan: We try to think of software as something that’s never done and always evolving and to do that, you have to have that fast prototyping feedback loop with your end users no matter where they are and what kind of system you’re building. That pace and that cycle is foundational to the success of any organization in 2021. We picked Kubernetes to avoid vendor lock-in. Kubernetes is part of the Cloud Native Computing Foundation and we use containers as Lego blocks. These Lego blocks are reusable, can be centrally hardened and secured and assessed and updated so they can be deployed on any environment, whether it’s a Cloud provider or a classified Cloud, on a jet, on a bomber, on a space system, that gives a lot of flexibility to then be able to build systems, and we’ll talk about that today.
Nicolas Chaillan: We created what we call the Iron Bank, which is the centralized artifact repo of containers for the DoD, so we can centrally assess and accredit containers coming from industry, but also coming from open-source projects or coming from the DoD. We have now 300 containers approved on the Iron Bank. We also bring a zero trust behavioral detection cyber stack using sidecar container, we’ll talk about that today a little bit and if you want to know more, again, on the website we have deep dives in our cyber stack. If you want to learn more about Zero Trust and how we do behavior detection and reduce our attack surface, there’s a whole video on that on the website.
Nicolas Chaillan: We partnered with the Joint AI Center, excuse me, JAIC, part of the DoD CIO’s office to bring AI/Machine deep learning stack, and we partnered also with OSD [inaudible 00:11:38] to bring the digital engineering as a service stack all on top of Platform One.
Nicolas Chaillan: We created two teams that I forgot to talk about, Cloud One and Platform One. Cloud One is a cloud office using Azure and Amazon and also partnering with the IC when it comes to the classified clouds and fences for stack specifications. We have about 100,000 people we have to train this year. So we partnered with Linux Foundation, the Cloud Native Computing Foundation, [inaudible 00:12:16] to bring a learning hub that’s bringing unbiased content from industry and a sandbox to put it to practice, so the teams are not learning in a vacuum. We also created the concept of a continuous authority to operate, continuous ATO, which continuously accredits software and makes it ready for production. Multiple times a day… [inaudible 00:12:42] Platform One about 21 times a day. So we’ll talk about that as well. So why does DevSecOps matter?
Nicolas Chaillan: So let’s go back to the slide so we can see the websites. So those are the different links for the videos and all that stuff. So take a screenshot of those. It’s going to be in the recording and the slide is on the website anyway, so you can find a slide deck on the website too. Talked about all of the stuff we’re doing, but really that’s just me reading this slide and then why does DevSecOps matter? So the timeliness piece we talked about, in terms of the critical aspects to be able to keep up with the pace of relevance for cyber to be able to compete and of course fail fast, learn fast, but don’t fail twice for the same reason. And then we have of course really what we think is the only way to build software in 2021 now, at scale.
Nicolas Chaillan: We saved about a hundred years of time already after only one year using DevSecOps. So it’s a magnificent, incredible time savings across the department. A hundred years, compounded across 37 programs at the time. We have 60 programs now, so much more saving. About 12 to 18 months of time, thanks to the continuous ATO, and about four to 12 months of time, thanks to the continuous feedback loop that we can have with the end user, which is pretty significant as well.
Nicolas Chaillan: We also save a lot of money: for two big programs, about 12.5 million bucks per year, per program, so that’s significant. And of course, we’ll talk about benefit of the continuous ATO, and the baked-in security that we have. Now, when you look at some of the numbers, we had 106 times faster lead time from development to deployment, 208 times more frequent code deployments, 7 times lower change failure rate, 22% less time on unplanned work or rework, which means better quality of work to begin with and better feedback loop from your end user. And 50% less time remediating cyber issues. And one of the most incredible ones is 2,600 times faster recovery, so what we call mean time to recover. We also reduced costs of development by 40% on average, with 44% more time focused on new capability versus maintaining legacy code. And if that wasn’t enough, employees are 2.2 times more likely to recommend the organization, so that improved the employee morale and retention and attraction of new talent as well.
Nicolas Chaillan: All right. So when you look at the software ecosystem, Cloud One and Platform One are becoming these central components, that will be feeding capabilities to the rest of the teams, both for the defense industrial base partners with S&T. Other agencies. We have 12 agencies now using Platform One, all the stuff from our factory that you may have heard of from the [inaudible 00:16:28], to the Space Camp, Kobayashi Maru, to the Level Up and so on.
Nicolas Chaillan: And 43, or now probably even more than that, of the key weapon systems, ABMS, F-35, GBSD, AEGIS, JAIC, so really some of the most critical, if not the most critical stack on the planet, all working with Platform One. That’s a cool map where you can see we’re pretty much doing a little bit of everything, from space to business systems, to weapons systems and jets and bombers, so that’s pretty cool. And really some of the key numbers that I think people care about, and you’ll see Platform One is designed to be very flexible, not have a one-size-fits-all implementation. We have what we call the Party Bus, which is the multi-tenant environment.
Nicolas Chaillan: Effectively it’s a shared DevSecOps stack across teams where we have 2,000 developers, 1,700 microservices, I think we’ve manned 2,100 already, 24 applications in production and over 150 now, product teams on the Party Bus, all of that within six months. And we have over 315 approved containers on Iron Bank now. We do 21 commits of code per day and [inaudible 00:17:54] of code per day with under two days of lead time to deploy new features and less than 15 minutes time to restore issues and less than 5% change [inaudible 00:18:08]. That’s pretty amazing numbers for the government, I guess. Even for industry I would argue, so that’s pretty exciting to see.
Nicolas Chaillan: When you look at how we designed Platform One and why it’s so important to have a central team to help organizations move to DevSecOps, if you want to be competing and relevant this year and the next, you probably need something like this in your organization. We have built free services and paid services for the DoD teams so we can scale. The key is to avoid reinventing the wheel, but at the same time in bringing some of these kind of enterprise capabilities, but at the same time you want to avoid a one-size-fits-all approach and bring options. Not too many options, but not too few options either.
Nicolas Chaillan: We have Repo One which is the central Repo of the source code. That’s where we put all the source code of containers, all the information that we do with Infrastructure as Code. Then we have the Iron Bank which is the binary side of that. That gives you the containers that are built.
Nicolas Chaillan: Again, now we have over 300 containers, so the numbers on that are dated there. We have both commercial organizations and open-source products and government code on that Repo. Companies that are trying to do business with DoD can go and get their containers accredited. That’s accredited DoD-wide, all the way to [inaudible 00:19:46] classification level, so the highest classification levels of the DoD.
Nicolas Chaillan: For an organization that’s trying to sell software to DoD, that’s probably the easiest way, is to get your containers approved on the Iron Bank. On the website you’re going to see the Container Hardening Guide that’s explaining and showing people how to do that. We can often get containers accredited between two to four weeks. It’s pretty fast.
Nicolas Chaillan: We also centrally assess and accredit Kubernetes distributions, so we have multiple options. We have OpenShift, we have Rancher, we have D2IQ, we have [inaudible 00:20:24] cloud options. That gives a lot of diversity and options to the DoD to pick the right Kubernetes distribution that works for their use. We have a matrix to help teams pick the right distribution based on the different criteria, so check it out. It’s also on the website. That matrix really compares features and differences between the different companies and their products. That’s always interesting.
Nicolas Chaillan: Like I said, we have two main services. One is the Party Bus, which is part of the ABMS [inaudible 00:21:03] to work. Party Bus is multi-tenant. It’s a DevSecOps environment that we provide on multiple classification levels and we pick the tools and the team just becomes developers and use the tools.
Nicolas Chaillan: The Big Bang, instead, is more like a turnkey [inaudible 00:21:22] of Platform One and teams can take the Big Bang and deploy it anywhere they want. What’s interesting is we start to see large organizations and even start-ups take the code of Platform One and Big Bang and deploy it on their premises or their [inaudible 00:21:38] effectively using Platform One to do DevSecOps for their internal development as well, because they want to get some of our Zero Trust and Cyber Stack and they want to make sure they are aligned with the DoD stack. We even saw some big [inaudible 00:21:59] telling us that they are going to be using Big Bang for all of their developments effectively making this their DevSecOps internal stack, which is exciting. Since we opened all that source code, they can contribute and have developers and be part of that engagement as well and be able to make changes and improve the code without [inaudible 00:22:30] it so we are stronger and faster and better together as a team. We have multiple organizations now in large [inaudible 00:22:37] starting to completely bet on Platform One, Big Bang to do their DevSecOps engagements.
Nicolas Chaillan: We also created the first Cloud Native Access Point, which is a Zero Trust stack that helps teams get access to cloud providers and on-premise workloads by enforcing the device state and, based on the user identity, what they should have access to. It’s checking the device of the user. That could be a [inaudible 00:23:09] device, it could be a laptop, a mobile. Based on the device state and what device they use, they are allowed access to different resources based on need-to-know and based on who they are. It’s compounding the risk of the device with the user identity and based on these [inaudible 00:23:28] control enforcement they get whitelisted access to cloud resources. Effectively implementing the Zero Trust and bringing single sign-on at scale. We have a whole architecture on that, so if people want to see how we build that and that whole Zero Trust stack, that’s how we’re going to be accessing our DoD network.
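The access decision described here, which combines device state with user identity and defaults to deny, can be illustrated with a minimal Python sketch. It is not the Cloud Native Access Point implementation: the device attributes, trust levels, group names, and the `POLICY` table are all hypothetical, chosen only to show how the two signals might be compounded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    managed: bool  # enrolled in device management (hypothetical attribute)
    patched: bool  # patches are current (hypothetical attribute)


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset  # need-to-know groups asserted for this identity


# Hypothetical allowlist: resource -> (required group, minimum device trust).
POLICY = {
    "wiki": ("staff", 1),
    "source-repo": ("developers", 2),
}


def device_trust(device: Device) -> int:
    """Score device state: 0 = untrusted, 1 = managed, 2 = managed and patched."""
    if not device.managed:
        return 0
    return 2 if device.patched else 1


def allowed_resources(user: User, device: Device) -> set:
    """Default deny: grant only resources whose group and trust needs are both met."""
    trust = device_trust(device)
    return {
        resource
        for resource, (group, min_trust) in POLICY.items()
        if group in user.groups and trust >= min_trust
    }
```

With these assumed rules, the same user sees fewer resources from an unpatched laptop than from a patched one, and nothing at all from an unmanaged device.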
Nicolas Chaillan: Training is also foundational, right? We talked about the pace of IT and keeping up to be relevant. Well, you can’t do that if your people are behind, are not able to spend time to learn. We give an hour a day to our people to keep up with what’s going on and watch videos and training and things like that. We created this portal where we have, like I said, partnered with Linux Foundation, CNCF, and writing the books and we give access to that to our people so they can keep up with what’s going on. We also do workshops one-day, three-day, even a full two months [inaudible 00:24:39] that we [inaudible 00:24:40] people of Platform One inside of a team to start doing DevSecOps and producing, within two months, their first product. We have a lot of cool stuff that we do to make sure that people can be continuously keeping up and learning and making sure that we don’t become stale and get behind.
Nicolas Chaillan: We also created Contract Vehicles for DevSecOps, so some of the companies on the phone are already part of the contract vehicles. We have three main vehicles. We call them the Basic Ordering Agreements for DevSecOps, the BOAs. There’s one for cloud services, there’s one for licenses, there’s one for talents and services and effectively that enables DoD programs to order Cloud services, licenses or tools and people, talents. That could be a DevSecOps engineer, that could be a [inaudible 00:25:39] scrum master, that could be a developer, that could be support teams. All within that same vehicle. We can have orders under 30 days, so that’s exciting as well.
Nicolas Chaillan: We ended up cutting DevSecOps into layers so we can swap layers and be more agile. Really the goal was to be able to centrally assess and accredit layers, so separating infrastructure and platform from CI/CD and application layer. Platform, like I said, we use Kubernetes, the CI/CD layer, Continuous Integration/Continuous Delivery, is completely containerized. Everything is containerized. I will talk about that.
Nicolas Chaillan: Then we have what we call the Service Mesh which is doing the enforcement of Zero Trust for the traffic between containers, East/West Traffic as we call it. What that gives us is the inheritance of about 90% of the [inaudible 00:26:38] controls. The application is only left with a very small number of cyber controls to worry about because the rest is actually completely managed by the platform and by the continuous monitoring stack and the platform layer. That gives us a lot of flexibility for development teams to only focus on the delta, not to reinvent the wheel or all that cybersecurity stack.
Nicolas Chaillan: Effectively, when you look at the DevSecOps pipeline you have all these phases and tools. If you were to run these as virtual machines, and you didn’t use containers, they would slowly drift and become stale. You would have to manually update a bunch of [inaudible 00:27:22] machines, particularly when you have development, testing, staging and production environments and multiple classification levels, you end up with 20, 30, 40 of these per program. So you don’t want to have drifts and have environments become stale compared to another one. By using containers we can centrally update and accredit containers and the tools will self-update and self-heal if they crash. That gives us a lot of flexibility to centrally manage and update all the Lego blocks as containers.
Nicolas Chaillan: Why did we pick Kubernetes? We talked about the vendor lock-in aspect of abstraction and making sure we don’t get locked into a cloud provider or to a vendor. Containers are immutable, so we can centrally assess them and accredit them. They can run anywhere. That gives a lot of flexibility compared to a virtual machine that’s only for one Cloud or one specific platform. Containers can really run anywhere.
Nicolas Chaillan: Also, Kubernetes brings us a lot of great benefits beside the abstraction aspect. We have resiliency because it self-heals containers that crash, so they can be restarted automatically. We have that concept of sidecar containers, so let me explain a little bit about what a sidecar container is for people that don’t know.
Nicolas Chaillan: A sidecar container is a small container that runs alongside each container. Not inside, but alongside, so it’s injected by Kubernetes and it’s going to be present alongside every container to be able to do a lot of different things that are, again, changing for cyber. It could act as a reverse proxy to be able to prevent traffic from flowing and tracking what’s going on. It also can act as a way to centralize logs and [inaudible 00:29:31] to flow all that into a central log system. We can inject, effectively, cyber tools, as a sidecar container, so they can see what’s going on, they can monitor all the traffic, they can prevent access.
Nicolas Chaillan: There’s a lot of different things you could be doing. That’s how we inject our cyber stack into all products, but effectively it’s decoupled from the containers. If I have to update my cyber stack, I don’t have to reach out to each team and say, “Hey, you have to go and update that agent or thing.” We can update the cyber stack without having to update the containers, so it’s decoupled and gives us a lot of flexibility when it comes to adding or removing or updating things.
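To make the decoupling concrete, here is a small Python sketch of sidecar injection operating on a pod description held as a plain dictionary. It is an illustration under stated assumptions, not how Platform One or Kubernetes itself performs injection: the `SIDECAR` definition, the image names, and the `inject_sidecar` helper are hypothetical, and real injection is typically done by the cluster at admission time rather than by application code.

```python
import copy

# Hypothetical sidecar definition; the image name and tag are placeholders.
SIDECAR = {"name": "cyber-sidecar", "image": "registry.example/cyber-sidecar:1.0"}


def inject_sidecar(pod_spec: dict, sidecar: dict = SIDECAR) -> dict:
    """Return a copy of pod_spec with the sidecar running alongside the app containers."""
    mutated = copy.deepcopy(pod_spec)
    containers = mutated.setdefault("containers", [])
    # Drop any older copy of the sidecar; application containers are left untouched.
    containers[:] = [c for c in containers if c["name"] != sidecar["name"]]
    containers.append(copy.deepcopy(sidecar))
    return mutated
```

Calling `inject_sidecar` again with a newer sidecar image swaps only the sidecar entry, which mirrors the point made in the talk: the cyber stack can be updated without asking each team to rebuild its own container.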
Nicolas Chaillan: It’s also, obviously, giving us that adaptability, containers as Lego blocks that can be swapped with no down-time and giving us that modern writing so we can do A/B testing and different things like that, but also we use the concept of infrastructure as code and GitOps to also make the deployment of DevSecOps, so we can [inaudible 00:30:44] the DevSecOps stack with a push-button deployment. You push the button, you walk away and you come back and you have a full DevSecOps stack.
Nicolas Chaillan: You can tear it down every night and bring it back up every morning if you want and have a whole environment completely automated, which kind of mitigates some cyber risk by being able to go back to immutable state and using what we call “moving target defense,” which is, if you tear it down every night, if the bad actor got in, he loses everything he’s done and he has to restart back on zero in the morning, so it gives a lot of flexibility there. It also enforces that you don’t drift between environments, between classification levels. If everything is automated and the same, effectively you know that your different environments will not be drifting from each other. That’s foundational if you want to do work at different classification levels.
Nicolas Chaillan: Kubernetes also auto scales. If you need more compute or memory, it’s going to scale up and down, so it’s going to save you money in the long run in compute and memory consumption.
Nicolas Chaillan: We talked a little bit about Infrastructure as Code. Again, that’s the automation of the entire stack in code, so you have no drifts between environments, it’s immutable, it’s replicable, it’s automated. Effectively, you remove humans from production environments, which reduces the attack surface where you can disable [inaudible 00:32:12] be in the system, so that also reduces [inaudible 00:32:16] threat.
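As a worked example of scaling up and down, the rule documented for the Kubernetes Horizontal Pod Autoscaler is desired replicas = ceil(current replicas × current metric / target metric). The talk does not say which autoscaler Platform One uses, so the Python sketch below is only an illustration of that documented rule; the replica bounds are hypothetical defaults, and the real autoscaler also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Scale proportionally to load, rounded up, then clamp to the allowed range."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For instance, four replicas averaging 90% CPU against a 60% target would grow to six, while the same four replicas at 30% would shrink to two, which is where the cost saving comes from.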
Nicolas Chaillan: The evolution of Infrastructure as Code is GitOps. GitOps really gives us that automation where everything is code. Your networking changes, configuration changes, even keys and secrets and passwords can be encrypted in Git Repo. That gives you that consistent deployment and roll back if you have a problem. For [inaudible 00:32:41] recovery it is awesome, because you just have to back up your Git Repo and your databases and you have a full [inaudible 00:32:48] recovery of your stack, because everything is in code, so everything is in your Git Repo beside the data. So, you back up the databases, you back up your Git Repo, you’re good to go.
Nicolas Chaillan: That gives you, of course, that compliance and auditability so you can check, at any time, your desired state in your Git Repo. We pull from Git, so Kubernetes pulls from Git and [inaudible 00:33:14] push. Lot of tools out there in CI/CD do a push, which is really a big no-no. Effectively, the CI/CD tool, in many cases, has the keys to your production stack, so if someone gets into your CI/CD stack, they get into your production system as well. By doing a pull, the production stack is pulling from Git and it’s not… The Kubernetes stack has the keys of Git and not the other way around. That also removes ports and connectivity risks between the CI/CD tool and the Kubernetes cluster. We use Argo CD and Flux which are two open-source CNCF projects to do our GitOps deployments. Like I said, it’s a pull every minute from Git to see if anything changed and implement that change if it did.
Nicolas Chaillan: It’s also a great way to do your change management enforcement and configuration management enforcement by having multiple sets of eyes on code. For us, for example, we would say, well, if you have a critical change in the source code, you could have two or three sets of eyes to go and approve it. Or even four if you need. Effectively mitigating any kind of threat and doing a very similar process as the change management review except it’s fast and it’s something that can be done multiple times a day through a merge request in Git. We obviously use that to enforce the fact that not one user should be able to make a change without, at least, another set of eyes to review the change.
Nicolas Chaillan: The DoD is moving to this Continuous Authorization concept where we used to do something stuck in time once every three to five years and it would be a snap-shot of risk. Often wrong and often not updated for a long time. We want to move to a realtime [inaudible 00:35:16] of risk and having the ability to release software multiple times a day. SpaceX does 17,000 of [inaudible 00:35:25] software a day. They can update the software of the rocket the day before the launch. They can do three [inaudible 00:35:33] testing of the software today so they can always verify and validate and test the software on the actual hardware to make sure they didn’t break anything. We created this kind of Continuous Authorization approach which has three pillars. One is the platform, one is the process and one is the team. Both the team that runs the platform and the team that uses the platform to build software. We separate the DevSecOps platform team and the team that is writing software using it.
Nicolas Chaillan: Effectively we created three pillars. The platform, which is what often people call the “Software Factory,” which is really a DevSecOps pipeline, which for us has to be compliant with our reference design that we published back in August 2019, which has a few key gates and is also supposed to be using the containers from the Iron Bank and the sidecar container stack that we talked about, which brings up a few things.
Nicolas Chaillan: One is pushing logs, using [inaudible 00:36:44] centralized logs and [inaudible 00:36:47]. The other one is we use the Service Mesh to do Zero Trust: for Container A to be able to talk to Container B, it has to be whitelisted. That’s effectively a Zero Trust stack for all the traffic between containers. Then we have a behavior detection aspect where we have a tool that continuously monitors the stack and if the container starts doing something it’s never done before and drifts in behavior, we kill the container proactively. For teams to be able to get a continuous ATO, we require that they run on a platform that has everything we just talked about. Platform One gives that day one, so people using Platform One will have that day one.
Nicolas Chaillan: Then the second Pillar is the process. The process looks into your gates in your CI/CD pipeline. We have three main gates. Change Management Gates, Testing Gates and Cyber Gates. Change Management Gates look into how many sets of eyes you have for specific changes. Testing Gates look at your test coverage for unit testing, integration testing, regression testing and [inaudible 00:38:02] testing and then Cyber Gates look into static [inaudible 00:38:06] bill of materials scanning. [inaudible 00:38:09] of dependencies and continuous degree of scanning looking at [inaudible 00:38:13]. Effectively, we set thresholds for the gates and if you pass the thresholds, your software can be released into staging and then production.
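The pull model described above can be sketched in a few lines of Python. This is not Argo CD or Flux code; it is a simplified illustration in which desired and live state are plain dictionaries and `fetch_desired` is a hypothetical stand-in for reading manifests out of a Git repository. The point it shows is the direction of trust: the cluster side fetches and applies, and nothing outside the cluster holds credentials to push into it.

```python
def plan(desired: dict, live: dict) -> list:
    """Compute the actions needed to make the live state match the desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in sorted(live):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(fetch_desired, live: dict) -> list:
    """One pull cycle: fetch desired state from Git, then converge the live state."""
    desired = fetch_desired()
    actions = plan(desired, live)
    for action, name in actions:
        if action == "delete":
            del live[name]
        else:
            live[name] = desired[name]
    return actions
```

Running `reconcile` on a timer, for example once a minute as mentioned in the talk, gives the behavior described: a cycle with no changes in Git produces an empty action list, and a rollback is simply a revert in Git followed by the next cycle.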
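The "sets of eyes" rule on a merge request can be expressed as a small check. The Python sketch below is hypothetical: the function name and the default thresholds (one independent reviewer for ordinary changes, two for critical ones) are assumptions for illustration, and in practice this kind of rule is usually configured in the Git hosting tool rather than written by hand.

```python
def can_merge(author: str, approvers: set, critical: bool,
              required_normal: int = 1, required_critical: int = 2) -> bool:
    """Allow a merge only when enough people other than the author have approved."""
    independent = approvers - {author}  # an author cannot count as their own reviewer
    required = required_critical if critical else required_normal
    return len(independent) >= required
```

Under these assumed defaults, a change approved only by its author is always blocked, and a critical change needs two other reviewers before it can merge.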
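The two controls described here, allowlisted traffic between containers and killing a container whose behavior drifts, can be illustrated with a short Python sketch. It does not reflect any specific service mesh or detection product: the container names, the `ALLOWED_FLOWS` set, and the action labels are hypothetical, and a real tool would learn its baseline from observed behavior rather than take it as a constructor argument.

```python
# Hypothetical allowlist of permitted (source, destination) flows.
ALLOWED_FLOWS = {("frontend", "api"), ("api", "database")}


def flow_allowed(source: str, destination: str) -> bool:
    """Default deny: traffic passes only if the pair is explicitly allowlisted."""
    return (source, destination) in ALLOWED_FLOWS


class BehaviorMonitor:
    """Stops a container the first time it acts outside its known baseline."""

    def __init__(self, baseline: dict):
        # baseline maps container name -> actions seen during normal operation
        self.baseline = {name: set(actions) for name, actions in baseline.items()}
        self.killed = set()

    def observe(self, container: str, action: str) -> str:
        if container in self.killed:
            return "killed"
        if action in self.baseline.get(container, set()):
            return "ok"
        self.killed.add(container)  # drift detected: stop the container proactively
        return "killed"
```

In this sketch, the frontend can reach the API but not the database directly, and a container that suddenly performs an action it has never performed before is stopped and stays stopped until it is replaced.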
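A threshold-based gate check like the one described can be sketched as follows. The Python below is illustrative only: the metric names and threshold values in `THRESHOLDS` are hypothetical (the talk says the actual thresholds are agreed with each program's authorizing official), and each gate is reduced to a single number for brevity.

```python
# Hypothetical thresholds; real values are agreed per program.
THRESHOLDS = {"approvals": 2, "test_coverage": 80.0, "max_critical_findings": 0}


def evaluate_gates(build: dict, thresholds: dict = THRESHOLDS) -> dict:
    """Return pass/fail for the change management, testing, and cyber gates."""
    return {
        "change_management": build["approvals"] >= thresholds["approvals"],
        "testing": build["test_coverage"] >= thresholds["test_coverage"],
        "cyber": build["critical_findings"] <= thresholds["max_critical_findings"],
    }


def can_release(build: dict, thresholds: dict = THRESHOLDS) -> bool:
    """A build moves on to staging only when every gate passes."""
    return all(evaluate_gates(build, thresholds).values())
```

Returning the per-gate results, rather than a single boolean, makes it easy to route a failure to the right people, for example sending a failed cyber gate to the cyber team for review.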
Nicolas Chaillan: Then we have Continuous Monitoring Gates where we are continuously monitoring the behavior and having alerting in place to be alerted if there is a potential breach. That continuous monitoring and alerting is critical of course particularly for [inaudible 00:38:41] and looking at behavior detection.
Nicolas Chaillan: Then we have training pieces for the team. The last pillar. To train the platform team for both cyber training and DevSecOps training. [inaudible 00:38:55] metrics and understanding metrics. Then we also train, of course, the development team to use the platform to build software. We have training materials that we have at Platform One to help the teams be onboarded on Platform One.
Nicolas Chaillan: Effectively, when people use Platform One they have the first pillar, day one. The process is there, but we work with their authorizing official to define the threshold of the gates. The number of eyes on code. The threshold to pass. The static dynamic [inaudible 00:39:28] tools and so on. That’s very easy. That’s something that can be as fast as a few days to agree on the thresholds. Then the team, we have a training onboarding process to, of course, check clearances and background checks, but also making sure they get the training they need, giving access to our continuous learning hub as well.
Nicolas Chaillan: Effectively, we’re moving from that snapshot in time to this continuous risk posture in realtime with that [inaudible 00:40:03] looking at reciprocity, inheritance, having that CI/CD threshold to trigger events that could then involve the cyber team and maybe the authorizing official to approve or whitelist specific findings that are above the threshold if they need to whitelist them. Enforce it. We use automation to create a consistent and secure and repeatable [inaudible 00:40:33] of that stack and effectively we are making sure the software product has been through that automated risk determination process using really complete automation and removing, as often as possible, humans in the loop to make sure that we don’t increase the attack surface.
Nicolas Chaillan: Again, I wanted to share a few links, the same links we shared before, but for people that missed it, we have, again, a lot of deep dives on the website and videos of more [inaudible 00:41:12] and stuff on there, so you can take a look at some of the stuff we’ve been doing. If you have more questions on cyber, on the Cloud Native Access Point, on Zero Trust, on the different gates, we have a whole brief on continuous ATO that’s two hours, so I can’t do it justice here. It’s a lot of great stuff. We have the source code Repos, so you can go and check the source code of everything. If you want to deploy Big Bang on your Cloud or your own on-premise environment, we have the source code of Big Bang on Repo One, so you can deploy it yourself. It has a full DevSecOps stack on demand. Of course, we’ll take questions now.
Jim Shilts: I’ve got a couple of questions here that came in on the chat. Want me to go ahead and just read those to you, Nick?
Nicolas Chaillan: Yeah, please. Yeah.
Jim Shilts: The first one is from Peter. He asks, “How are you handling state with containers? If you are able to containerize the database to keep the container immutable, the storage is externalized. What is your general method for managing state with containers?”
Nicolas Chaillan: Yeah. That has drastically improved in the last two years. Most databases now have what we call operators, a community of operators to manage the databases and upgrading and scaling and [inaudible 00:42:33] scaling of databases for pretty much every database out there and of course the container itself is [inaudible 00:42:42] less but then the data is [inaudible 00:42:44] on Kubernetes, so effectively you could host it on clouds or on premise or pretty much anywhere you want.
Nicolas Chaillan: It decouples the run time from the data storage piece and of course each database is different and so based on the database now we have operators that will manage for you the lifespan and the updating and upgrading of these containers. On the Iron Bank we also have databases and so, I think we have 23 databases now, pretty much touching everything from relational database, [inaudible 00:43:20] big data [inaudible 00:43:22] databases and you name it. By using operators the automation is highly helpful to mitigate some of the manual aspect of upgrading and managing the life of the containers. It does this for you. You don’t need to worry about it. That has drastically improved from two years ago where running [inaudible 00:43:50] containers on Kubernetes was a problem. It is not a problem anymore. In fact, everything we do is on Kubernetes now. Hopefully that helps. If not, we can deep dive if you have more details you want to know.
Jim Shilts: Yeah. That’s a good question, Peter. I have heard that question a lot. The next one is from Jason, “How do the sealed secrets in Git work? My first thought was a private key file with some kind of encryption, but there is a partial plaintext attack for that.”
Nicolas Chaillan: Yeah. There are a few options now that are becoming kind of commercial products on Git. There is something actually called Sealed Secrets that is a product and that’s pretty much doing… Obviously you keep the private key and no one could decrypt the contents without the key. And of course, you still have that key to back up somewhere, so there is still something left to worry about. We tie this back to DoD to… We created the Platform One PKI stack using Vault. So we have an HSM with Vault and we store our keys [inaudible 00:45:04] for encryption, email signing and all that, encryption and also code signing on Git and also to sign the containers as well.
Nicolas Chaillan: Effectively, those keys can be stored and then the servers that will be implementing the code to get the secrets out will need that key to be able to decrypt the contents. Effectively, there are a few options nowadays. Sealed secret safe [inaudible 00:45:34] is another one. If you Google [inaudible 00:45:37] Sealed Secrets, GitOps secrets, whatever, there are a few of them. They all have different pros and cons to look into, but they all kind of use the same principle of encryption. Effectively, without the key you can’t see the passwords or whatever it is and you get there this way. That’s kind of the easiest way to do this.
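The principle in this answer is that anyone can encrypt a secret with a public key and commit the result to Git, while only the holder of the private key can read it. It can be illustrated with a deliberately tiny example. This is textbook RSA with toy primes, applied byte by byte, so it is not secure and it is not how Sealed Secrets or any other product implements encryption; real tools rely on vetted cryptographic libraries. The names `seal`, `unseal`, `PUBLIC_KEY` and `PRIVATE_KEY` are hypothetical.

```python
# Toy textbook RSA with tiny primes. Illustration only: NOT secure.
P, Q = 61, 53                        # secret primes (far too small for real use)
N = P * Q                            # public modulus
E = 17                               # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (needs Python 3.8+)

PUBLIC_KEY = (E, N)    # safe to share; developers use it to seal secrets
PRIVATE_KEY = (D, N)   # stays with the cluster; never committed to Git


def seal(secret, public_key):
    """Encrypt each UTF-8 byte with the public key.

    The returned list of integers is what would be stored in the repository.
    """
    e, n = public_key
    return [pow(byte, e, n) for byte in secret.encode("utf-8")]


def unseal(sealed, private_key):
    """Decrypt with the private key, as the deployment side would."""
    d, n = private_key
    return bytes(pow(value, d, n) for value in sealed).decode("utf-8")


if __name__ == "__main__":
    sealed = seal("db-password", PUBLIC_KEY)
    print(sealed)                        # what a reader of the repo sees
    print(unseal(sealed, PRIVATE_KEY))   # what the key holder recovers
```

The split between the two keys is the design point: the sealing side never needs the private key, so the only thing left to protect and back up is that one key, as noted in the answer.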
Jim Shilts: I don’t see any more questions in the chat. Who else has questions for Nicolas? We can go ahead and open up all of the lines here for you, so if you want to take yourself off mute and ask Nicolas directly, you can do that. Or if you type fast you can put it into the chat. I’m still looking at that as well.
Jim Shilts: A couple of people have asked you about the slides. Nicolas did put a link in the chat with all the slides from this talk as well.
Peter: Can I follow up with a question-
Nicolas Chaillan: Yeah. There are longer versions as well with like, 60 slides.
Jim Shilts: Cool. Peter, did you have a follow-up question?
Peter: Yes. My follow-up is when you talk about, just as the example of a database, because it’s simple to talk about, but when you externalize the storage, how do you keep those things in sync? How do you cluster that data? How do you manage that state so that you can have a redundancy and therefore the security to make sure that your storage plane is not under threat, because that is where the real jewel is.
Nicolas Chaillan: That’s a great question. Every database is going to be a little bit different there. Obviously some have built in [inaudible 00:47:11] ability concepts to act as a, I guess we don’t call it “master slave” anymore, because it’s a bad word now, we don’t use “slave,” but I don’t know what you’re supposed to say anymore, but that kind of a model of making sure you have redundancy and a [inaudible 00:47:29] across regions or across hosting platforms. Each database is different and the beauty of operators, if you look at the big clusters like [Kafka 00:47:40], or [MongoDB 00:47:42] or MySQL or whatever, they all have, in the operator, a way to go into the code and scale. You can tell it to do it and that’s the beauty of that automation, you can tell it to do it across cloud regions, across clouds, even if you want to say that they have… I don’t know why my camera is… Somehow it’s picking up the wrong light. I look like a ghost, so I’m a ghost now.
Nicolas Chaillan: Each operator has its capability to help you automate that process of [inaudible 00:48:20] and resiliency and scale, so it’s going to depend on each database that you use. There is a different process on MySQL versus [Kafka 00:48:27]. [Kafka 00:48:27] is going to do some sharding and spread the data across multiple servers. MySQL has this concept of master/slave thing, which I don’t know what you call now, which is kind of the same principle. Primary, secondary, although secondary means there is only one second and slave could have 20, but there is always… Yeah here we go. Leader or followers, that’s perfect I like that. Lead or follow. Whatever the name is, that model… Each operator in [inaudible 00:49:00] do this job for you and so there’s a lot of development work to create these operators from companies to help automate that process.
Nicolas Chaillan: ReCast, for example, is the operator for MySQL to be able to completely automate the management in [inaudible 00:49:17] and scale of MySQL. Same thing for [inaudible 00:49:21], same thing for whatever database of choice that you want to use. They all have different ways of doing it though. Honestly, if you want to have multiple clouds or you want to be hybrid, we do that all the time. For example, we have a [Kafka 00:49:37] cluster that can run on the cloud and then be completely replicated on premises and at the edge and go offline and then connect back in and synchronize again. You have [inaudible 00:49:49] or intelligent connectivity. So, a jet could be flying and then be offline and then come back and sync back the data, so we have a lot of backing up stuff. All through that automation. And that’s where, very often, you go to paid products to buy the operators, because often it is part of a paid service that they sell. Very few open source products have the advanced operators and it’s very, you’re going to find… [inaudible 00:50:20], for example, you would find us under, it’s actually a [inaudible 00:50:24], but it’s under the license of [inaudible 00:50:26] the company for ESK, so there are plenty of options.
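The leader/follower model and the "go offline, reconnect, synchronize again" behaviour described above can be sketched with an append-only log and an offset. This is a simplified illustration in plain Python, not how Kafka, MySQL or any specific operator implements replication. The `Leader` and `Follower` classes and their methods are hypothetical.

```python
class Leader:
    """Holds the authoritative, append-only log of records."""

    def __init__(self):
        self.log = []

    def append(self, record):
        """Add a record and return its offset (position in the log)."""
        self.log.append(record)
        return len(self.log) - 1

    def read_from(self, offset):
        """Return every record at or after the given offset."""
        return self.log[offset:]


class Follower:
    """Keeps a replica of the leader's log and catches up on demand."""

    def __init__(self):
        self.log = []

    def sync(self, leader):
        """Fetch only the records this replica has not seen yet.

        The follower's own log length is the next offset it needs, so a
        replica that was offline simply resumes from where it stopped.
        Returns the number of records copied.
        """
        missing = leader.read_from(len(self.log))
        self.log.extend(missing)
        return len(missing)


if __name__ == "__main__":
    leader, edge = Leader(), Follower()
    leader.append("a")
    leader.append("b")
    print(edge.sync(leader))   # replica is connected and catches up

    # The edge replica goes offline while the leader keeps accepting writes.
    for record in ("c", "d", "e"):
        leader.append(record)

    print(edge.sync(leader))   # replica reconnects and fetches only the gap
    print(edge.log == leader.log)
```

Tracking a single offset per replica is what makes reconnection cheap: nothing has to be compared record by record, and adding more followers does not change the leader's logic.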
Anonymous: I have a question. It’s really not on the technical side by any means, but how do you approach the cultural aspect and the operational capabilities to meet adversaries at that level of threat, right? I know when we consult we get a lot into mindful risk acceptance, yes [inaudible 00:50:53] and those types of things like threat modeling attack vectors. On the CI side, getting myriad [inaudible 00:51:03] based on test automation in order to meet that adversary at that level. How much time do you spend on that as opposed to the technical pieces?
Nicolas Chaillan: You know, it’s always interesting to me, we certainly do spend a lot of time on this, but at the same time, I always get the feeling when you really start talking about some of the top weapons systems of the DoD, you pretty much have every threat, right? I just don’t think… It’s like, are we really going to do the exercise of planning for every little [inaudible 00:51:32] when we know all of them apply? It’s not like you’re a start-up or a small company and you have a subset of threats. For us, it’s going to be everything on the planet. I can’t think of a single thing that would not apply to our workload. I find it sometimes to be, not a waste of time, but more like a… There just for the heck of being there. It’s something we have to do, but I have yet to see a [inaudible 00:52:01] case where that brought me an idea that I didn’t think about or a new attack vector that we didn’t think about, particularly when you start thinking Zero Trust. The foundation of Zero Trust is not to trust anything so by definition, you’re already doing whatever you can to reduce your attack surface.
Nicolas Chaillan: Of course, I think the SolarWinds breach was very significant and there were a few interviews of me on SolarWinds and the impact of SolarWinds. I think it’s probably the most impactful ever and people barely understand or scratch the surface of how bad this was and how advanced the bad actors can be when they get, in fact, into your crown jewel, which is your CI/CD pipeline. That is what happened here: they managed to inject malicious code into the CI/CD pipeline and release, in the [inaudible 00:52:57] servers, that malicious code to make it look like it was, in fact, coming from SolarWinds servers, and it was. It is just an interesting process. I just like to always go back to the basics. If you really look at your attack surface and you look at the people and process you can really highly mitigate a lot of things.
Anonymous: I’d like to ask a question. In the specific phase of packaging up an application, in a modern distributed application you may have a lot of containers or even serverless components or VMs, a kind of a mixture of subsystems, and once you’ve got through your pipeline and you’re packaging that up and thinking about deploying it to production, from what I’ve read so far, it seems like you all are tagging all the artifacts and thinking of that application as a whole. Have you all considered, and maybe optimized to the point where, you’re deploying individual subsystem components? Because there is some efficiency there as well. Your thoughts on that would be interesting.
Nicolas Chaillan: Yeah. In fact, that’s a big part of what we have to do now. We’re moving to microservices and reusable code and trying to share pieces across systems. If you look at SpaceX, again, they have nine platforms. They share 80% of code across platforms. Only 20% is customized across platforms. When you compare to the Air Force, we have barely 5% of code at best reused between F-35, F-16, F-22, and whatever else, so very little sharing of code. Moving to microservices now, which is what we’re doing, we’re targeting more like 80% code sharing across teams. Effectively, making us faster and more agile and we can scale confidence and these Lego blocks become swappable and a sensor, all the way down to the sensor, could be a separate container so you can swap things around on the jet and not have to completely rewrite [inaudible 00:55:16] and move away from that one-way effect view of the system. We are completely doing that. In fact, that is why we have 2,000 microservices we built in six months, so 2,000 Lego blocks, if you will, already done.
Jim Shilts: Hey Nick, I know you’ve got a hard stop here in about a minute. There are a few more questions that came into the chat. Would it be okay if, maybe, I emailed these to you and then I can send the responses back to everyone in the follow-up email that we send out?
Nicolas Chaillan: Yeah. Let’s do one more and then I’ll go. I’ll give you one more. I see there is a question on DNS. “Were there any issues with Kubernetes and DNS lookups and scale?”
Nicolas Chaillan: The interesting thing is, we actually kept our clusters very small and [inaudible 00:56:11] but effectively, we built small clusters per program, per workload so we have thousands of clusters, but very small. We host all of our DNS servers on Kubernetes as well. We use [inaudible 00:56:27] DNS. In fact, we hosted the first-ever .mil DNS server on the cloud. [inaudible 00:56:34] .mil my domain. If you go on [inaudible 00:56:38] .mil or any of our domains, I’ll put it on the chat, this DNS is hosted on Kubernetes, so that’s pretty cool.
Jim Shilts: Cool. Thank you very much Nicolas. Thank you for taking the time to answer another one there. We’re right at the top of the hour for you. We’re all going to stick around to do some virtual socializing together, but I really appreciate you being here. This was recorded and we’re going to get it edited and on the site in a few days. A few people asked about that as well.
Nicolas Chaillan: Thanks for having me and please, we’re doing “Ask Me Anything” sessions every three weeks. The next one actually is on, I think it’s on Tuesday at 1:00 PM, so if anyone has more questions and wants to learn more about what the DoD is doing on DevSecOps, the next one is on Tuesday at 1:00 PM, on Zoom. Feel free to join us.
Jim Shilts: I appreciate it, Nick. Thank you very much.