Featuring Ajay Chankramath, Principal Technologist @ Thoughtworks and Matt Weaver, Director of Platform & Release Engineering @ AvidXchange
AvidXchange is a mid-size financial services company that is going through an amazing DevOps transformation. In the traditional delivery model we had been using for several years, the path to production typically took about 12 weeks in the best-case scenario. This experience report discusses the optimizations we went through to reduce the path to production from 12 weeks to 12 minutes.
In this presentation we will go into the various components of the path to production optimization. Specifically, we will discuss how we manage infrastructure, environments, pipelines, security and incident management. In addition, we will talk about the relevance of the platforms we built to enable this, and the technical capabilities the organization required to accelerate our delivery. Not everything went well on this journey. We will go into more detail on the cultural, organizational, technical and process changes and struggles we are encountering, and how we are navigating these landmines.
As part of this transformation journey we were able to realize our revenues significantly faster, fail fast and experiment with newer microservices faster. This has opened up amazing opportunities for our business during the challenging times over the past year.
Hi. Welcome. I’m excited and thankful for the opportunity to tell a part of AvidXchange’s transformational journey.
Agenda: Who we are, What’s our approach, Steps on our Journey, Challenges, and finally Outcomes & Learnings.
I’m Matt Weaver, Director of Platform Engineering. I was hired as Director of DevOps, but we’ll talk more about that. Avid is a Unicorn Financial Services company that automates Accounts Payable (AP). We are in Charlotte, North Carolina. We work from procurement, through invoice, workflow, AND payment. We even have an offering for suppliers to understand their cashflow. We help companies pay their bills.
Co-presenting is Ajay Chankramath – Principal Technologist at Thoughtworks. Thoughtworks is a global consultancy on Software Strategy, Design, Delivery, and Transformation, with 48 offices in 17 countries.
A pioneer in Agile, Continuous Delivery, Transformation, …
On our approach, the goal was pretty clear. The business demands that we move fast. We are in a unique position in the market with tremendous opportunity, and we are always looking for new features and new offerings. It’s important to us to maximize how quickly we can experiment with ideas, and how quickly we can get them in front of customers. With the goal in mind, the next question was “how do we know we are being successful?” We found inspiration in the Accelerate book. We were so moved by the research and learnings that our CIO bought a copy for the whole leadership group. We’ll talk about Delivery Lead Time and Deployment Frequency today, but for this talk we’ll consider Change Failure Rate and Mean Time To Recover to be next steps.
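As a rough illustration of how these two measures can be computed, here is a minimal sketch. The record format, the sample values, and the function names are hypothetical and are not Avid's actual tooling. It assumes Delivery Lead Time is taken as the median time from commit to running in production, and Deployment Frequency as the average number of production deployments per day.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records: (commit time, production deploy time).
deployments = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 9, 12)),
    (datetime(2021, 3, 2, 14, 0), datetime(2021, 3, 2, 14, 20)),
    (datetime(2021, 3, 4, 10, 0), datetime(2021, 3, 4, 10, 10)),
]


def delivery_lead_time(records):
    """Median time from code committed to code running in production."""
    return median(deployed - committed for committed, deployed in records)


def deployment_frequency(records, period_days):
    """Average number of production deployments per day over the period."""
    return len(records) / period_days


print("Delivery Lead Time:", delivery_lead_time(deployments))
print("Deployments per day:", round(deployment_frequency(deployments, 7), 2))
```

Using the median rather than the mean is a design choice in this sketch: a single slow release does not distort the picture of what a typical change experiences.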
Besides our goal, we also have a strategic perspective, in the sense that we know we want to build platforms. We want to promote autonomy by building self-service platforms that enable teams to take charge of their delivery. We also wanted to promote the DevOps culture & distance ourselves from the idea of a “DevOps team” that serves as a catch-all for activities that slip through the cracks. And then, not so much a deciding or motivating factor, but when we knew we wanted to build platforms – internal products – we also knew that strong Product Owners were going to be critical to our success.
Moving on to our journey – the first thing we wanted to do was increase the observability into our applications. We wanted fewer incidents, we wanted to find out about those incidents before our customers did, and we wanted to fix them quicker. We also wanted to be able to create feedback loops – both on delivery (thinking back to our original goal) & feedback loops for the business to ensure that we are building the right features. As we started off, we also knew we wanted to solve observability using the platform approach, with a dedicated PO understanding what teams needed to self-serve observability. The visual contains a high-level design of what we built. It includes a time series data store (Azure Monitor/Log Analytics), as well as a nice visualization technology (Grafana). A lot of the value that this platform provides is the availability of data from commonly needed sources, as well as the ease of getting our developers’ applications onboarded.
As the next step in our journey, we literally wanted to look at our path to production process. We were looking for speed and a way to help mature our Site Reliability Engineering practice. We knew we wanted to promote Continuous Delivery, and we wanted to promote the Dev/Ops mindset via product development teams self-running their applications for a period of time. We hypothesized that teams responsible for production would have a high incentive to address operational concerns. So how does it work? Teams build their applications, come to the Launch Readiness Review, address any findings, and are then eligible to decide, as a team and with their stakeholders, when to release. This is one way we promote team autonomy, and we automatically take care of compliance evidencing to boot! I also want to address the idea of a handoff readiness review – we are defining this process in Q2. This is the process for adding embedded SRE teammates to a team in support of reliability. There are likely two criteria for this decision: 1) the operational and reliability maturity of the application, and 2) the importance of the service to the business/revenue.
The last part of our journey I want to talk about today is how we make it quicker and easier to build the infrastructure and deployment automation pipelines. Again, the goal is all about speed, and quality as well. We took multiple approaches to solving this problem, starting with a “pipelines as a service” “devops” team, as well as two iterations of building a platform around this opportunity. We ultimately settled on the following guiding principles to anchor our implementation: composability, easy to use & extend, self-service, secure, providing consistent build & test, covering common infrastructure resources, and using industry standard tools. We are happy to report that milestone 1 is complete and gives developers access to some modules to easily build Azure resources with Terraform and manage them with Azure DevOps multi-stage pipelines.
Now let’s talk about some of the challenges we faced. We talked a lot about platforms & how important this idea is for us, but I don’t want you to take away the idea that you simply need to build some platforms and people will use them and your delivery will automatically transform. This isn’t the field of dreams! We had to set aside intentional time to work with teams on adopting the platform. Sometimes this was up to 20% of our capacity on a given workstream. We also had to not only upskill ourselves but also produce a set of videos, articles, etc. to help our developers do the same. We have to acknowledge that change is uncomfortable and we are asking folks to change. We had to overcommunicate: we talked about our philosophy and our why, and we talked about roadmaps and plans/backlogs. Our product owners also spend a lot of time collaborating with teams – 2 way communication & feedback.
Our other set of challenges is all around incentives. With a functionally aligned organization, were we incentivizing teams to be really good at their function, as opposed to the outcome? Without a purpose (like building self-service platforms), the DevOps team became a junk drawer of issues that other teams couldn’t or didn’t want to solve. In the beginning we had a heavy focus on ceremony – cargo cult culture. Credit to our leadership team: we have pivoted away from that focus, and are looking more at outcomes and driving good behaviors now. We had to pivot from a centralized and unempowered ops team to a more domain-aligned SRE model with the right accountabilities. The last two are closely related. We saw evidence of agentic state within a small portion of our team. This was an organized/orchestrated transformation; we asked people to change their habits and behaviors and we met some resistance. This is a good lesson learned, but we aren’t sure how else to drive a transformation. One of the things we did was to try to understand how folks are motivated. We latched on to the concepts of purpose, autonomy, and mastery.
On our results. We can clearly see that our teams working in post-transformation mode significantly beat the legacy teams on Delivery Lead Time. We can also see the deployment frequency average for the entire organization getting pulled in the right direction by teams that have adopted the new path to production.
On lessons learned, there are 5 big ones. As the last slide shows, the Platform Operating model has driven delivery acceleration, which is what we wanted to begin with – so we are happy. We also had to adjust our mindset a little bit; a transformation is a journey of continuously and intentionally improving. Transformation is about making Avid better every day when we sign off than when we signed on. We also had to intentionally think about and recognize what is important to us and why. We call these things outcomes, and working on the right problems is of the utmost importance. We had to have a good strategy, but we also had to intentionally execute on that strategy, and we had to show some leadership: making tough decisions, having resolve, and confronting uncomfortable truths. That combination of activities led us to success. Finally, we learned that resourcing is a zero-sum game. We can’t execute on every idea, so we have to be discerning. We have to be intentional and acknowledge the tradeoffs of our decisions – this is why understanding what we want to do and why is so important.