When it comes to developing complex products, one of the most significant steps we can take as developers and engineering teams is to embrace automation. In my experience, this is not just a best practice; it’s essential for success. Alongside reducing the size of our backlog—essentially making our deliverables smaller so we can iterate more frequently—automation becomes a cornerstone of our development process.
The Power of Automation
Automating everything is crucial for instilling confidence in our engineering teams. Here’s what I mean by that:
Automated Testing: This is non-negotiable. We need to ensure that every piece of code we write is automatically tested. This not only saves time but also helps catch issues early in the development cycle.
Automated Deployment: Continuous deployment to production hinges on our ability to automate the deployment process. The more we can automate, the more frequently we can release updates.
Automated Validation: Collecting telemetry data from our products allows us to validate our assumptions and decisions automatically. This feedback loop is invaluable.
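To make the automated-validation point concrete, here is a minimal sketch in Python of a post-deployment telemetry gate. Everything in it is an assumption for illustration: get_error_rate is a stand-in for whatever metrics backend you use, and the thresholds and check intervals are arbitrary.

```python
import time

def get_error_rate(region: str) -> float:
    """Placeholder telemetry query: fraction of failed requests in `region`
    over the last few minutes. Wire this to your real metrics backend."""
    return 0.01  # stubbed value so the sketch runs

def validate_deployment(region: str, baseline_error_rate: float,
                        max_regression: float = 0.02,
                        checks: int = 6, interval_seconds: int = 300) -> bool:
    """Automated validation gate: watch telemetry after a deployment and halt
    the rollout if the error rate regresses beyond the allowed margin."""
    for _ in range(checks):
        current = get_error_rate(region)
        if current > baseline_error_rate + max_regression:
            print(f"[{region}] error rate {current:.2%} exceeds baseline "
                  f"{baseline_error_rate:.2%} + {max_regression:.2%}; halting rollout")
            return False
        time.sleep(interval_seconds)
    print(f"[{region}] deployment validated; safe to continue rollout")
    return True
```

A pipeline would call something like validate_deployment after each regional rollout and only proceed on success; the same shape works for latency, crash counts, or any other telemetry signal you collect.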
A prime example of this can be seen in the Azure DevOps team. When they transitioned to deploying the cloud version of their product, they faced significant challenges. They were accustomed to shipping an on-premises release every couple of years, but suddenly they were aiming to deploy every three weeks. This shift required a complete overhaul of their approach to automation.
Tackling Technical Debt
The Azure DevOps team had to confront their technical debt head-on. They were dealing with poor-quality code and a testing infrastructure that was, frankly, terrible. Imagine waiting 48 hours to find out if a code change was successful! Their regression suite took even longer—up to a week. This was simply unacceptable.
To address this, they focused on reducing their cycle time, their time to learn: the span from conceiving an idea, through building and deploying it (or part of it), to collecting telemetry and feeding the results back into the next iteration. They identified their testing infrastructure as a major bottleneck. By converting long-running functional tests into short, efficient unit tests, they flipped their testing pyramid. Over four years, they transformed their testing strategy from 80,000 long-running tests to an impressive 880,000 short tests. This monumental shift reduced their feedback loop from 48 hours to just three and a half minutes.
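As an illustration of what flipping the pyramid can look like in practice, here is a hedged before-and-after sketch in Python; the WorkItemService class, the pytest-style test, and the mock usage are hypothetical, not the team's actual code. The slow version exercises a deployed service end to end, while the fast version tests the same logic against an injected fake.

```python
from unittest.mock import Mock

# --- Before: a long-running functional test (hits a real deployed service) ---
# def test_create_work_item_end_to_end():
#     client = connect_to_test_environment()   # minutes to provision, flaky
#     item = client.create_work_item("Fix login bug")
#     assert client.get_work_item(item.id).title == "Fix login bug"

# --- After: a fast unit test of the same logic against a fake store ---
class WorkItemService:
    """Hypothetical service under test; the logic lives here, storage is injected."""
    def __init__(self, store):
        self.store = store

    def create_work_item(self, title: str) -> int:
        if not title.strip():
            raise ValueError("title must not be empty")
        return self.store.insert({"title": title})

def test_create_work_item_unit():
    store = Mock()
    store.insert.return_value = 42
    service = WorkItemService(store)

    assert service.create_work_item("Fix login bug") == 42
    store.insert.assert_called_once_with({"title": "Fix login bug"})
```

The design point is that the business logic no longer needs a deployed environment to be exercised, which is what turns a 48-hour feedback loop into one measured in minutes.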
Empowering Developers
One of the key changes they implemented was enabling developers to run command-line calls to set up the necessary components for local testing. This meant that developers could have a local copy of the system running, allowing them to test their work in real-time. This empowerment not only boosted individual confidence but also fostered a collective sense of capability within the team.
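As a rough illustration of that kind of developer tooling (the component names, commands, and module paths here are hypothetical, not the Azure DevOps team's actual scripts), a single small command-line entry point can stand up whichever local pieces a developer needs:

```python
import argparse
import subprocess

# Hypothetical local components and the commands that start them;
# these are illustrative stand-ins, not real product tooling.
COMPONENTS = {
    "db":  ["docker", "run", "-d", "--name", "local-db", "-p", "5432:5432", "postgres:16"],
    "api": ["python", "-m", "myproduct.api", "--port", "8080"],
    "web": ["python", "-m", "myproduct.web", "--api", "http://localhost:8080"],
}

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Stand up the local components needed to test your change.")
    parser.add_argument("components", nargs="+", choices=COMPONENTS,
                        help="which parts of the system to start locally")
    args = parser.parse_args()

    for name in args.components:
        print(f"starting {name}: {' '.join(COMPONENTS[name])}")
        subprocess.Popen(COMPONENTS[name])  # fire and forget; a real tool would health-check

if __name__ == "__main__":
    main()
```

A developer working on the web tier might run something like `python localenv.py db api web` and have a local, testable copy of the system in a single step.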
Building Customer Confidence
As we improve our internal processes, we also enhance our customers’ confidence in our product. When things go wrong—and they will—it’s crucial to own up to mistakes and rectify them swiftly. Customers who have faith in our ability to fix issues will be more forgiving when problems arise. They’ll understand that mistakes are part of the journey, especially when they see our commitment to quality and improvement.
The Importance of Quality
In complex systems like Azure DevOps, confidence must permeate every level of the organisation. Each team member needs to trust that the components they rely on from other teams will meet the necessary quality standards. This trust is built through a shared commitment to quality and attention to detail.
Conclusion: The Path to Continuous Delivery
In summary, the journey towards continuous delivery is paved with automation. There should be no room for manual testing in our processes. While we may need temporary solutions—like the Twitter sentiment bot that monitored negative feedback during deployments—these should only serve as crutches while we work towards full automation.
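For flavour, here is a hedged sketch of what a crutch like that sentiment bot might look like. The fetch_recent_mentions call, the keyword scoring, and the thresholds are all invented for illustration; the real bot's implementation isn't described here.

```python
import time

def fetch_recent_mentions(product: str) -> list[str]:
    """Placeholder for a real Twitter/X search call returning recent tweet texts."""
    return []  # stubbed so the sketch runs

NEGATIVE_WORDS = {"broken", "down", "outage", "slow", "terrible"}

def negativity_score(tweets: list[str]) -> float:
    """Fraction of recent mentions containing an obviously negative word."""
    if not tweets:
        return 0.0
    negative = sum(1 for t in tweets if any(w in t.lower() for w in NEGATIVE_WORDS))
    return negative / len(tweets)

def watch_sentiment(product: str, baseline: float, tolerance: float = 0.10,
                    checks: int = 8, interval_seconds: int = 900) -> bool:
    """Crutch-style gate: after deploying to a region, watch public sentiment
    for a couple of hours and halt the rollout if negativity rises above baseline."""
    for _ in range(checks):
        score = negativity_score(fetch_recent_mentions(product))
        if score > baseline + tolerance:
            print(f"negativity {score:.0%} above baseline {baseline:.0%}; pausing rollout")
            return False
        time.sleep(interval_seconds)
    return True
```

The point of the sketch is the shape of the crutch: an external, indirect signal standing in for the telemetry you should eventually build into the product itself.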
By pushing the boundaries of what we can automate, we not only improve our code and quality but also enhance our engagement with customers. The road may be long, but with dedication and a focus on automation, we can achieve remarkable results. Let’s keep pushing forward, embracing the challenges, and striving for excellence in everything we do.
Transcript

When you are working on very complex products, one of the main steps developers and engineering groups can take is automating everything. That's probably the biggest one, along with reducing the backlog size, making the things that you're delivering smaller so that you can do more of them and iterate on them, right?
One of the key things that enables your engineering team to have the confidence to continuously deploy those small things to production is to automate everything. You should have automated testing, you should have automated deployment, and you should have as much automated validation as you can. There are automated validations you can do, especially if you're collecting telemetry in your product. A great example of that was the Azure DevOps team when they first started deploying the cloud version of their product. They'd been used to doing two-yearly deliveries, and suddenly they were doing three-weekly deliveries.
They had technical debt, they had poor-quality code, they had big gnarly chunks that were very difficult to edit, and they had automation that took a very long time to run. There were two big engineering pushes that team made which, from what I observed, had the biggest impact on their improvement. The first was that they reduced their cycle time as much as possible, right? They wanted to reduce that time to learn as much as possible.
Time to learn runs all the way from coming up with an idea for a feature, through building it (or some of it), deploying it to production (or some of it), collecting telemetry, and then feeding that back around into the next loop. That's time to learn, all the way around. So, you figure out what the biggest time sink is in that loop and tackle that.
For the Azure DevOps team, they found a number of things. One of those things was their testing infrastructure. Their testing infrastructure was, for want of a better expression, terrible. As I understand it, it took about 48 hours for a developer to find out whether a code change they'd made was successful. Their time to self-test, their ability to test their own work, was incredibly long: 48 to 72 hours. The time to run the regression suite was even longer than that, perhaps a week, because they had long-running regression tests.
One of the biggest pushes they made was converting all of those long-running functional tests into short code tests, right? Their pyramid was top-heavy: the largest number of tests were these long-running functional tests, the smallest number were unit tests, and they flipped that pyramid over. It took them four years because, remember, they'd been working on this product for six to eight years in a waterfall way, and they'd built this massive test infrastructure.
So, it took them four years of doing a little bit at a time, paying back that technical debt, to get to the point where they'd flipped that pyramid over. In fact, they removed whole layers of that pyramid, and they ended up with all of these fast-running unit tests. Instead of 80,000 long-running automated tests, they had 880,000 short tests. They took the time for a developer to find out they'd done something wrong from 48 hours down to three and a half minutes. They could run the whole test suite in three and a half minutes on their local machine.
They could stand up, via command line calls, any parts of the system that they needed in order to be able to do functional tests locally on their machine. That was one of the other changes that they made. How do we enable that so that developers can just run a command and it sets up the bits they need to test the stuff that they’re working on locally? So that they can have a copy of the system running locally and walk through it.
They built all of that functionality. It took them a long time to get there, but that investment improved their confidence in their ability, not just as individuals but as a group, to build and deploy features in the product, and it improved the confidence of their customers. Right? Their customers had greater confidence that this is a good product, a solid product. Yes, we've seen things go wrong, but when things go wrong, they own up to them, they fix them, they move forward, and they don't make that mistake again.
That confidence builds up, because it's okay to make mistakes. If you make a mistake and a customer throws a tantrum, throws their toys out of the pram, it's because they're used to working with vendors they have low confidence in, and they have low confidence in you fixing it. They believe they have to throw that tantrum in order for you to fix that thing.
You need to build their confidence in you and in your product, and then they'll stop doing that. They'll accept that something went wrong, that you messed up, because they know you're going to fix it, they know you're going to do a good job, and they know you're not going to do it again, right? Or at least that it's unlikely you'll do it again. That's confidence.
When you're working in big, complex systems like Azure DevOps, or other systems like them, you need that level of confidence not just at the whole-product level but at every individual team level. I, as a person working on a team, need to have confidence that if I'm using capabilities delivered by another team, they're going to be solid, they're going to work, they're going to meet the quality bar I need in order to do my work, and I'm not going to have to be the crutch that props up their functionality.
So, it's really, really important to have that underlying attention to detail and attention to quality: shorten those feedback loops, build up that automation, automate everything. There should be no manual tests. One of my favourite examples of that is the Azure DevOps team. They had poor telemetry in their product when they first moved to this cloud environment, and they were deploying region by region.
So, they would deploy to a region and then they would check to see that everything was working properly. They would check to see whether they adversely affected users. They’d do this all manually, and then they would deploy to the next region. One of the developers, an intern, for fun, developed something they called the Twitter sentiment bot.
So, this was a little bot that trawled Twitter for negative comments about the product. What they would do is deploy to a particular region, and then the Twitter sentiment bot would monitor Twitter for a couple of hours to see whether the level of negativity about the product increased in any way. You're always going to have a baseline of people that are unhappy with stuff, right? So, if negativity increased beyond a certain point, they would automatically stop the deployment and flag that there was something that needed to be looked at.
That’s a crutch, right? That’s a way to automate something that isn’t automatable because of the way you built your system. They eventually didn’t need that bot anymore because they had those full automations built into the system. But you need crutches while you get there to be able to push those boundaries and keep pushing for continuous delivery, keep pushing to do things faster, keep pushing to improve your code, improve your quality, and improve your engagement with your customers, and you’ll get there eventually.