Automation is a game changer in the world of software development, and I can’t stress enough how crucial it is for enabling teams to develop faster and more effectively. If there’s one mantra I live by, it’s this: if it can be automated, it should be automated. And if it can’t be automated yet, then it’s time to invest in your product to make that automation possible.
The Power of Automation
Let’s break down why automation is so vital:
Automated Deployments: With automated deployments, you can refresh every security key, certificate, and environment on each release without relying on someone to remember a step. Azure DevOps takes this approach: rather than upgrading an existing environment, the team builds a new environment for each deployment and retires the old one. Nothing left over from a previous release can drift or linger.
Consistency Over Human Error: Humans are inherently inconsistent. We can’t follow a set of steps in the same way every time, which is where automation shines. An automated process runs every step, in the same order, every time. When an automated run fails, you know the fault lies in the steps themselves. When a manual run fails, you cannot tell whether the steps were wrong or the person following them made a mistake, and that is a risk you don’t need.
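To make the second point concrete, here is a minimal sketch of a step runner. The step names and the failing `rotate_certificates` function are hypothetical placeholders, not anyone's real pipeline; the only point is that a failed run names the exact step that needs fixing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # raises an exception on failure


@dataclass
class RunResult:
    completed: List[str]
    failed_step: Optional[str] = None
    error: Optional[str] = None


def run_steps(steps: List[Step]) -> RunResult:
    """Run every step in order; stop at the first failure and name it."""
    completed: List[str] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            return RunResult(completed, failed_step=step.name, error=str(exc))
        completed.append(step.name)
    return RunResult(completed)


def rotate_certificates() -> None:
    # Hypothetical placeholder that simulates a broken step.
    raise RuntimeError("certificate authority unreachable")


pipeline = [
    Step("build new environment", lambda: None),
    Step("rotate certificates", rotate_certificates),
    Step("switch traffic", lambda: None),
]

result = run_steps(pipeline)
print(result.failed_step, "-", result.error)
```

Because the runner stops at the first failure, traffic is never switched to an environment whose certificates were not rotated.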
Real-World Consequences of Neglecting Automation
Let me share a cautionary tale: the Knight Capital Group. In August 2012, a manual deployment left one of the firm’s eight servers running old code. Because a load balancer spread calls across all of the servers, the failures looked random and were hard to trace, and the firm lost roughly $440 million in about 45 minutes. It avoided collapse only through an emergency rescue investment and was acquired the following year. Automated deployment and automated checks could have caught the mismatch before it cost anything.
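The kind of automated check that catches a partially updated fleet can be very small. The sketch below is illustrative only: the server names and version strings are made up, and in practice the `deployed` mapping would come from asking each server what it is running.

```python
from typing import Dict, List


def find_stale_servers(expected: str, deployed: Dict[str, str]) -> List[str]:
    """Return the servers whose reported version differs from the expected one."""
    return sorted(name for name, version in deployed.items() if version != expected)


def verify_rollout(expected: str, deployed: Dict[str, str]) -> None:
    """Fail loudly instead of letting a mixed fleet take traffic."""
    stale = find_stale_servers(expected, deployed)
    if stale:
        raise RuntimeError(f"rollout incomplete, stale servers: {', '.join(stale)}")


# Hypothetical fleet: every server but one received the new build.
fleet = {f"server-{i}": "2.0.0" for i in range(1, 8)}
fleet["server-7"] = "1.9.3"

print(find_stale_servers("2.0.0", fleet))
```

Running `verify_rollout` as a gate before traffic is enabled turns a day of confusing, intermittent failures into a single error message that names the server.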
Similarly, consider the global impact of the CrowdStrike outage in July 2024. Incidents like this highlight the necessity of automated checks and deployments. As the frequency of deployments increases, so does the need for robust automation, including the ability to roll a change out to a small group first and confirm it behaves before everyone receives it.
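One common way to release to a small group first is to assign each user to a stable bucket and raise a percentage threshold over time. This is a sketch under assumptions, not a description of how any company named here does it; `bucket` and `in_rollout` are hypothetical helper names.

```python
import hashlib


def bucket(user_id: str, buckets: int = 10_000) -> int:
    """Map a user to a stable bucket so the same user always gets the same answer."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def in_rollout(user_id: str, percent: float) -> bool:
    """True if this user falls inside the first `percent` of the 10,000 buckets."""
    return bucket(user_id) < percent * 100


users = [f"user-{i}" for i in range(1000)]
early = {u for u in users if in_rollout(u, 5)}
wider = {u for u in users if in_rollout(u, 50)}

# Everyone who saw the change at 5% still sees it at 50%.
print(early <= wider)
```

Hashing rather than picking users at random means that turning the dial up only ever adds users, so nobody flips back and forth between versions during a rollout.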
Embracing Production
One of my favourite quotes comes from Brian Harry, the former product unit manager for Azure DevOps: “There’s no place like production.” No matter how much testing and validation you perform, production issues are inevitable. You can’t simulate the real-world environment perfectly, and that’s why it’s essential to build quality into your processes from the start.
Building Quality In
Instead of merely testing quality in, we should focus on building it in. This means getting your product in front of real customers as quickly as possible. Take Facebook, for example. When a developer rolls out a new version, there is a period in which each incoming call is executed twice: once against the current production version and once against the new version that is not yet live. Developers watch the telemetry and compare the output and performance of the two, while the share of traffic handled this way is turned up from around 10,000 users to 10 million and then 100 million.
The result? A turnaround time that is, as I understand it, about 12 to 13 minutes from code commit to production deployment, complete with a full regression run. This level of automation lets small changes and fixes ship quickly, while larger pieces of work can sit behind feature flags until they are ready.
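The execute-twice-and-compare idea can be sketched in a few lines. This is an illustration of the general technique only, not Facebook's implementation; `ShadowRunner` and the two pricing functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class ShadowRunner:
    current: Callable[[Any], Any]    # version serving production traffic
    candidate: Callable[[Any], Any]  # new version, not yet serving users
    mismatches: List[dict] = field(default_factory=list)
    calls: int = 0

    def handle(self, request: Any) -> Any:
        """Serve from `current`; run `candidate` on the side and record differences."""
        self.calls += 1
        expected = self.current(request)
        try:
            actual = self.candidate(request)
        except Exception as exc:
            # A crash in the candidate must never affect the user.
            self.mismatches.append({"request": request, "error": repr(exc)})
            return expected
        if actual != expected:
            self.mismatches.append(
                {"request": request, "expected": expected, "actual": actual}
            )
        return expected

    def mismatch_rate(self) -> float:
        return len(self.mismatches) / self.calls if self.calls else 0.0


def price_v1(quantity: int) -> int:
    return quantity * 10


def price_v2(quantity: int) -> int:
    # Hypothetical regression: a discount the current version does not apply.
    return quantity * 9 if quantity >= 10 else quantity * 10


runner = ShadowRunner(current=price_v1, candidate=price_v2)
responses = [runner.handle(q) for q in range(1, 11)]
print(runner.mismatch_rate(), runner.mismatches)
```

Users only ever receive the current version's answer, so the candidate can be wrong or even crash without anyone noticing. The approach is safest for calls without side effects, since each request is executed twice.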
Conclusion
In today’s fast-paced market, automation is not just a luxury; it’s a necessity. It empowers teams to deliver features quickly, respond to problems efficiently, and seize opportunities as they arise. By embracing automation, we can ensure that our products are not only fast and reliable but also capable of adapting to the ever-changing landscape of user demands.
So, let’s commit to automating everything we can. The benefits are clear, and the risks of neglecting automation are far too great. Embrace the power of automation, and watch your development processes transform.
Automation plays a massive role in enabling your teams to develop faster and more effectively. Automation is almost the thing that supports your ability to do that, and you should automate everything. If it can be automated, it should be automated. If it can’t be automated, you want to do the work in your product to enable that thing to be automated. So automated deployments, automated testing. I use Azure DevOps as an example a lot because they’ve done a lot of this work and hit a lot of these problems.
One of the things that they started doing was automating the changing of security. On every deployment, every security key, every certificate, everything is refreshed. Every environment, every server, so infrastructure as code, everything is refreshed. They never deploy to upgrade the version of their service on an existing environment. They build a new environment, put that in, and take the old environment out. These sorts of automations enable you to continuously be as slick as possible. And one thing that’s really important to understand is that humans suck at following a set of steps in the same way every time. That’s what robots are for. Robots follow a set of steps continuously. That’s what automation is. Automation follows a set of steps, always follows it the same way, and always follows all of the steps. So if you get an exception or you have a problem, there’s a problem with the steps.
When humans are following a set of steps manually, you don’t know whether the problem is with the set of steps or with the human following the steps, and that’s a risk you don’t need. It’s absolutely a risk you don’t need. A great example of that is the Knight Capital Group, a company in the US. They had 450 million in the bank at the beginning of the day, and they were doing a deployment of a new version of their system. A lot of things were not quite right; they were repurposing some code in their product. They were doing a bunch of silly things because they didn’t have good quality, but they were also doing a manual deployment, and the engineer that did the deployment deployed to six out of the seven servers that they had.
So the system then started behaving oddly because six of the servers had the correct code, and one of the servers didn’t have the correct code. So if you can imagine a load balancing situation where you’re trying to look at the system, it’s not working, it’s not functioning properly, but you can’t figure out why because some calls are working and some calls are not, and it looks kind of random because it’s the load balancer that’s load balancing between the servers. It took them all day to figure it out, but they’d started losing thousands of dollars every second. And with 450 million in the bank at the beginning of the day, by the end of the day, they had to file for Chapter 11 bankruptcy. They were listed on the New York Stock Exchange, which is why we know what the problem was because they had to file that as part of their bankruptcy filing. That would have been prevented by automation. It would have been prevented by automated testing. It would have been prevented by automated deployment. It would have been prevented by automated checks.
A more recent one that had a massive global impact was CrowdStrike. That would have been prevented by automation. It would have been prevented by automated deployment. It would have been prevented by automated checks. It would have been prevented by these types of capabilities that we’re talking about. As you increase the number of deployments that you do, you’re forced to deal with these types of scenarios. How do I roll out to a smaller group of people so that I can figure out whether… One of my favourite quotes is from a gentleman called Brian Harry. Brian Harry was the product unit manager for the Azure DevOps team, so he ran that whole developer division at Microsoft for many years, and one of his mantras was that there’s no place like production. It’s a Dorothy kind of thing, clicking the red shoes. There’s no place like production. No matter how much testing you do, no matter how much validation, no matter how much money or time you throw at it, you’re going to have production issues. You’re going to have production issues because you can’t simulate production. It’s not fundamentally possible. You can do your best, and you can spend an awful lot of money trying to simulate production as closely as possible, but there are always gaps. It’s not possible to simulate production, to simulate the type of transaction, to simulate what users do.
So a better strategy than testing quality in is to build quality in. And if you’re building quality in, you want to get that product in front of real customers in production as quickly as possible. Facebook does a really interesting thing where, when a developer is rolling out their new version of the product, they have a point in time when a call into Facebook is executed twice. It’s executed with the current production version, and then it’s executed with the new version that’s not in production yet. It keeps executing like that, and then they can turn up the dial and go from a small group, like 10,000 users, up to 10 million users, up to 100 million users doing this. And developers can see the telemetry for what’s happening. Is it performing well? Is it doing the right thing? Is the output similar when you compare the two?
And what they actually do is completely automated. The time from a developer committing a new capability to it replacing the production capability that’s there is, as I understand, about 12 to 13 minutes. And that’s with a full test suite, full regression, full validation. Do they operate the same way? Do they have the same output that we need? Do they work in that context? Do they perform and scale out across the entire platform? All in about 13 minutes. So they can have these small changes, small fixes go out really quickly. And then when they work on bigger things, perhaps they’re using feature flags or other capabilities.
So that’s an automated process. Automation is absolutely critical to your ability, and your product’s ability, to add features quickly and reliably, and to deal with problems, surprises and opportunities as they arise in your market.