Automate Everything: Building Reliable, Fast, and Scalable Products
Automation is the backbone of modern software development, enabling teams to build, deploy, and validate faster while maintaining quality. In this video, I explain how automating everything—deployments, testing, validations, and more—helps reduce risk, improve reliability, and allow your team to respond quickly to both opportunities and challenges.
📚 Chapters:
- 00:00 Introduction – Why automation is essential for engineering excellence.
- 01:20 Automate Everything – If it can be automated, it should be automated.
- 02:45 The Risks of Manual Processes – Real-world examples of catastrophic failures like Knight Capital and CrowdStrike.
- 05:15 The Value of Production Feedback – Why there’s no place like production.
- 06:30 Lessons from Azure DevOps – How automation transformed Microsoft’s development processes.
- 09:20 Facebook’s Approach to Automation – Parallel execution and rapid deployment.
- 12:00 Automation as a Competitive Advantage – Faster, more reliable releases to outpace competitors.
- 14:00 Call to Action – How Naked Agility can help your team achieve engineering excellence.
🎯 Who This Video is For:
• Developers: Looking to reduce manual work and build more reliable systems.
• Engineering Leads: Interested in fostering a culture of automation.
• CTOs and Tech Decision-Makers: Exploring strategies to scale development teams and improve delivery pipelines.
• Organizations: Struggling with deployment issues, quality concerns, or production failures.
🌟 What You’ll Learn:
• Why automating deployments, testing, and validations is critical for modern software engineering.
• How manual processes introduce risk, inefficiency, and errors.
• Real-world lessons from Azure DevOps and Facebook on the power of automation.
• How continuous integration and delivery can help you reduce time-to-market and improve quality.
• The importance of telemetry and observability for validating production performance.
💡 Key Takeaways:
• Automation Reduces Risk: Manual steps are prone to errors. Automation ensures consistency and reliability.
• Build Quality In: Focus on designing systems that inherently reduce errors rather than testing quality in later.
• Embrace Fast Feedback Loops: Deploy small, incremental changes quickly and validate them in production.
• Adopt Proven Practices: Use strategies like feature flags, automated regression testing, and parallel execution to improve efficiency and reliability.
• Respond to Market Changes: Automation allows you to quickly adapt to customer needs and market opportunities.
🔗 Ready to transform your team’s engineering practices?
Visit www.nkdagility.com to learn how Naked Agility can help you automate effectively, build quality into your products, and gain a competitive edge. Let’s create systems that scale and succeed!
Automation plays a massive role in enabling your teams to develop faster and more effectively. Automation is the thing that supports your ability to do that, and you should automate everything. If it can be automated, it should be automated. If it can’t be automated, you want to do the work in your product to enable it to be automated. So automated deployments, automated testing. I use Azure DevOps as an example a lot because they’ve done a lot of this work and hit a lot of these problems.
One of the things they started doing was automating the rotation of security credentials. So on every deployment, every security key, every certificate, everything is refreshed. Every environment, every server, so infrastructure as code, everything is refreshed. They never deploy to upgrade the version of their service on an existing environment. They build a new environment, put that in, and take the old environment out. These sorts of automations enable you to continuously be as slick as possible. One thing that’s really important to understand is that humans suck at following a set of steps in the same way every time. That’s what robots are for. Robots follow a set of steps continuously. That’s what automation is: it always follows the steps the same way, and it always follows all of the steps. So if you get an exception or you have a problem, there’s a problem with the steps.
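The "build a new environment, put it in, take the old one out" flow described above can be sketched as an ordered list of steps that a runner executes the same way every time. This is a minimal illustration, not the Azure DevOps team's tooling: the step names, the `state` dictionary, and the environment names are all hypothetical, and each step only records what a real step would do.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Step = Tuple[str, Callable[[State], None]]


def run_pipeline(steps: List[Step], state: State) -> List[str]:
    """Run every step in order and stop at the first failure.

    A failure is reported with the name of the step that broke, so the
    problem always points at the steps rather than at whoever ran them.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action(state)
        except Exception as exc:
            raise RuntimeError(f"step '{name}' failed: {exc}") from exc
        completed.append(name)
    return completed


# Hypothetical steps: stand-ins for real provisioning and deployment calls.
def provision_environment(state: State) -> None:
    state["candidate"] = "env-new"  # always a fresh environment


def rotate_secrets(state: State) -> None:
    state["secrets"] = "rotated"  # keys and certificates refreshed


def deploy_release(state: State) -> None:
    state["candidate_version"] = state["release"]


def health_check(state: State) -> None:
    if state.get("candidate_version") != state["release"]:
        raise ValueError("candidate is not running the requested release")


def switch_traffic(state: State) -> None:
    state["retired"] = state["live"]  # old environment is taken out
    state["live"] = state["candidate"]  # new environment is put in


STEPS: List[Step] = [
    ("provision environment", provision_environment),
    ("rotate secrets", rotate_secrets),
    ("deploy release", deploy_release),
    ("health check", health_check),
    ("switch traffic", switch_traffic),
]

if __name__ == "__main__":
    state = {"live": "env-old", "release": "2.0.0"}
    print(run_pipeline(STEPS, state))
    print(state["live"], state["retired"])
```

Because traffic only switches after the health check passes, a failed step leaves the old environment live and untouched.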
When humans are following a set of steps manually, you don’t know whether the problem is with the set of steps or with the human following the steps, and that’s a risk you don’t need. It’s absolutely a risk you don’t need. A great example of that is the Knight Capital Group in the US. In August 2012 they were doing a deployment of a new version of their trading system. A lot of things were not quite right; they were repurposing some code in their product. They were doing a bunch of silly things because they didn’t have good quality, but they were also doing a manual deployment, and the engineer who did the deployment deployed to seven of the eight servers that they had.
So the system started behaving oddly because seven of the servers had the new code and one of the servers didn’t. Imagine a load-balancing situation where you’re trying to look at the system and it’s not functioning properly, but you can’t figure out why because some calls are working and some are not, and it looks kind of random because the load balancer is spreading the calls across the servers. By the time they figured it out, they had been losing millions of dollars every minute: roughly 440 million dollars in about 45 minutes. That was more than the company could absorb, and it only survived through an emergency rescue by outside investors before it was later acquired. They were a publicly listed, regulated firm, which is why we know what the problem was: the SEC investigated and published its findings. That would have been prevented by automation. It would have been prevented by automated testing. It would have been prevented by automated deployment. It would have been prevented by automated checks.
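One kind of automated check that catches a partial deployment is a version-consistency gate that runs before the new release takes traffic. The sketch below is illustrative only: the host names and version strings are made up, and a real check would query each server rather than read a dictionary.

```python
from typing import Dict, List


def find_version_drift(deployed: Dict[str, str], expected: str) -> List[str]:
    """Return the servers that are not running the expected version."""
    return sorted(host for host, version in deployed.items() if version != expected)


def verify_fleet(deployed: Dict[str, str], expected: str) -> None:
    """Raise if any server is on a different version, naming the stragglers."""
    drift = find_version_drift(deployed, expected)
    if drift:
        raise RuntimeError(
            f"{len(drift)} of {len(deployed)} servers are not on {expected}: "
            + ", ".join(drift)
        )


if __name__ == "__main__":
    # Hypothetical fleet where one server was missed during the rollout.
    fleet = {"app-1": "2.0.0", "app-2": "2.0.0", "app-3": "1.9.4", "app-4": "2.0.0"}
    try:
        verify_fleet(fleet, "2.0.0")
    except RuntimeError as exc:
        print(f"deployment blocked: {exc}")
```

Run as a pipeline gate, a check like this turns "some calls work and some don't" into a single, specific failure message before customers are affected.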
A more recent one that had a massive global impact was CrowdStrike. That would have been prevented by automation. It would have been prevented by automated deployment. It would have been prevented by automated checks. It would have been prevented by the types of capabilities that we’re talking about. As you increase the number of deployments that you do, you’re forced to deal with these types of scenarios. How do I roll out to a smaller group of people so that I can figure out whether…
One of my favourite quotes is from a gentleman called Brian Harry. Brian Harry was the product unit manager for the Azure DevOps team, so he ran that whole developer division at Microsoft for many years, and one of his mantras was that there’s no place like production. You know, kind of a Dorothy type of thing, clicking the red shoes. There’s no place like production. No matter how much testing you do, no matter how much validation, no matter how much money you throw at that, no matter how much time you throw at that, you’re going to have production issues. You’re going to have production issues because you can’t simulate production. It’s not fundamentally possible. You can do your best, and you can spend an awful lot of money trying to simulate production as closely as possible, but there are always gaps. It’s not possible to simulate production, to simulate the types of transaction, to simulate what users do. It’s not possible.
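The question raised above, how to roll out to a smaller group of people first, is commonly answered with a percentage rollout: each user is assigned a stable bucket, and the new version is enabled only for buckets below a dial that is turned up over time. The following is a minimal sketch under that assumption; the function names and the feature name are hypothetical and not taken from any specific product.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for the given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return bucket(user_id, feature) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 10, 50, 100):
        enabled = sum(is_enabled(u, "new-checkout", percent) for u in users)
        print(f"{percent:>3}% dial -> {enabled} of {len(users)} users")
```

Hashing rather than random sampling keeps each user's experience stable between requests, and anyone enabled at a low percentage stays enabled as the dial goes up.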
So a better strategy than testing quality in is to build quality in. And if you’re building quality in, you want to get that product in front of real customers in production as quickly as possible. Facebook does a really interesting thing: when a developer is rolling out a new version of the product, there is a point in time when a call into Facebook is executed twice. It’s executed with the current production version, and then it’s executed with the new version that’s not in production yet. So it executes, executes, executes, and then they can turn up the dial and go from a small group, like 10,000 users, up to 10 million users, up to 100 million users doing this. And developers can see the telemetry for what’s happening. Is it performing well? Is it doing the right thing? Is the output from the two versions the same?
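Executing each call against both the current and the not-yet-released version is often called shadow (or dark) execution. The sketch below shows the general shape only and makes no claim about how Facebook implements it: `ShadowRunner` and its fields are hypothetical names, and the candidate runs sequentially here for simplicity where a real system would likely run it in parallel.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ShadowRunner:
    """Serve callers from `current` while comparing `candidate` on the side."""

    current: Callable[[Any], Any]
    candidate: Callable[[Any], Any]
    calls: int = 0
    # Each entry: (request, current result, candidate result or error text)
    mismatches: List[Tuple[Any, Any, Any]] = field(default_factory=list)

    def handle(self, request: Any) -> Any:
        self.calls += 1
        result = self.current(request)  # the caller only ever sees this result
        try:
            shadow = self.candidate(request)
        except Exception as exc:
            # A crash in the candidate is recorded, never surfaced to the caller.
            self.mismatches.append((request, result, repr(exc)))
            return result
        if shadow != result:
            self.mismatches.append((request, result, shadow))
        return result


if __name__ == "__main__":
    def current_total(prices):
        return sum(prices)

    def candidate_total(prices):
        # Hypothetical new version with a bug for empty input.
        return sum(prices) if prices else None

    runner = ShadowRunner(current=current_total, candidate=candidate_total)
    for request in ([1, 2, 3], [], [10]):
        runner.handle(request)
    print(f"{len(runner.mismatches)} mismatches in {runner.calls} calls")
    print(runner.mismatches)
```

The mismatch list is the telemetry a developer would look at: an empty list across a large volume of real traffic is strong evidence that the two versions behave the same.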
And they do it completely automated. So the time from a developer committing a new capability to it replacing the production capability that’s there is, as I understand it, about 12 to 13 minutes. And that’s with a full test suite, full regression, full validation: Do they operate the same way? Do they have the same output that we need? Do they work in that context? Do they perform and scale out across the entire platform? All in about 13 minutes. So they can have these small changes, small fixes go out really quickly. And then when they work on bigger things, perhaps they’re using feature flags or other capabilities.
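A fully automated path to production needs an automated decision at the end of it. One way to express the questions above (same output? performing well?) is a promotion gate over collected telemetry. This is a sketch under stated assumptions: the `Telemetry` fields and every threshold below are invented for illustration and would need tuning for a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Telemetry:
    """Hypothetical summary of a comparison run between two versions."""

    requests: int
    mismatches: int
    current_p95_ms: float
    candidate_p95_ms: float


def should_promote(
    t: Telemetry,
    min_requests: int = 1000,
    max_mismatch_rate: float = 0.0,
    max_latency_regression: float = 0.10,
) -> bool:
    """Decide whether the candidate may replace the current version."""
    if t.requests < min_requests:
        return False  # not enough evidence yet
    if t.mismatches / t.requests > max_mismatch_rate:
        return False  # outputs differ from production
    # Allow the candidate to be at most 10% slower at the 95th percentile.
    return t.candidate_p95_ms <= t.current_p95_ms * (1 + max_latency_regression)


if __name__ == "__main__":
    run = Telemetry(requests=10_000, mismatches=0,
                    current_p95_ms=100.0, candidate_p95_ms=105.0)
    print("promote" if should_promote(run) else "hold")
```

Keeping the decision in code means the same criteria are applied to every change, which is what makes small fixes safe to ship in minutes.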
So that’s an automated process. Automation is absolutely critical to your ability, and your product’s ability, to add features quickly and reliably, to deal with problems quickly and reliably, and to deal with surprises and opportunities as they arise in your market.