video

Continuous Delivery Without Compromise - Why Best Practices Don’t Exist in Complex Systems

In this video, I dive into a question I often get: What are the best practices for enabling continuous delivery? The answer may surprise you—there are no best practices in complex environments. Instead, there are adequate practices that must adapt to the ever-changing dynamics of your team, product, and market. Watch as I explore the philosophies and techniques that can support your organization in delivering quality software continuously while maintaining user trust and system reliability.

📚 Chapters

  1. 00:00 Introduction – Why best practices don’t exist in complex environments.
  2. 02:15 Adequate Practices, Not Best Practices – Adapting to your unique context.
  3. 05:00 Audience-Based Delivery vs. Environment-Based Delivery – Moving away from traditional Dev-Test-Staging-Production models.
  4. 08:45 Testing in Production – Embracing the reality of modern systems.
  5. 11:30 Real-World Example: Azure DevOps Team – How audience-based delivery transformed deployment.
  6. 16:00 The Role of Telemetry in Continuous Delivery – Monitoring, feedback, and decision-making.
  7. 19:30 Circuit Breaker Pattern and System Resilience – Ensuring users can work even when parts of the system fail.
  8. 25:00 Fix It, Find It, and Fix It Again – A philosophy for evolving automated checks.
  9. 29:00 Final Thoughts – Continuous improvement as a relentless pursuit.

🎯 Who This Video is For

  • Engineering Teams & Leaders: Seeking ways to support frequent, high-quality deployments.
  • Delivery Managers & Product Owners: Balancing user experience with delivery speed.
  • CTOs & System Architects: Adopting adaptive philosophies in system design.
  • Software Teams Moving to Continuous Delivery: Transitioning to modern deployment strategies.

🌟 What You’ll Learn

  • Why best practices don’t apply to complex systems and what to use instead.
  • How audience-based delivery models enable better testing in production.
  • The importance of telemetry and feedback loops in continuous delivery.
  • How the Azure DevOps team leveraged audience-based delivery and automation to evolve their systems.
  • Why resilience patterns like the circuit breaker are critical for maintaining user trust during system failures.

💡 Key Takeaways

  • No Best Practices: Instead, focus on adaptable, adequate practices suited to your unique context.
  • Audience-Based Delivery Works: Deploy small changes to select users and expand iteratively.
  • Testing in Production is Necessary: There’s no substitute for real-world validation in complex systems.
  • Resilience is Key: Implement strategies like the circuit breaker pattern to ensure continued usability.
  • Relentless Improvement: Continuously update your practices and systems to support quality and stability.

🔗 Ready to Enable Continuous Delivery for Your Team?
At Naked Agility, we help organizations embrace modern practices, adopt adaptive philosophies, and deliver quality software with confidence. Visit www.nkdagility.com to learn more about how we can support your journey to continuous delivery. Let's create systems that evolve with your needs!

Watch on YouTube

I often get asked about best practices that help teams do continuous delivery: what are the best practices? And I'm going to start right up front by saying there's no such thing as best practices when you work in a complex environment. There's no such thing as best practices. Best practices are for simple work in simple environments, where you can have a procedure, follow it consistently, and it becomes the best practice, the best way to do it, and you get the same results every time. That's not the world that we live in.

So the phrase I quite often use, which is I guess a little bit passive-aggressive, is that there are no best practices; there are only adequate practices for the situation at hand, and the situation might change. Right? That's fundamentally what we're talking about. But there are a bunch of practices that we see many organizations leveraging and getting success from, and we should try them and see if they work for us. That maybe makes more sense than best practices.

So the question usually is, if I take out the word "best": what practices enable cross-functional collaboration to support continuous delivery without compromising quality? One of those practices is having some way to control what code ends up in production, or better, what code is enabled for which people. That's a very powerful practice. Most organizations and most products have moved, are moving, or are thinking of moving towards an audience-based deployment or delivery pattern rather than an environment-based delivery pattern. Right? There are still environments within that context, depending on how it's set up.

But one of the core practices that supports this idea of continuous delivery, that supports this idea of continuous quality in production, is definitely moving towards an audience-based delivery strategy. In ye olden days, the delivery strategy was Dev, Test, Staging, Production. Right? Everything was done in Dev: the developers built all the stuff, and then it got pushed to Test. Testers tested all the stuff, and then it got pushed to Staging, where something else happened, usually load testing, and then it got pushed to Production. Maybe, if you were deploying to customers, there was also a UAT environment in the way. These are all costs, and they're extreme costs, and they're not worth it. They not only have a cost at the time; they have massive cost implications for our ability to build the right thing, massive cost implications for the cost of fixing stuff later, and massive cost implications because we're effectively testing quality in rather than building it in. Right? Testing quality in is the most expensive way to gain quality. Building quality in is how we should be doing it.

So this practice of audience-based delivery means we switch to a model that, I'm going to make it sound like, is testing in production. And in fact, that is one of the terms we use in this context: testing in production. Right? The reality of the world we live in, building these complex, interconnected systems that we all build and work on, is that there is no place like production. There's no way to simulate production. There's no way to truly validate that what you've done works in production until you get to production. So wouldn't it be better if we could get a small change quickly into production for a small set of users, and then be able to increase or decrease that user set on demand, so that we can validate that the product works in the real world and in real scenarios? That's effectively what we're talking about with this set of practices, this idea of shifting left and continuous delivery. And there are a lot of practices that help with that.
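To make the "enable it for a small set of users" idea concrete, here is a minimal sketch of an audience-based feature flag check. The flag store, audience names, and the `isEnabledFor` helper are hypothetical illustrations, not any particular product's API; real systems usually put this behind a feature-management service.

```typescript
// Hypothetical audiences: each flag lists which audiences currently see the new code path.
type Audience = "team" | "insiders" | "tenant-pilot" | "everyone";

interface FeatureFlag {
  name: string;
  enabledFor: Audience[];   // which audiences/rings currently get the feature
  allowList?: string[];     // individual users who opted in explicitly
}

// In a real system this would come from a flag service or configuration store.
const flags: Record<string, FeatureFlag> = {
  "new-breakout-rooms": {
    name: "new-breakout-rooms",
    enabledFor: ["team", "insiders"],
    allowList: ["martin@example.com"],
  },
};

function isEnabledFor(flagName: string, userId: string, userAudience: Audience): boolean {
  const flag = flags[flagName];
  if (!flag) return false;                            // unknown flag: fall back to the old behaviour
  if (flag.allowList?.includes(userId)) return true;  // explicit opt-in, e.g. a TAP-style program
  return flag.enabledFor.includes(userAudience);      // otherwise, gate by audience
}

// The same build ships to everyone; the behaviour is chosen per user at runtime.
if (isEnabledFor("new-breakout-rooms", "martin@example.com", "tenant-pilot")) {
  // render the new experience
} else {
  // render the stable experience
}
```

The key property is that widening or narrowing the audience is a configuration change, not a redeployment: the code is already running in production for everyone.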

So an audience-based deployment model is probably the main thing. And if you're thinking, "Oh, our product is too big and too complicated to be able to do that," the Windows team moved to that. Windows is deployed on an audience-based model rather than a more traditional environment-based model. Because it's a physical product that's physically deployed to people's machines, there's still a little bit of the old-school environment approach in there for sure, so it's not a complete switch. With cloud products you can go all the way, but even for the Windows team the time from cutting code to it being in production with real users is, I think, only a few hours internally. And at least nightly, as I understand it, they're deploying new versions of Windows out to all of the participants within Microsoft.

So if you’re inside of Microsoft and you take a BG, that’s their internal IT department BG build of Windows, like you’re not self-managing BG build, then you’re getting nightly builds of Windows. Or I think many people are. That means that what the developers wrote yesterday, you’re testing today, and it’s in production. Because you’re, you know, you’re a manager in Microsoft. You’re doing your day job, which is managing people. You might be managing marketing people, right, inside of Microsoft, or managing consultants or managing whatever. And your machine has the latest version of Windows. You’re using it in production. So that’s getting into production as quickly as possible.

And then what the engineering team is doing is monitoring the telemetry. This is the audience-based deployment model. They're monitoring the telemetry and deciding whether they want to open that particular build out to more people. And when they open it out to more people, that's the next ring. I guess Microsoft calls them rings, a ring-based deployment model, right? But it's really audience-based. Each ring has an audience of people, they're all in production, and they're just opening it out to more and more people. That's a pretty simple version because it is a physical product that's deployed on your machine, right? It's your operating system.
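As a rough illustration of that telemetry-driven decision, here is a sketch of a ring-promotion check. The ring names, metrics, and thresholds are invented for the example; they are not Microsoft's actual criteria.

```typescript
// A minimal sketch of a ring-based promotion decision driven by telemetry.
interface RingTelemetry {
  ring: string;
  crashRate: number;        // crashes per 1,000 sessions in this ring
  regressionCount: number;  // open regressions attributed to this build
  hoursBaking: number;      // how long the build has been in the ring
}

const rings = ["ring0-team", "ring1-internal", "ring2-insiders", "ring3-broad"];

function shouldPromote(t: RingTelemetry): boolean {
  // Only widen the audience when the current audience's signal looks healthy.
  return t.crashRate < 0.5 && t.regressionCount === 0 && t.hoursBaking >= 24;
}

function nextRing(current: string): string | undefined {
  const i = rings.indexOf(current);
  return i >= 0 && i < rings.length - 1 ? rings[i + 1] : undefined;
}

const telemetry: RingTelemetry = { ring: "ring1-internal", crashRate: 0.2, regressionCount: 0, hoursBaking: 36 };

if (shouldPromote(telemetry)) {
  const target = nextRing(telemetry.ring);
  console.log(target ? `Promote build to ${target}` : "Build is already at the widest audience");
} else {
  console.log("Hold the build in its current ring and keep watching telemetry");
}
```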

So it's got to run on bare metal and in the cloud, but it runs on metal. Right? But if you look at something like Microsoft Teams or Office 365, they have the ability to switch features on and off for specific users. So regardless of what build of Microsoft Teams is shipped, for example, and I'm in the TAP program for Teams, basically their version of the insiders program, I get features before the general public. Everybody gets features, and that can be specific to me as an individual user within my company, or to all users within my company. And that enables that choice. Right? So even as a customer, in the TAP program I can choose that I get the, oh my goodness me, cutting-edge latest and greatest, and somebody else in my business gets the reasonably stable build that's kicking the tires, ready for moving to a wider, more general public audience. And then the general public have a way to opt into some extra features and things.

So we're all able to communicate with each other, right? We can all join the same call, and some people are using different features from other people within the context of that call. Some of them have new capabilities, whole new code bases, running their part of that story. And it's really interesting, because I do calls with folks at Microsoft, and I've had folks at Microsoft who are on much earlier builds than me because they're choosing to help out that team, or they work on that team. And yeah, occasionally their call drops and they have to log back in: "Oh, sorry, early build, I've got a bug." And there's a risk-benefit analysis there. If you're working in a company and you want to take the earlier features so that you can pre-validate them for your company, understand what they are to help with training people in your company, and understand what's coming down the pipeline, then you can choose to do that. But you're choosing to take a little bit of risk, right? Because it's going to be a little bit less stable.

This is the idea of testing in production. I'm not expecting a complete crash where everything stops working, right? But the occasional glitch, the occasional weirdness, I'm good with that. I teach training classes on Microsoft Teams. I teach all my classes using Microsoft Teams, using breakout rooms, using all those things. I'm in the TAP program. I have earlier capabilities. Occasionally, things go a little bit weird for me. That's just a teaching moment in the class, because we're talking about how we deliver software and how we deliver products. And part of that is accepting that there are going to be some mistakes. If you're doing continuous delivery to production, there are going to be things that get past your automated gates, right? And end up in production. It's what you do with that information.

That's one of the best complementary practices, and I'm going to use the word best there: that philosophy and how you apply it. In fact, it's not even a practice; it's a philosophy. You need to have a philosophy of find it and fix it. So if something does make it into production and you're doing continuous delivery, you need to figure out how and why it got past your automated checks, and how you can change your automated checks to catch those things. That's it. If you find, "Oh, it's not possible to change our automated checks because of the way we've architected the system," then I would expect a team to be asking themselves the question, "Should we be changing our architecture so that these types of problems can't make it into production? And how long is that going to take?"
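As an illustration of the "find it and fix it" loop, here is a hypothetical example: a defect escaped to production (say, a due date that rolled over incorrectly across timezones), and the fix ships together with an automated check that would have caught it. The function, scenario, and data are invented for the example.

```typescript
// Illustrative only: a formatter that once produced the wrong day for users far
// ahead of UTC, plus the regression check added once the escape was found and fixed.
import assert from "node:assert";

function formatDueDate(isoDate: string, timeZone: string): string {
  return new Intl.DateTimeFormat("en-GB", { dateStyle: "medium", timeZone }).format(new Date(isoDate));
}

// Regression check added as part of the "find it and fix it" loop: late on 31 December
// UTC it is already the new year in Auckland, so the formatted date must say 2025.
const formatted = formatDueDate("2024-12-31T23:30:00Z", "Pacific/Auckland");
assert.ok(formatted.includes("2025"), `expected the due date to roll over to 2025, got "${formatted}"`);
console.log("regression check passed");
```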

A great example: the Azure DevOps team had a bunch of incidents where one service that really shouldn't be mandatory took out the entire platform. Right? They're running an online platform, and, for example, the profile service, this was their first example, stops working. Does it matter whether you're showing the GUID of the user versus the friendly name of the user? It's the profile service that gives you that friendly name; without it, you've only got the GUID. So what if that profile service is down? Would you rather your entire system was down, or that it showed a GUID in place of a username in some cases? Right? In some small number of cases, I'd rather it showed the GUID and the system still worked, because then my users can still do their job. My users can still use the system. They just see a small, controlled glitch. Right? And then when that profile service comes back up, or we fix it, the friendly names turn back on again.
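A minimal sketch of that kind of graceful degradation is below; the profile service URL and response shape are hypothetical. If the call fails, the page shows the GUID rather than failing entirely.

```typescript
// Hypothetical profile lookup: resolve a user's friendly name from their GUID.
async function fetchFriendlyName(userGuid: string): Promise<string> {
  const response = await fetch(`https://profiles.example.com/users/${userGuid}`);
  if (!response.ok) throw new Error(`profile service returned ${response.status}`);
  const profile = (await response.json()) as { displayName: string };
  return profile.displayName;
}

// Graceful degradation: a controlled glitch instead of a dead page.
async function displayNameOrGuid(userGuid: string): Promise<string> {
  try {
    return await fetchFriendlyName(userGuid);
  } catch {
    // The profile service is down or misbehaving: show the GUID for now.
    return userGuid;
  }
}
```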

And there's a coding pattern called the circuit breaker pattern. It's exactly what you think it is. When one of the services stops working, it breaks the circuit. Then every so often it tries the circuit to see if the service is back up. If it's not up, it breaks the circuit again, waits a little bit longer, and then tries again. If it still doesn't work, it breaks the circuit and waits a little bit longer still. So this service on this side is not down just because it can't connect to that service. And the Azure DevOps team had this problem where the profile service took out the entire system. Millions of developers all over the world were unable to look at their code or work on their work items because the friendly name couldn't be displayed. I'm being a bit facetious with that, but the profile service was down and everything was down. That's insane.
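Here is a minimal sketch of the circuit breaker pattern as described above: after repeated failures the circuit opens, calls fail fast to a fallback, and the breaker waits a little longer each time a retry fails before trying the downstream service again. The threshold and timings are illustrative, not taken from any specific library.

```typescript
// Minimal circuit breaker: protects callers from a failing downstream dependency.
class CircuitBreaker<T> {
  private failures = 0;
  private openUntil = 0;   // timestamp (ms) until which the circuit stays open
  private waitMs = 1_000;  // doubles each time a retry fails

  constructor(
    private readonly call: () => Promise<T>,   // the protected downstream call
    private readonly fallback: () => T,        // what to return while the circuit is open
    private readonly failureThreshold = 3,
  ) {}

  async invoke(): Promise<T> {
    if (Date.now() < this.openUntil) return this.fallback();  // circuit open: fail fast
    try {
      const result = await this.call();
      this.failures = 0;
      this.waitMs = 1_000;                                     // healthy again: reset the backoff
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openUntil = Date.now() + this.waitMs;             // break the circuit
        this.waitMs *= 2;                                      // wait a little longer next time
      }
      return this.fallback();
    }
  }
}

// Usage with the fallback sketched earlier: show the GUID while the profile service is down.
// const breaker = new CircuitBreaker(() => fetchFriendlyName(guid), () => guid);
```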

So one of the practices, let's call it a philosophy, that you need to think about is this: we need to look at the impact on our users and make decisions based on our ability to maintain our service, maintain high levels of quality, and maintain people's ability to continue working within the context of our product, even when the unavoidable happens, which is that systems are going to break, systems are going to be down. How do you cope with that? If I were to say there's a best practice, it's not a best practice; it's maybe a best philosophy, and that's to continuously seek to better your product and its ability to support its users, and to do that continuously and relentlessly.

video DevOps Deployment Frequency Agile
