I often find myself in discussions about the best practices for enabling continuous delivery within teams. It’s a question that comes up frequently, and I want to address it head-on: there are no best practices in complex environments. Best practices are a concept that applies to simple tasks in straightforward situations where a procedure can be followed consistently to achieve the same results. However, the world we operate in is anything but simple.
Instead, I prefer to say that there are only adequate practices tailored to the specific situation at hand, and these practices can—and often do—change. This is a fundamental truth we must embrace. While we may not have a one-size-fits-all solution, there are several practices that many organisations have successfully leveraged to enhance their continuous delivery efforts. Let’s explore some of these practices and how they can support cross-functional collaboration without compromising quality.
Audience-Based Delivery
One of the most powerful practices I advocate for is the implementation of an audience-based delivery strategy. Traditionally, the delivery pipeline followed a linear path: development, testing, staging, and finally production. This model, while familiar, is fraught with inefficiencies and costs. It often leads to a scenario where quality is tested in rather than built in, which is the most expensive way to ensure quality.
By shifting to an audience-based delivery model, we can deploy small changes quickly to a limited set of users. This allows us to validate our product in real-world scenarios, which is invaluable. The Windows team at Microsoft exemplifies this approach. They deploy updates to internal users nightly, allowing developers to test their code in production almost immediately. This rapid feedback loop is crucial for continuous improvement.
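To make the idea concrete, here is a minimal sketch of an audience-based (ring-based) check, assuming a simple ordered list of rings. The ring names, the rollout table, and the `is_enabled` helper are hypothetical; real platforms such as Windows flighting use far richer targeting and telemetry.

```python
# Minimal sketch of a ring-based (audience-based) delivery check.
# Ring names, assignments, and the helper are illustrative only.

RINGS = ["canary", "internal", "insiders", "broad", "general"]

# How far each feature has currently been opened up.
feature_rollout = {
    "new-settings-ui": "internal",  # only canary and internal users see it
    "dark-mode-v2": "broad",        # everyone except the general ring
}

# Which ring each user belongs to (in practice this comes from a service).
user_ring = {
    "alice@example.com": "canary",
    "bob@example.com": "general",
}

def is_enabled(feature: str, user: str) -> bool:
    """A feature is on for a user if their ring is at or inside the rollout ring."""
    rollout_ring = feature_rollout.get(feature)
    if rollout_ring is None:
        return False
    ring = user_ring.get(user, "general")
    return RINGS.index(ring) <= RINGS.index(rollout_ring)

print(is_enabled("new-settings-ui", "alice@example.com"))  # True  (canary ring)
print(is_enabled("new-settings-ui", "bob@example.com"))    # False (general ring)
```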
Testing in Production
The concept of testing in production is another critical aspect of this audience-based model. There is no perfect simulation of a production environment; the only way to truly validate our work is to deploy it. By allowing a small group of users to access new features, we can monitor how the product performs and make adjustments as necessary. This practice not only enhances our understanding of user interactions but also helps us identify and rectify issues before a wider rollout.
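One common way to control the size of that audience is a percentage-based rollout that can be dialled up or down without a redeploy. The sketch below is a generic illustration rather than any particular vendor's API; hashing the user ID simply keeps each person's experience stable while the percentage stays the same.

```python
import hashlib

# Hypothetical rollout configuration; in practice this lives in a
# feature-flag service so it can change without a redeploy.
rollout_percentage = {"new-checkout-flow": 5}  # start with 5% of users

def in_rollout(feature: str, user_id: str) -> bool:
    """Deterministically bucket a user into 0-99 so the same user gets the
    same answer for as long as the percentage stays the same."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percentage.get(feature, 0)

# Monitoring looks good? Widen the audience. Something broke? Dial it back to 0.
rollout_percentage["new-checkout-flow"] = 25
print(in_rollout("new-checkout-flow", "user-1234"))
```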
The Philosophy of Find It and Fix It
When we talk about continuous delivery, we must also adopt a philosophy of “find it and fix it.” If something slips through our automated checks and makes it into production, we need to investigate how that happened and adjust our processes accordingly. This isn’t just about fixing bugs; it’s about evolving our architecture and practices to prevent similar issues in the future.
For instance, the Azure DevOps team faced challenges when a non-essential service caused significant disruptions. They learned that it’s better to allow the system to continue functioning, even if it means displaying a less-than-ideal user experience, rather than taking everything offline. This approach not only maintains user productivity but also provides valuable insights into system resilience.
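As a rough illustration of that "degrade rather than die" choice, a fallback like the one below shows a raw identifier when a non-essential dependency is unavailable. The `fetch_display_name` call and the exception type are assumptions for the sketch, not the Azure DevOps implementation.

```python
# Sketch of graceful degradation: if a (hypothetical) profile service is
# down, show the raw user ID instead of failing the whole page.

class ProfileServiceError(Exception):
    """Raised when the profile service cannot be reached."""

def fetch_display_name(user_id: str) -> str:
    # Placeholder for a real call to a profile service.
    raise ProfileServiceError("profile service unavailable")

def display_name(user_id: str) -> str:
    try:
        return fetch_display_name(user_id)
    except ProfileServiceError:
        # Degraded but functional: the user sees an ID, not an error page.
        return user_id

print(display_name("user-guid-1234"))  # falls back to the ID
```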
Embracing Change
Ultimately, the key takeaway here is that we must continuously seek to improve our products and our processes. This relentless pursuit of betterment is not merely a best practice; it’s a philosophy that should underpin everything we do. By embracing change and being willing to adapt, we can create a more robust and responsive delivery pipeline that meets the needs of our users.
In conclusion, while the term “best practices” may be misleading in the context of complex environments, there are certainly effective practices and philosophies we can adopt. By focusing on audience-based delivery, testing in production, and a commitment to continuous improvement, we can enhance our ability to deliver quality software efficiently. Let’s keep the conversation going and share our experiences as we navigate this ever-evolving landscape together.
Transcript

I often get asked about the best practices that help teams do continuous delivery. And I'm going to start right up front by saying there's no such thing as best practices when you work in a complex environment. There's no such thing as best practices. Best practices are for simple work in simple environments, where you can have a procedure and follow it consistently. It becomes the best practice, the best way to do it, and you get the same results every time. That's not the world that we live in.
So the phrase I quite often use, which is, I guess, a little bit passive-aggressive, is that there are no best practices; there are only adequate practices for the situation at hand, and the situation might change. That's fundamentally what we're talking about. But there are a bunch of practices that we see many organizations leveraging and getting success from, and we should try them and see if they work for us. That maybe makes more sense than best practices.
So the question usually is, if I take out the word best: what practices enable cross-functional collaboration to support continuous delivery without compromising quality? One of those practices is some way to control what code ends up in production, or better, what code is enabled for which people. That's a very powerful practice. Most organizations, most products, have moved, are moving, or are thinking of moving towards more of an audience-based deployment or delivery pattern rather than an environment-based delivery pattern. There are still environments within the context of this, depending on how it's set up.
But one of the core practices that supports this idea of continuous delivery, that supports this idea of continuous quality in production, is definitely moving towards an audience-based delivery strategy. In the olden days, the delivery strategy was Dev, test, staging, production. Everything was done in Dev: the developers built all the stuff, and then it got pushed to test. The testers tested all the stuff, and then it got pushed to staging, where something else usually happened, like load testing, and then it got pushed to production. Maybe, if you were deploying to customers, there was also a UAT environment in the way. These are all costs, and they're extreme costs, and they're not-worth-it costs. They not only have a cost at the time, they have massive cost implications for our ability to build the right thing, massive cost implications for the cost of fixing stuff later, and massive cost implications because we're effectively testing quality in rather than building it in. Testing quality in is the most expensive way to gain quality. Building quality in is how we should be doing it.
So this practice of audience-based delivery means we switch to a model where, and I'm going to use words that make it sound like it, we're testing in production. And in fact, that is one of the terms we do use in this context: testing in production. The reality of the world we live in, building the complex, interconnected systems that we all build and work on, is that there is no place like production. There's no way to simulate production. There's no way to truly validate that what you've done works in production until you get to production. So wouldn't it be better if we could get a small change quickly into production for a small set of users, and then be able to increase or decrease that user set on demand, so that we can validate that the product works in the real world and in real scenarios? That's effectively what we're talking about with this set of practices, this idea of shifting left and continuous delivery. And there are a lot of practices that help with that.
So an audience-based deployment model is probably the main thing. And if you're thinking, "Oh, our product is too big and too complicated to be able to do that," well, the Windows team moved to it. Windows is deployed on an audience-based model rather than a more traditional environment-based model. Because it's a physical product that's physically deployed to people, there's still a little bit of the old-school environment model in there, for sure, so it's not a complete switch; cloud products can go all the way. But their time from cutting code to it being in production with real users is, I think, only a few hours internal to the Windows team. At least nightly, as I understand it, they're deploying new versions of Windows out to all of the participants within Microsoft.
So if you're inside of Microsoft and you take the BG build of Windows, that's their internal IT department's build, so you're not self-managing, then you're getting nightly builds of Windows, or at least I think many people are. That means that what the developers wrote yesterday, you're testing today, and it's in production. Because you're, say, a manager in Microsoft doing your day job, which is managing people. You might be managing marketing people inside of Microsoft, or managing consultants, or managing whatever, and your machine has the latest version of Windows. You're using it in production. So that's getting into production as quickly as possible.
And then what the engineering team is doing is monitoring the telemetry. This is the audience-based deployment model: they're monitoring the telemetry and deciding whether they want to open that particular build out to more people, the next ring. Microsoft calls them rings, a ring-based deployment model, but it's really audience-based. Each ring has an audience of people, they're all in production, and they're just opening the build out to more and more people. That's a pretty simple version, because it is a physical product that's deployed on your machine: your operating system.
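A rough sketch of that promotion decision might look like the following, assuming telemetry has already been aggregated per ring; the ring names, metrics, and threshold are hypothetical, not how the Windows team actually gates its rings.

```python
# Hypothetical telemetry-gated ring promotion; all values are illustrative.

RINGS = ["canary", "internal", "insiders", "broad", "general"]

# Aggregated telemetry for the current build, keyed by ring.
telemetry = {
    "canary":   {"sessions": 5_000,  "crashes": 3},
    "internal": {"sessions": 80_000, "crashes": 20},
}

MAX_CRASH_RATE = 0.001  # promote only if fewer than 0.1% of sessions crash

def ready_to_promote(ring: str) -> bool:
    stats = telemetry.get(ring)
    if not stats or stats["sessions"] == 0:
        return False
    return stats["crashes"] / stats["sessions"] < MAX_CRASH_RATE

def next_ring(ring: str):
    i = RINGS.index(ring)
    return RINGS[i + 1] if i + 1 < len(RINGS) else None

current = "internal"
if ready_to_promote(current) and next_ring(current):
    print(f"Opening the build out to the {next_ring(current)} ring")
else:
    print(f"Holding the build at the {current} ring")
```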
So it has to run on bare metal and in the cloud, but it runs on metal. If you look at something like Microsoft Teams or Office 365, though, they have the ability to switch features on and off for specific users. So regardless of what build is shipped in Microsoft Teams, for example, and I'm in the TAP program for Teams, basically their version of the insiders, I get features before the general public. Everybody gets features, and that can be specific to me as an individual user within my company or to all users within my company, and that enables choice. So even as a customer, in the TAP program I can choose to get the, oh my goodness me, cutting-edge latest and greatest, while somebody else in my business gets the reasonably stable build that's had its tires kicked and is ready for moving to a wider, more general public audience. And then the general public have a way to opt into some extra features and things.
So we're all able to communicate with each other. We can all join the same call, and some people are using different features from other people within the context of that call; some of them have new capabilities, whole new code bases, running their part of that story. And it's really interesting, because I do calls with folks at Microsoft, and I've had folks at Microsoft who are on much earlier builds than me because they're choosing to help out that team or they work on that team. And yeah, occasionally their call drops and they have to log back in, like, "Oh, sorry, early build, I got a bug." There's a risk-benefit analysis there. If you're working in a company and you want to take the earlier features so that you can pre-validate them for your company, understand what they are to help with training people in your company, and understand what's coming down the pipeline, then you can choose to do that. But you're choosing to take a little bit of risk, because it's going to be a little bit less stable.
This is the idea of testing in production. I'm not expecting a complete crash where everything breaks and nothing works, but the occasional glitch, the occasional weirdness, I'm good with that. I teach training classes on Microsoft Teams. I teach all my classes on Microsoft Teams, using Microsoft Teams breakout rooms, using all those things. I'm in the TAP program, so I have earlier capabilities, and occasionally things go a little bit weird for me. That's just a teaching moment in the class, because we're talking about how we deliver software and how we deliver products. And part of that is accepting that there are going to be some mistakes. If you're doing continuous delivery to production, there are going to be things that get past your automated gates and end up in production. It's what you do with that information.
That's one of the best complementary practices, and I am going to use the word best there: that philosophy and how you apply it. In fact, it's not even a practice; it's a philosophy. You need to have a philosophy of find it and fix it. So if something does make it into production and you're doing continuous delivery, you need to figure out how and why it got past your automated checks, and how you can change your automated checks to be able to catch those things. That's it. And if you find, "Oh, it's not possible to change our automated checks because of the way we've architected the system," then I would expect a team to be asking themselves, "Should we be changing our architecture so that these types of problems can't make it into production? And how long is that going to take?"
A great example: the Azure DevOps team had a bunch of incidents where one service that really shouldn't be mandatory took out the entire platform. They're running an online platform, and the profile service, this was their first example, stops working. Does it matter whether you're showing the ID, the GUID of the user, versus the friendly name of the user? Because the profile service is what gives you that friendly name: I've got the GUID, and it gives me back the friendly name. But what if that profile service is down? Would you rather your entire system was down, or that it showed a GUID in place of a username in some cases? In some small number of cases, I'd rather it showed the GUID and the system still worked, because then my users can still do their job. My users can still use the system; they just see a small, controlled glitch. And then when that profile service comes back up, or we fix it, that turns back on again.
And there's a coding pattern called the circuit breaker pattern, and it's exactly what you think it is. When one of the services stops working, it breaks the circuit. Then every so often it tries the circuit to see if the service is back up. If it's not, it breaks the circuit again, waits a little bit longer, and then tries again. If it still doesn't work, it breaks the circuit and waits a little bit longer still. So the calling service isn't down just because it can't connect to the service it depends on. The Azure DevOps team had this problem where the profile service took out the entire system, so millions of developers all over the world were unable to look at their code, work on their work items, do all these things, because the friendly name couldn't be displayed. I'm being a bit facetious with that, but the profile service was down. That's insane.
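As a rough illustration, here is a minimal sketch of the circuit breaker idea described above, with a back-off that grows on each failure. The class, the thresholds, and the stand-in `fetch_display_name` call are assumptions for the example, not the Azure DevOps implementation.

```python
import time

# Minimal circuit breaker sketch: stop calling a failing dependency and
# only retry after a back-off period that grows on each failure.

class CircuitBreaker:
    def __init__(self, initial_wait: float = 1.0, max_wait: float = 60.0):
        self.initial_wait = initial_wait
        self.wait = initial_wait   # current back-off period in seconds
        self.max_wait = max_wait
        self.open_until = 0.0      # while "open", calls are short-circuited

    def call(self, fn, fallback):
        if time.monotonic() < self.open_until:
            return fallback()      # circuit is open: don't even try the service
        try:
            result = fn()
            self.wait = self.initial_wait  # success: reset the back-off
            return result
        except Exception:
            # Failure: open the circuit and wait a little longer next time.
            self.open_until = time.monotonic() + self.wait
            self.wait = min(self.wait * 2, self.max_wait)
            return fallback()

def fetch_display_name(user_guid: str) -> str:
    # Placeholder for a real call to a (hypothetical) profile service.
    raise ConnectionError("profile service unavailable")

# Usage: fall back to the GUID when the profile service call fails.
breaker = CircuitBreaker()
name = breaker.call(lambda: fetch_display_name("user-guid-1234"),
                    fallback=lambda: "user-guid-1234")
print(name)  # prints the GUID because the call failed
```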
So one of the practices, let's call it a philosophy, that you have to think about is: we need to look at the impact on our users and make decisions based on our ability to maintain our service, maintain high levels of quality, and maintain people's ability to continue to work within the context of our product even when the unavoidable happens, which is that systems are going to break and systems are going to be down. How do you cope with that? If I were to say there's a best practice, it's not really a best practice; it's maybe a best philosophy, and that's to continuously seek to better your product and its ability to support its users, and to do that continuously and relentlessly.