Hello and welcome to naked Agility with Martin Hinshelwood. I’m Martin Hinshelwood, and I’m going to be talking about the live site culture and site reliability in the Azure DevOps team at Microsoft.
Okay, hi, and welcome. Today I want to talk a little bit about site reliability engineering. It’s something that I spend a lot of time thinking about, because I have many customers who have operational needs as well as the typical engineering needs. I think it’s important to find the balance between engineering and operations, and the Azure DevOps team have an interesting story about how they’ve managed to create that balance and build a culture inside their organisation that supports live site while still being delivered by engineering teams.
So we’re going to cover quite a few different topics, and I have a lot of information here about how the Azure DevOps team do their work as well as how they interact with us, the customers. But just to give you an overview, we’re going to talk about transparency and how they manage to build trust with their customers. We’re going to talk about the amount of telemetry that they collect and how they organise that. So how open and transparent actually are they with customers and how much telemetry do they write? And then how do they organise themselves around responding to things that happen?
So there’s going to be plenty of things that happen. They do scrum; they have three-week sprints. How do they make sure that they’re still able to deliver value while also being able to get other things done as well? Get their operational work done. What if the site goes down? What if there are other pieces of work that would make the site more efficient and less likely to go down? How do they prioritise that work?
And then we can have a little bit of discussion about automation. Automation is very important for an agile story. We talk about it all the time in Scrum: by your definition of done, at the end of every sprint your application should be potentially shippable, with no further work required by the engineering team to make that happen. So if your business decides, “Let’s ship to production,” there’s nothing else for the engineering team to do. Ultimately, your product owner can just push a button and ship to production, and that should be all that’s necessary.
And then there’s a little bit of a discussion around investigation, getting to the root cause, and how to continually improve your environment. I think that’s important as well. We talk a lot about how, in complex systems, it can be impossible to get to a single root cause. But if that were entirely true, we would never visit the doctor, because there would be no point. There are causes that we can identify. We may not find everything, but we can certainly do our due diligence and figure out how to get there.
But first, I have a little short story about an organisation that you may or may not be familiar with. It’s the story of how a company with nearly 400 million dollars in assets went bankrupt in 45 minutes because of a failed deployment. This was a company called the Knight Capital Group. They were listed on the New York Stock Exchange, and they were implementing what was effectively a new order handling system that allowed them to create child orders. This allowed them to do something they hadn’t done before, but it required them to replace the old code with new code and integrate from there.
There were really nine years of application building that had gone into this system. So you can imagine, if you have nine years of code in there, there’s going to be a lot of spaghetti. There are going to be a lot of difficult areas, lots of leftover things that people have built and then ignored for many years. They decided to repurpose an existing flag in the system to activate the new code, and that was one of the decisions that could, and did, have a negative impact.
At deployment time, their technician only copied the new code to seven of the eight servers. Then they flipped the switch, turned on that flag, and went live. Because one of the servers was still running the old code, the system was not behaving correctly. It was doing strange things; it wasn’t able to process the orders coming through properly, and they started losing just under two hundred thousand dollars per second.
Because the system wasn’t working as expected, they obviously tried to fix it. So they dropped everything, everybody’s hair on fire, and came running, trying to figure it out, and they just couldn’t work out what the problem was. They spent the whole day with the system down and couldn’t figure it out. They didn’t realise that it was just one server that hadn’t got the new code. Everything looked normal: you test a server, things look good, so why is this not working? They couldn’t figure it out.
By the end of that day they had lost four hundred and sixty million dollars, and they filed for bankruptcy protection. An interesting question to ask yourself is: what would be the impact on your organisation of a key or critical platform that you provide to your customers being down for that length of time? And what can you do to prevent it?
It’s interesting; we know a lot about what happened to the Knight Capital Group because they were a publicly listed company, so you can read their SEC filing and see what happened. But they had no automation, no back-out procedures, nothing. It was a massive risk for the organisation. Really, we need to think about how you get better at doing something like that.
One of the things that I see a lot of organisations doing is thinking that they can create a back-out plan, that they can reverse the thing they failed to do forwards and be successful at it. If you can’t successfully deploy your product going forwards, the chances of you being able to successfully reverse that deployment are a lot lower than your ability to deploy in the first place.
So I think it’s really a losing proposition. What I see in most engineering organisations that move towards continuous delivery, towards DevOps, towards delivering faster, is that they adopt a roll-forward mentality rather than relying on things like rollback, which just aren’t really viable anymore.
Microsoft has made a lot of changes over the last few years. Ten years ago, Microsoft was a waterfall organisation. No matter how many tools they shipped for doing agile things, they were a waterfall organisation. They were deploying Visual Studio and TFS every two years. That’s the particular department that I’m going to talk about today inside of Microsoft. Times have changed. They were deploying every two years, with a service pack halfway through, and that’s no longer viable. It’s no longer okay to respond to feedback from your customers on those kinds of timeframes.
A big example of that was Windows 8: a massive failure to understand the customers’ needs. There was a big disconnect, and a multi-billion dollar loss, which basically came from not having tight enough feedback loops. So it is important; it is something that you need to do with your customers: shorten those feedback loops, understand their needs a little bit better, and get things out the door.
Even the Windows team, who deploy to nearly a billion machines worldwide and have four and a half thousand software engineers working on their product, are now doing continuous delivery to production. We get it as the general public every 30 days: it used to be Patch Tuesday; now it’s effectively a whole new version of Windows on Tuesday, with new features and new capabilities on that continuous cadence so they can get feedback.
They have shorter releases as well; they’re not just deploying once a month. I have two machines here that are on the Windows Insider programme, and I get weekly deployments as long as everything looks good. If you’re inside of Microsoft and on a corporate machine, which the CEO is, then you get daily builds from the dev branch of Windows.
So quality is important. They’ve gone from deploying to production maybe once or twice a year across all of their products to something a little more staggering: over a hundred and sixty-three thousand deployments per day. That’s to any environment, but it’s an incredible figure across the organisation, and that’s with ninety-six thousand engineers. So there’s more than one deployment per day per engineer inside of Microsoft now.
Two million git commits per month, five hundred million test executions per day. This is a lot of data, a lot of things going on. To support that, Microsoft uses a product called Azure DevOps, which used to be called TFS. It’s also been called Visual Studio Online and Visual Studio Team Services; it’s had a bit of an identity crisis over the years.
But the Azure DevOps platform has been Microsoft’s platform of choice. It was built in order to support their transition towards this new way of working, and almost everybody inside of Microsoft is now using it to manage their work and deploy. Some of them have moved over onto GitHub for source control, but for work item tracking, for automated builds, the majority of folks inside of Microsoft, my understanding is they’re using Azure DevOps.
So it was built with that in mind, with scale in mind, with numbers that large in mind. These latest figures, from Donovan Brown’s presentation, are fairly big numbers. In order to support this, creating the platform that Microsoft and many other people around the world use to manage their engineering efforts, the Azure DevOps team had to create this live site culture inside their organisation.
One of the things that they really focus on is, “You code it, you build it, you deploy it, you run it.” If you’re going to be the one that writes the code, you should be the one that gets up at three o’clock in the morning because the thing that you’ve written is not working properly. There are some caveats to that, which we’ll talk about as we go through. But ultimately, the software engineers, the people writing the code, need to feel the pain of any problems with deployment, supporting the product, managing it online, security, or anything else that would be a problem.
So they don’t have separate departments that manage these things. This is one of the big transitions, the big flip, for organisations: the change from being a predominantly departmental organisation towards cross-functional delivery teams who are able to take an idea all the way to production without being dependent on external teams.
That’s really important for the transition towards a greater degree of agility, because as soon as you’re dependent on somebody else outside your team who has different motivations and different priorities, you’re not going to get things done very quickly; at some point you’re going to have to wait for them. We have to remove those wait times, so rather than going out to an external department, we bring that capability into the team.
So there’s representation on the team of security, of legal, of obviously coding, engineering, test, operations, and all of those ideas, those hats, those roles are represented inside of the team. You need to automate; you need to learn and share with each other all the time. We need to be getting better at what we do constantly, and we need a lot of data to help us figure out what’s going on.
Anybody that manages a live site will know that live site comes first. No matter what you’re doing, if production is down, you drop everything and go figure that out. So how do we organise people in a way that supports that? This is about how the Azure DevOps team have managed to do it.
This is an example, a story of one way you can do it. Don’t just copy the way they’re doing things wholesale; you want to see what works for you in your organisation, within your organisational constraints, and within your application and platform constraints as well. This particular team is building a web application, but there are other teams inside Microsoft that use a similar process to manage desktop applications, as well as Windows and other types of application models.
So you need to create a culture within which this works. Live site is about bringing the right people together from both the app and the platform so that we can keep our system up and keep delivering the things that we, as users of the system, really want. But why is it so important to keep everything up?
There are great examples of how those things have been difficult for organisations in the past. Visual Studio Online, which was one of the names for Azure DevOps, has had outages. It’s not possible for a system not to have outages; it’s about how you deal with them when they happen. Visual Studio Online has been down; Amazon has been down. There’s a start-up with 25 million in funding in crisis because an employee deleted the wrong files. These things happen.
We want to do our best to prevent those things from happening. We want to do our best to respond to those things when they do happen, and we want to do our best to make sure that things that do happen don’t continue to happen. That’s how we build trust with our customers because really, we’ve got to be able to deploy new features.
That’s the ultimate battle between engineering and operations, that idea of engineering is measured by the number of fantastic features they get into production, but operations is measured by uptime. Generally, those things are antagonistic. If you want to keep it up, you don’t deploy new features. If you want to deploy new features, you’re not going to be able to keep it up.
So there’s a fight there, but when we bring those two groups together, they can have a real conversation about how they can do both of those things and still be able to delight their customers, have high levels of uptime, have fantastic features, and be able to move forward. That’s really what I want to talk about as we go through this today.
So the first thing I want to talk about is transparency. How does this team build transparency with their customers? I’m going to try to split this between good, complementary practices that you can use with your Scrum team or your Kanban team as part of your agile practice, and what this team specifically has done.
That’s why I’m going to show some of the details and governance of the work they’ve done around that. We need transparency to build trust with our customers. We need to be understanding when things happen, as well as doing our best to understand how and why they happened, so that we can figure out what to do about it.
Customers are not happy if you just tell them the system was down, we rebooted the server, and now it’s back up again. Why was it down? Is it going to go down again? We no longer have a level of confidence in your ability to maintain that system if you don’t even know why there’s a problem or why it’s going down.
So we need to understand that and build transparency to build trust with our customers. This is an example of some of the output you will get from the Azure DevOps team. You can go and find it on what was the Visual Studio Team Services blog; it’s the same blog, it’s just been moved around a little bit and is now the Azure DevOps blog. They do a full post-mortem, and they publish all of the data.
If I zoom in here with my magnifying glass… there we go. They show the data from when things went bad, what they investigated to understand what went wrong, how they mitigated or tried to mitigate it, the actions they took, and the timeline it all went through.
So it’s a really powerful story, and they also publish what they’re going to do to fix it. If I zoom in… there we go. They make a bunch of commitments to their customers to improve the service, to make sure that that type of problem, those outages, don’t happen again. For every outage of Azure DevOps, big or small, you will find these types of posts and this type of data.
The smaller the outage, the less impact the outage, obviously, the less effort you want to spend in some of these areas. But when the whole system’s down, you’re going to see a post, and that looks like this with a bunch of data in it. The leadership of the Azure DevOps team is adamant that they want to create that transparency.
This is a reply posted by Brian Harry, who was the product unit manager for that department. He really instigated the idea of moving to the cloud and empowered that part of the organisation to make the changes that resulted in them becoming an agile organisation, delivering to production at least every three weeks. You can see the type of transparency he’s trying to create. He’s saying: we’re not going to be successful 100% of the time. That’s just reality, and we have to accept reality. There’s no such thing as a hundred percent uptime; it’s just not possible, no matter what folks say. You have to have downtime for maintenance.
We have to have a little bit of wiggle room; you can’t be 100 percent up. But how do we make sure that we minimise the amount of downtime that we have? Well, there are lots of things we can do. The first is to communicate: letting people know that there’s some sort of problem, that there’s an issue. This is an application that the Azure DevOps team have created for that. In 2013 it was just a spreadsheet, and their time to notify their customers was 45 minutes.
In 2017, that was down to 15 minutes. Within 15 minutes of that red button getting pushed, of understanding that there’s a problem, there’s something up on the status page and something up on the blog. For the internal Microsoft customers and the key stakeholders, there are emails going out to the people that need to know. I have a couple of customers who are among the top 20 largest customers on Azure DevOps, and they get notifications almost before they know there’s a problem themselves, which is really powerful for them, because they can then have the real discussions they need inside their organisation.
“Yes, we’re aware of the problem; Microsoft is working to fix it.” Those things build trust. If we know, we understand, we’re aware, then we can have that conversation. On that note, I remember one of my good friends at Microsoft saying that the reason he first got onto Twitter was that, when they first moved to the cloud, he found out from people on Twitter that the system was down before his engineering team knew it was down.
The customers here are software engineers too, and as software engineers we are a particularly whiny and complaining group of people; we are happy to point the finger when something’s not working properly, and that’s where he found out first. So how do you create an environment in which you know first, not your customers? You don’t want your customers to know there’s a problem with the system before you do. For that, we need telemetry, we need data, and we need to understand that data in order to create the right alerts, so we know what we’re supposed to do when a problem arises.
Is it a problem we need to care about? Is it a problem that might go away? Do we just need to saunter up to it, or do we need to run at it with everybody we can get? Then there’s figuring out what we can do to prevent incidents by looking at the data and spotting the near misses: “Okay, that data did not look good. If we’d had more users on the system at that point in time, we would have had a problem.” What can we do to make sure that next time, when that metric spikes or drops the way it did, we’re ready for it?
Incident prevention and continuous improvement: how do we continually go around those loops, identifying the causes and figuring it out? The Azure DevOps team have created their own telemetry pipeline so that they can understand that data and do something with it. Because their servers are deployed to Azure, they can take advantage of the Azure infrastructure that’s there for everybody anyway. They don’t have any special powers in Azure; Azure treats them like customers too, albeit pretty big customers who are internal and have everybody’s email address so they can go look people up, but that’s a different matter.
It starts with having monitoring agents installed on all of your servers, having a set of metrics that help you understand what’s going on, and then feeding all of that into a big data platform. They have something called Cosmos, where they feed all of that data in, and then they use a query platform called Kusto that allows them to analyse that data, look for trends, and look for things that are worrying.
You can have monitors and you can have alerts; that’s reactive. An alert triggers, you get a notification, and you go do something about it. But are there also trends you can look for in the data to see when something’s going to happen, maybe even before it does? They have access to Azure Diagnostics, and they use Application Insights, which is built on top of Azure and has APIs for all the platforms. It’s a set of services that collect and analyse metrics and telemetry, and they feed all of that data into an Azure data lake.
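To give a feel for what that kind of instrumentation looks like from the engineer’s side, here is a minimal TypeScript sketch using the Application Insights Node.js SDK. The connection string, event name, and metric name are placeholders I’ve invented for illustration; this is not the Azure DevOps team’s actual code.

```typescript
// Minimal sketch of emitting custom telemetry with the Application Insights
// Node.js SDK (npm package "applicationinsights"). The connection string and
// the event/metric names below are made-up placeholders for illustration.
import * as appInsights from "applicationinsights";

appInsights
  .setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
  .setAutoCollectRequests(true)      // incoming HTTP requests
  .setAutoCollectDependencies(true)  // outgoing calls, e.g. SQL, HTTP
  .start();

const client = appInsights.defaultClient;

// A custom business event: something a dashboard or analytics query can slice later.
client.trackEvent({
  name: "WorkItemSaved",
  properties: { project: "Fabrikam", area: "Boards" },
});

// A custom metric: for example how long a page or job took, fed into trend analysis.
client.trackMetric({ name: "PageLoadMs", value: 412 });
```

The design choice here is simply that the feature code itself emits the telemetry, which is the “you code it, you run it” idea from earlier in a very small form.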
A lot of these features were built out as many teams at Microsoft moved towards this model. It’s a very powerful platform; I use it myself on my own tools as well. Something they really came to realise is that you have to gather everything; there is no data that is unimportant. You gather data about your SLAs, your alerts, your DevOps pipelines, the experiments that are going on. You collect data about your network performance, your platform, all of the things that are going on, right down to trace telemetry and performance data. Everything gets collected.
At the point in time when I got an understanding of the volume of data they were collecting, it was about seven terabytes of data per day on average: KPIs, job history, perf counters, trace and activity logs, platform and network data. To make use of that, you really have to understand the customer experience telemetry.
A user accesses the system; the request goes through a bunch of tiers, potentially hits the database, and then a result is served back to the customer. Most of your interaction with these systems is user-activated, and when it’s user-activated you really have to understand the flow. One of the things the Azure DevOps team do is generate an ID for every call into the system, which is passed all the way through, down to SQL Server, and they collect performance and trace data across the board.
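To make that end-to-end flow a bit more concrete, here is a rough TypeScript/Express sketch of the general correlation-ID pattern. The header name, route, and downstream URL are hypothetical; the Azure DevOps team’s real implementation will differ.

```typescript
// Sketch of the general correlation-ID pattern: every request gets an ID that
// is attached to logs and passed to downstream calls so a single user action
// can be traced end to end. Header name, route and downstream URL are illustrative.
import express from "express";
import { randomUUID } from "crypto";

const app = express();
const CORRELATION_HEADER = "x-correlation-id"; // hypothetical header name

app.use((req, res, next) => {
  // Reuse the caller's ID if one was supplied, otherwise mint a new one.
  const correlationId = req.header(CORRELATION_HEADER) ?? randomUUID();
  res.setHeader(CORRELATION_HEADER, correlationId);
  res.locals.correlationId = correlationId;
  next();
});

app.get("/work-items/:id", async (req, res) => {
  const correlationId = res.locals.correlationId as string;
  // Every log line and every downstream call carries the same ID,
  // so traces from the web tier and the data tier can be joined later.
  console.log(`[${correlationId}] loading work item ${req.params.id}`);
  const item = await fetch(`https://data-tier.example/items/${req.params.id}`, {
    headers: { [CORRELATION_HEADER]: correlationId },
  }).then((r) => r.json());
  res.json(item);
});

app.listen(3000);
```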
There’s actually a front-end flag that you can turn on; I can never remember exactly what it is, so I’ll maybe look it up or ask one of the teams. Oh, it’s underscore diagnostics. Let’s see if I can figure it out.
Let me just switch over. I have a lot of Azure DevOps organisations that I have access to, a lot of different platforms and systems, so I’m just going to bring one up. It’s taking its time to load, so maybe I’ll stick it over here while it does. Oh no, I have to hit switch at the bottom. Don’t you love it when you’ve got a tall screen and the button is way down at the bottom? That’s a user experience.
So they’re able to collect that perf telemetry across their entire organisation. Let me go into my migration tools setup. Let’s see if I can remember how to do this. I’m pretty sure it’s underscore… oh, is it underscore diagnostics? It is. That’s what happens when you’ve been working with the team for too long.
So if I turn on diagnostics and the perf bar (and you can do this on any of your accounts as well), I just want to show you the kind of data they’re collecting. You can see it at the bottom of the screen; it’s very small. Can I pop up a magnifying glass? There we go.
They are collecting the traces and timings of everything that has happened throughout the entire system, all the way through every web service, down to every database call and every plugin, so that you can go and analyse it. If you’re one of the developers on this team, you want as much data as possible to figure things out. You can see there’s actually a little smiley face at the bottom; I don’t know if you can see it all the way down there. A smiley face means the page met the KPI for how long pages are supposed to take to load. If you have a sad face, things are not going so well; here it says all is good and all the checks are passing.
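The smiley face is essentially a KPI threshold check. As a toy illustration (the one-second budget here is a number I’ve picked, not the team’s actual KPI):

```typescript
// Toy illustration of the perf-bar smiley: compare a measured page load time
// against a KPI threshold. The 1000 ms budget is an invented example value.
const PAGE_LOAD_KPI_MS = 1000;

function perfBarFace(pageLoadMs: number): string {
  return pageLoadMs <= PAGE_LOAD_KPI_MS ? ":) all good" : ":( over KPI budget";
}

console.log(perfBarFace(412));  // :) all good
console.log(perfBarFace(2300)); // :( over KPI budget
```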
That’s one of the ways the team are able to leverage that data. For some reason I have lost all of the buttons on my screen, so let me just close that and reopen it so I can get back to showing you. There we go, I’ve got it back. Don’t you love it when you lose track of where you were? Right, about there.
So they have this ability to correlate everything across the board, end to end, and have it all set up and working. They can see all the dependencies and get all of the telemetry data. It means they have metrics and trend analysis to see what’s going on, and when something drops below the trend they want, they can dive in and see exactly when things happened, what the problem was, how many people were affected, and get access to that extra perf information that they have.
And that extends down all the way to their dependencies: if they depend on SQL Server and the SQL services, they’re able to get all of that as well and see the dependency activity. The amount of data they collect is unbelievable, because they need to keep a system up that all of these software engineers are using to check in code. Think about it: if Azure DevOps goes down, all of the millions of people that use it, and all of the companies whose software engineers rely on it, are down; they’re all unavailable. It’s not a good place to be.
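To give a sense of what looking for trends before an alert fires might mean, here is a deliberately simple sketch: compare the latest sample of a metric against a rolling mean and standard deviation. A real pipeline at this scale uses Kusto queries and far richer analysis; this only shows the shape of the idea.

```typescript
// Simple illustration of proactive trend analysis: flag a metric sample that
// drifts more than a few standard deviations away from its recent history.
// This stands in for the far richer analysis a real telemetry pipeline does.
function isAnomalous(history: number[], latest: number, sigmas = 3): boolean {
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance =
    history.reduce((sum, x) => sum + (x - mean) ** 2, 0) / history.length;
  const stdDev = Math.sqrt(variance);
  return Math.abs(latest - mean) > sigmas * stdDev;
}

// e.g. request latency in milliseconds over the last few intervals
const recentLatency = [220, 235, 210, 228, 241, 219, 225];
console.log(isAnomalous(recentLatency, 230)); // false: within the normal range
console.log(isAnomalous(recentLatency, 900)); // true: investigate before users notice
```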
You need to be able to go deep on the telemetry and really understand what’s going on at every level. They have an activity log schema; I’m just going to go through this very quickly, and I’m happy to send the slides out to anybody who gives me a shout. They collect all of their telemetry across the board, and that tool I just showed… oh, there we go, there’s the URL; I didn’t even know it was in here. They surface that data to the developers so that they can see it and get access to it.
They also aggregate it across the entire service, so they get insight into what’s going on holistically, but also per customer. If everybody at a particular customer is down, then they need to go deal with that. It’s not okay to say, “We have 5,000 customers and only one of them is down, so that’s not so bad.” If one customer is down, everybody at that customer cannot do their job on the service.
You need to make sure that you can see that as well and see it in real-time so that you can do something about it and get a true understanding of what’s going on. I think that’s super-duper important.
And that idea that customers are a bag of sand, I love that one: every grain of sand is a customer, and every grain is important. If somebody is having a problem, they’re going to tell ten of their friends about it. You can see here they use an impact threshold to understand when they need to have a bigger conversation, when they’re in breach of SLA, so that they can go do something about it.
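As a hedged sketch of what an impact threshold check could look like, here is a small TypeScript example. The thresholds and the record shape are invented; the point is that a whole customer being down matters even when the service-wide numbers look fine.

```typescript
// Sketch of an impact-threshold check: decide whether customer impact is big
// enough to escalate. The thresholds here are invented for illustration.
interface ImpactSample {
  customerId: string;
  affectedUsers: number;
  totalUsers: number;
}

function needsEscalation(samples: ImpactSample[], slaErrorBudget = 0.001): boolean {
  const affected = samples.reduce((n, s) => n + s.affectedUsers, 0);
  const total = samples.reduce((n, s) => n + s.totalUsers, 0);
  const serviceWideBreach = total > 0 && affected / total > slaErrorBudget;
  // A single customer with most of their users down is an escalation too,
  // even if the service-wide numbers still look healthy.
  const wholeCustomerDown = samples.some(
    (s) => s.totalUsers > 0 && s.affectedUsers / s.totalUsers > 0.9
  );
  return serviceWideBreach || wholeCustomerDown;
}
```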
I also know that the team go back over this data to look at even the smaller blips, when people had problems or performance issues: let’s go talk to the customer, figure out what the problem was, and work out what we can do to fix it. You’ve got to find a balance between the amount of noise, the amount of data you get, and the alerts you send to the developers, because you’ve got to be able to push new features at the same time.
If everybody is interrupted all the time, nothing gets done, so they try to minimise those alerts so that only the important things come through. You know how it is: if you get a hundred emails from one service, you’re going to either turn that service off, turn the emails off, or create a rule and shove them in a folder where nobody will ever see them.
So balancing that signal to noise is really important. Then you also need to be able to respond as quickly as possible to the problems that come up. You can’t have the development team chucking the code over the wall to the operations team, because what’s going to happen is that the operations team will just bolt on whatever monitoring they can.
Whereas if engineering, the dev team, is building out the monitoring, they can add deep, holistic monitoring into their application, which makes for a much better experience both for the engineers trying to figure out a problem and for the operations folks trying to support and maintain it. So they built around this idea of SRE, bringing DevOps and SRE together.
They used to have coders, testers, and operations as completely separate teams. Now they’ve got a combined engineering team with feature team engineers, your traditional coders and testers brought together into one role, and also, on the same team, live site engineers.
There are two pieces to that story; I’ve got them here in blue and purple. There’s the feature team live site engineer, which is a rotating role for the feature team engineers: at least one, if not two, of the feature team engineers for a sprint will be designated as the feature team live site engineer. That live site engineer is not going to be looking at value work, i.e. the traditional sprint backlog; they are solely going to be looking at work that helps improve the stability of the platform and maintain and manage it.
They also have access to dedicated site reliability engineers, folks with deep expertise in the platform and in monitoring that they can leverage. They’re all part of the same team: a combined engineering team that includes the rotating role as well as the dedicated live site SREs. Site reliability engineering is an important concept for creating a world in which we get very quick delivery to a platform that’s up as much as possible.
We want to automate as much as possible, but there are always going to be things that we have to do: compliance, security, those kinds of things. This is a pyramid from the things we must do at the bottom up to the most valuable things at the top. Product contributions are at the top, and the feature team live site engineer and the site reliability engineer are going to be working together, starting with the stuff at the bottom of this pyramid.
Then, as those things are complete, they flow up towards the top of the pyramid; the last thing they do is product contributions, when everything else is complete. That’s a powerful story, because they have to be able to respond to alerts. Those feature team live site engineers are the ones that get woken up at 3 o’clock in the morning to go deal with a problem.
The Azure DevOps team have built, I don’t know quite what you’d call it, an alerting and auto-routing system with different ways those alerts can be triggered: social, human (that’s a support call), automated alerting, and customer support. It goes into a rules engine, which then allocates it to the combined engineering team that is accountable for that part of the system, that part of the code.
Then they will do an impact assessment and decide what they want to do, how they’re going to respond to this problem. Are they going to do a general business-hours investigation, put it on the backlog, and get to it in the morning when everybody wakes up? Or are they going to have to wake a bunch of people up and run a 24/7 live site incident?
When a live incident happens, they create an incident bridge with all the people on it they need to figure it out. The feature team live site engineer is on there as the allocated person from the engineering team, along with the SREs. But maybe they need to bring in additional partners: if SQL is running slow, maybe they bring in somebody from Azure. Executive leadership might be brought in; there may be an incident manager. All sorts of things are triggered to get everybody together and get a resolution to the problem.
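Here is a hedged sketch of that routing idea in TypeScript: alerts from different sources go through a simple rules engine that picks the owning team and decides between a business-hours investigation and a live site incident. All of the sources, team names, and rules are invented examples, not the Azure DevOps team’s actual system.

```typescript
// Hedged sketch of alert routing: alerts from different sources flow through a
// rules engine that maps them to the owning combined engineering team and
// decides whether this is "backlog in the morning" or "open a bridge now".
type AlertSource = "monitor" | "customer-support" | "social" | "synthetic";

interface Alert {
  source: AlertSource;
  area: string;          // e.g. "Boards", "Pipelines", "Repos" (illustrative)
  affectedUsers: number;
  message: string;
}

interface RoutingDecision {
  owningTeam: string;
  severity: "backlog" | "live-site-incident";
}

const areaOwners: Record<string, string> = {
  Boards: "boards-feature-team",
  Pipelines: "pipelines-feature-team",
  Repos: "repos-feature-team",
};

function routeAlert(alert: Alert): RoutingDecision {
  const owningTeam = areaOwners[alert.area] ?? "live-site-triage";
  // Crude severity rule for illustration: broad customer impact, or a signal
  // coming straight from customers, opens a 24/7 incident bridge.
  const severe = alert.affectedUsers > 500 || alert.source === "customer-support";
  return { owningTeam, severity: severe ? "live-site-incident" : "backlog" };
}

console.log(
  routeAlert({
    source: "monitor",
    area: "Pipelines",
    affectedUsers: 1200,
    message: "Queue latency above threshold",
  })
);
// -> { owningTeam: "pipelines-feature-team", severity: "live-site-incident" }
```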
That idea of having live site people inside a software engineering team means they have slightly bigger teams than you would traditionally see for Scrum teams: Scrum teams are traditionally between three and nine people, and they run between ten and twelve. Two people on the team form the live site team, so the feature team looks at the sprint backlog and deals with that work, and the live site team deals only with live site issues and interruptions.
So they’re shielding the rest of the team from those potential disruptions while still working on things that add value to the product, but from the perspective of live site: mitigating future live site issues. That role rotates; it will be a different pair each sprint.
So again, they have priorities. If there’s a live site incident, that’s their highest priority. If there is no live site incident, they’ll be looking at past live site mitigation tasks. That could be things that enable them to get better at live site, or things that have been identified as problems and really fall into the category of technical debt: things that are not automated, things that are not quite the way they need to be.
Then they’re going to look at improvements in monitoring, telemetry, and alerts well before they look at adding any new features. That is their focus. The team works in three-week sprints. During sprint 124, for example, the deployment of sprint 123 is still ongoing. It takes a long time to do a deployment on a large system like this: more than a few days, more than a week to be honest, to get it fully rolled out.
The live site engineers are working alongside the feature team, even if the two are sometimes pulling in different directions. They also need to be able to manage things at scale, and the only way to do that is to automate. We need to automate everything, as much as possible. People make mistakes; that’s one of the reasons to automate. People need to go to sleep, and people forget things.
Think about that technician at the Knight Capital Group at the start of this talk, who neglected to deploy to one of the servers. People forget things; people make mistakes. Automated processes can tell us when they fail and when they don’t get everything done. And if we’re automating, we can have alerts that find things much quicker and tell us that something’s wrong.
Once you start mitigating your problems, you might have some manual mitigations that you have to perform; that’s usually where it starts. But if something happens often and the mitigation is always the same, you can automate the mitigation as well: have the automation flip the switch and then notify the engineers that there’s a problem they need to go fix.
But you’ve already mitigated the problem. That doesn’t mean there’s no work left to do: because we’re monitoring health all the time, we understand when those mitigations have been activated, and we ask how we get better at not having those problems in the first place. That’s really important.
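As a sketch of what automating a mitigation like that might look like, here is a hypothetical TypeScript example: a health check that fails repeatedly triggers an automated mitigation and then notifies the engineers. The health check, failover, and notification calls are placeholders, not anything from the team’s actual system.

```typescript
// Sketch of automating a mitigation that started life as a manual runbook
// step: after repeated health-check failures, apply a mitigation (here, a
// hypothetical failover) automatically and notify the engineers afterwards.
async function isHealthy(endpoint: string): Promise<boolean> {
  try {
    const res = await fetch(endpoint, { signal: AbortSignal.timeout(2000) });
    return res.ok;
  } catch {
    return false;
  }
}

async function monitorWithAutoMitigation(endpoint: string): Promise<void> {
  let consecutiveFailures = 0;

  while (true) {
    if (await isHealthy(endpoint)) {
      consecutiveFailures = 0;
    } else if (++consecutiveFailures >= 3) {
      // Mitigate first so customers stop hurting...
      await failOverToSecondary(); // hypothetical mitigation step
      // ...then tell the humans, who still own finding the real cause.
      await notifyOnCallEngineer(
        `Auto-mitigated ${endpoint} after ${consecutiveFailures} failed checks`
      );
      consecutiveFailures = 0;
    }
    await new Promise((r) => setTimeout(r, 30_000)); // check every 30 seconds
  }
}

// Placeholder implementations so the sketch is self-contained.
async function failOverToSecondary(): Promise<void> { /* e.g. swap traffic to a secondary */ }
async function notifyOnCallEngineer(msg: string): Promise<void> { console.log(msg); }
```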
Getting to the root cause is important: doing some kind of post-mortem to really understand what can be done to make the product better and improve quality so that we don’t have those issues in the future. Each feature team has its own goals, measures, and repair times that they can monitor to see what’s happening and be able to respond to change more quickly.
How do we know when a team is good at responding to change? They should be repairing problems quickly when they come up, and they should be having fewer problems over time. Those are the indicators we can look at.
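As a small illustration of the kind of repair-time measure mentioned above, here is a sketch that computes mean time to repair from incident records; the record shape and the sample data are invented.

```typescript
// Illustration of a repair-time measure: mean time to repair computed from
// incident records. The record shape and sample data are invented examples.
interface Incident {
  detectedAt: Date;
  mitigatedAt: Date;
}

function meanTimeToRepairMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce(
    (sum, i) => sum + (i.mitigatedAt.getTime() - i.detectedAt.getTime()),
    0
  );
  return totalMs / incidents.length / 60_000;
}

const lastSprint: Incident[] = [
  { detectedAt: new Date("2020-05-01T03:10:00Z"), mitigatedAt: new Date("2020-05-01T03:55:00Z") },
  { detectedAt: new Date("2020-05-09T14:02:00Z"), mitigatedAt: new Date("2020-05-09T14:20:00Z") },
];
console.log(meanTimeToRepairMinutes(lastSprint)); // 31.5 minutes
```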
So, in conclusion, we need quality and transparency to build customer trust. We need full transparency so the customers see everything that’s going on: we tell them about the problems, we figure out how to make things better, and we make sure everybody’s involved. We collect as much telemetry as possible so that we can have better insight into what’s going on.
We need to organise around responding more quickly. That means alerts, call chains, and knowing who we have to go deal with and who we have to go wake up. The SREs and the live site team inside the engineering team are what allow that for this group. Automate everything, as far as I’m concerned: if it’s not automated, it’s technical debt, and that’s something you need to go work on. You shouldn’t have anything that isn’t automated.
Then you have to get as close to the root cause as possible in order to continuously improve and figure out what those things are. Just to wrap up, let me make myself smaller on the screen; there we go. There are some good articles and videos there on SRE at Microsoft, how the different teams do it and what they mean by it.
How do I make myself disappear? I’ll just move over here. There we go. So take a picture of that and go to that URL, and you will be able to download this presentation. Please feel free to get in touch with me anytime. I am on Microsoft Teams at that URL, I’m on Twitter, you can WhatsApp me, and you can find more information on my blog. I’m happy to share presentations; all you have to do is ask.
Okay, thank you very much for listening, and I hope you’re able to come to one of our Scrum classes. We deliver a number of Scrum.org classes, and we are delivering all of our courses and material as live virtual classrooms at the moment. So please go take a look at nkdagility.com.
Okay, thank you very much. Like and subscribe.