How to Build for Business Resilience and Continuity

TL;DR: Building business resilience requires intentional design, strong observability, and aggressive decoupling so failures do not cascade across systems. Empower teams to act quickly, treat deployments as routine, and design for fast recovery using practices like chaos engineering and circuit breakers. Make resilience a core part of your culture and operations, not a one-time project, and use real metrics to guide continuous improvement.

Published on 26 May 2025

Written by Martin Hinshelwood

5 minute read

https://nkdagility.com/resources/VThLnxVapgJ

Business resilience is not an accident. It is the deliberate outcome of intelligent systems design, pragmatic decision-making, and organisational discipline. If you want resilience, you must build for it, upfront, consistently, and aggressively.

Here is a pragmatic checklist for engineering true business resilience and continuity:

Observability and Telemetry First

You cannot manage what you cannot see. You cannot fix what you cannot detect.

Embed telemetry at every level: application, infrastructure, business processes.
Define service level objectives (SLOs) for your critical systems and actually measure against them.
Monitor leading indicators, not just trailing failures.
Establish a live site culture, not a “we’ll find out when customers call” culture.

If your systems are invisible until they explode, you are not resilient; you are negligent.
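
To make "actually measure against them" concrete, here is a minimal sketch of an availability SLO check with an error budget. The function name, the fields, and the 99.9% target in the usage note are illustrative assumptions, not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    availability: float      # fraction of requests that succeeded
    error_budget: float      # failed requests the SLO tolerates in this window
    budget_remaining: float  # fraction of the budget still unspent (negative if overspent)
    meets_target: bool


def evaluate_slo(total_requests: int, failed_requests: int, target: float) -> SloReport:
    """Compare measured availability against an SLO target such as 0.999."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    availability = (total_requests - failed_requests) / total_requests
    # The error budget is the share of requests the SLO allows to fail.
    error_budget = (1.0 - target) * total_requests
    budget_remaining = 1.0 - failed_requests / error_budget
    return SloReport(availability, error_budget, budget_remaining, availability >= target)
```

With one million requests, 250 failures, and a hypothetical target of 0.999, roughly a quarter of the budget is spent. A budget that drains quickly is a leading indicator: you can act on it before the SLO itself is breached.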

Decouple Systems Aggressively

Coupling is a time bomb. When one piece falls, everything else falls with it.

Bounded contexts are non-negotiable. Embrace them.
No logic in the data tier. Databases store data, not behaviour. If your business rules are locked in SQL, you are one outage away from a complete operational collapse.
Avoid shared databases. Duplicate data if necessary. Loose coupling beats data purity.
Prefer asynchronous messaging. Synchronous call chains are brittle under load: one slow or failed dependency stalls every caller waiting on it.

Resilience comes from isolation. Systems must fail independently, not cascade like dominoes.
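
As a rough illustration of the asynchronous style, the sketch below uses Python's in-process `queue.Queue` as a stand-in for a message broker. The order and invoicing names are hypothetical, and a real system would use a durable broker rather than an in-memory queue.

```python
import queue


def place_order(events: queue.Queue, order_id: str) -> str:
    """Accept the order and publish an event; never call invoicing directly."""
    events.put({"type": "OrderPlaced", "order_id": order_id})
    return "accepted"


def drain_invoices(events: queue.Queue, invoices: list) -> int:
    """Consume whatever events are waiting; returns how many were processed."""
    processed = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return processed
        invoices.append(f"invoice for {event['order_id']}")
        processed += 1


# Orders are accepted even while the invoicing consumer is not running.
events = queue.Queue()
place_order(events, "A-1")
place_order(events, "A-2")

# When invoicing comes back, it catches up from the queue.
invoices = []
drain_invoices(events, invoices)
```

The producer depends only on the queue. An invoicing outage delays invoices; it does not stop orders.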

When the User Profile Service takes out the entire system

I have worked with the Azure DevOps teams at Microsoft for a long time, as a strategic customer and MVP, and I have witnessed this lesson firsthand. One of the major Azure DevOps outages was triggered by something that, at first glance, seemed trivial: the Profile Service. When the Profile Service went down, developers could no longer commit code, and product owners could not update backlog items. Why? Because the system could not resolve your friendly name from your authenticated ID.

The service was so tightly coupled into critical user flows that its failure crippled the entire platform.

In response, the teams created “live site incident” repair work and moved the Profile Service behind a circuit breaker. If the Profile Service went down again, it would degrade gracefully, not drag down the entire experience.

As an anecdotal aside, a few months later another unrelated service failed, and, unsurprisingly, it also took down large parts of the system. That was the final straw. The teams went on a full-scale mission to introduce the circuit breaker pattern across every service, making sure no single point of failure could collapse the platform again.

Decoupling and graceful degradation are not academic exercises. They are mandatory if you value continuity.
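
For readers who have not met the pattern, here is a minimal circuit breaker sketch. It is not the Azure DevOps implementation; the class, the thresholds, and the `display_name` helper are assumptions chosen to mirror the story above, where a failed profile lookup falls back to the raw ID instead of blocking the user.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fail fast after repeated errors, then allow a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()  # open: do not touch the failing dependency
            # Cool-down elapsed (half-open): let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call re-opens immediately; otherwise open at the threshold.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def display_name(user_id: str, lookup: Callable[[str], str],
                 breaker: CircuitBreaker) -> str:
    """Degrade gracefully: show the raw ID when the profile lookup is unavailable."""
    return breaker.call(lambda: lookup(user_id), lambda: user_id)
```

The fallback is the design decision that matters. A commit attributed to a raw ID is ugly; a commit that cannot be made at all is an outage.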

Treat Deployments as Routine, Not Special

Every deployment is a practice run for disaster recovery. If deployment is a risky, complex, orchestrated event, you have already failed.

Implement Continuous Delivery (CD) so that deployments happen safely, frequently, and predictably.
Use feature toggles to separate code deployment from feature release.
Automate rollbacks. A failed deployment should not require heroics.

If your organisation fears deployment day, it is structurally fragile.
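
Separating deployment from release can be as simple as the sketch below: the new code path ships dark and is switched on by changing a percentage. The flag name, the discount, and the hashing scheme are illustrative assumptions, not the API of any specific feature-flag service.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place each user in a bucket from 0 to 99 for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


def price_for(user_id: str, base_price: float, flags: dict) -> float:
    """The new pricing path is deployed, but only released to the rollout cohort."""
    if is_enabled("new-pricing", user_id, flags.get("new-pricing", 0)):
        return round(base_price * 0.9, 2)  # hypothetical new behaviour
    return base_price                      # existing behaviour
```

Setting the percentage back to 0 turns the feature off without a redeploy, which is usually the fastest rollback available. Because the bucket is derived from a hash, a given user sees the same behaviour on every request.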

Empower Teams to Act Without Hierarchy Paralysis

In a crisis, the last thing you want is a command-and-control bottleneck. Empowerment is a precondition to survival.

Pre-delegate authority for critical systems response.
Train teams on incident management procedures, disaster recovery, and failover operations.
Decentralise decision-making to the people closest to the work.

In a crisis, minutes matter. Top-down control costs lives and revenue.

Assume Everything Will Fail; Design to Recover Fast

Hope is not a strategy. Failure is inevitable. Recovery speed determines survival.

Chaos engineering is not optional; it is responsible practice.
Design for graceful degradation. Partial failure is better than total failure.
Practice recovery drills. Don’t just have a DR plan; rehearse it until it is boring.

If you are not recovering faster than your competitors, you are losing.
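
A chaos experiment does not have to start in production. The sketch below injects failures into a dependency and checks that the caller degrades instead of collapsing. All names are hypothetical, and dedicated chaos tooling does far more than this.

```python
import random
from typing import Callable, List


class InjectedFault(Exception):
    """Raised by the fault injector to simulate a dependency outage."""


def with_faults(func: Callable[[str], List[str]], failure_rate: float,
                rng: random.Random) -> Callable[[str], List[str]]:
    """Wrap a dependency so that a fraction of calls fail on purpose."""
    def wrapper(user_id: str) -> List[str]:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func(user_id)
    return wrapper


def fetch_recommendations(user_id: str) -> List[str]:
    # Stand-in for a call to a non-critical downstream service.
    return ["item-1", "item-2"]


def home_page(user_id: str, recommend: Callable[[str], List[str]]) -> dict:
    """Render the page even when recommendations are unavailable."""
    try:
        recommendations = recommend(user_id)
        degraded = False
    except Exception:
        recommendations = []  # partial failure: drop the optional section
        degraded = True
    return {"user": user_id, "recommendations": recommendations, "degraded": degraded}
```

Run `home_page` against `with_faults(fetch_recommendations, 1.0, random.Random())` and the page should still render, flagged as degraded. If it throws instead, you have found a cascade before your customers did.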

DevOps, Site Reliability Engineering, and Evidence-Based Management

Business resilience is DevOps in action: the union of people, process, and products to enable continuous delivery of value to end users. Resilient systems emerge from the daily discipline of CI/CD, Infrastructure as Code (IaC), and monitoring as first-class citizens.

It is Site Reliability Engineering (SRE) lived, not aspirational. SRE teaches us that availability, latency, performance, efficiency, change management, monitoring, and emergency response are all product features, just as important as the user-facing ones.

It is Evidence-Based Management (EBM) made real. Metrics like Mean Time to Recovery (MTTR), Deployment Frequency, and Customer Satisfaction are not vanity measures; they are survival metrics. They tell you whether your investment in resilience is paying off or is just theatre.
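
Two of these metrics are simple enough to compute from timestamps you probably already have. This is a minimal sketch; the input shapes (pairs of detected and resolved times, and a list of deployment times) are assumptions about how your incident and deployment records are stored.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_recovery(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


def deployments_per_week(deploy_times: List[datetime],
                         window_start: datetime, window_end: datetime) -> float:
    """Deployment frequency over the window [window_start, window_end)."""
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    in_window = [t for t in deploy_times if window_start <= t < window_end]
    weeks = (window_end - window_start) / timedelta(weeks=1)
    return len(in_window) / weeks
```

Track both over time rather than as single numbers. The trend is what shows whether the investment in resilience is working.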

Resilience is not a project. It is an ethos. You must architect it into your systems, invest in it continuously, and operationalise it ruthlessly.

Otherwise, you are gambling with your business and calling it strategy.
