The Well-Architected Framework: Operational Excellence

Over the past few years, both AWS and Azure have published a ‘Well-Architected Framework’ focussing on best practices for cloud architectures; however, these frameworks are about much more than the cloud and are a good reference for sound architecture in general. This article explores these frameworks as a more general set of measurements for good architecture.

The Six Domains

The frameworks published by Azure and AWS consist of a broad set of measurement categories, from operational excellence to sustainability, as shown in the diagram above. The most recently added category is ‘Sustainability’, which focuses on limiting environmental impact by evaluating the available execution regions and scaling resources to fit the load.

Infrastructure (compute, storage, network, etc.) has traditionally been managed as a Capex budget; in the world of cloud, however, it has become an Opex expense, which has led to some interesting surprises. This is a topic worth a more detailed analysis.

Operational reliability depends on the reliability of the individual building blocks (networking, compute, storage) but, equally importantly, on understanding the ways in which the distributed system as a whole can fail. The current focus in reliability is on observability: improvements in reliability, it is currently argued, come from observing the system dynamics closely.

Operational Excellence

Turning now to operational excellence: to achieve it, along with continuous improvement, a few topics are worth discussing. Managers often find it difficult to prioritise the competing demands for resources (people, funding, infrastructure). A consistent way to rank these demands, for example by potential value to the organisation, is valuable in itself and can also be built into a demand-management process.
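One way to make such a ranking concrete is sketched below. It is only an illustration, assuming each demand can be given a rough value estimate and an effort estimate; the names Demand and prioritise, the value-per-effort ratio and the greedy selection are all assumptions made for this example, not something the frameworks prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demand:
    name: str
    value: float   # estimated value to the organisation (hypothetical units)
    effort: float  # estimated effort, e.g. person-weeks; must be > 0


def prioritise(demands, capacity):
    """Greedily fund demands with the best value-per-effort until capacity runs out."""
    ranked = sorted(demands, key=lambda d: d.value / d.effort, reverse=True)
    funded, remaining = [], capacity
    for demand in ranked:
        if demand.effort <= remaining:
            funded.append(demand)
            remaining -= demand.effort
    return funded


backlog = [
    Demand("new reporting tool", value=10, effort=5),
    Demand("patch automation", value=9, effort=3),
    Demand("office move", value=4, effort=4),
]
funded = prioritise(backlog, capacity=8)
print([d.name for d in funded])
```

A greedy ratio is deliberately simple: it is easy to explain to the people raising demands, which tends to matter more in a demand-management process than squeezing out an optimal allocation.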

Next, how the organisation is organised can help with operational excellence. Ownership is the place to start. All resources should have an owner, and often several categories of owner. Applications should have a business owner; tools should have an IT owner. Within IT, every process should have an owner.
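An ownership rule like this can be checked mechanically against an inventory. The sketch below assumes a simple in-memory inventory of dictionaries; the resource kinds, the role names (business_owner, it_owner, process_owner) and the function missing_owners are hypothetical and would need to match whatever CMDB or tagging scheme is actually in use.

```python
# Hypothetical mapping from resource kind to the ownership roles it must carry.
REQUIRED_OWNERS = {
    "application": ("business_owner",),
    "tool": ("it_owner",),
    "process": ("process_owner",),
}


def missing_owners(resources):
    """Return {resource name: [missing roles]} for resources lacking a required owner."""
    gaps = {}
    for resource in resources:
        required = REQUIRED_OWNERS.get(resource["kind"], ())
        owners = resource.get("owners", {})
        missing = [role for role in required if not owners.get(role)]
        if missing:
            gaps[resource["name"]] = missing
    return gaps


inventory = [
    {"name": "billing", "kind": "application", "owners": {"business_owner": "finance"}},
    {"name": "ci-server", "kind": "tool", "owners": {}},
    {"name": "incident-review", "kind": "process"},
]
print(missing_owners(inventory))
```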

A topic that is almost never mentioned in the context of operational excellence is the way culture affects it. It starts with leadership and flows down. Ensure that teams are enabled and empowered to act when outcomes are at risk. That is, the culture should support action when the experts know what to do.

The two key words are act and risk. Escalation should be encouraged: if managers are kept in the dark, they can only be reactive. At the same time, managers must embrace the practice of escalation and not shoot the messenger. The last point about culture is that experimentation should be encouraged. This is the basis for deep learning within the organisation.

The last topic for this discussion is how to improve flow, that is, the smoothness of processes and workflows. This requires, first, that processes have owners who monitor them, not for compliance but for how well the process actually works.

A few small points around process: use version control, validate changes, use build and deployment management software, have design standards and, importantly, distribute and discuss those standards widely. The last point about flow is that making small, reversible changes leads to more stable systems and fewer sleepless nights.
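The idea of a small, validated, reversible change can be illustrated with a toy example. The sketch below assumes configuration held in a plain dictionary and a caller-supplied validation function; apply_reversible is a hypothetical helper written for this article, not part of any deployment tool.

```python
_MISSING = object()  # sentinel for keys that did not exist before the change


def apply_reversible(config, change, validate):
    """Apply `change` to `config`; undo it if `validate(config)` returns False."""
    previous = {key: config.get(key, _MISSING) for key in change}
    config.update(change)
    if validate(config):
        return True
    # Validation failed: restore exactly the keys this change touched.
    for key, old_value in previous.items():
        if old_value is _MISSING:
            del config[key]
        else:
            config[key] = old_value
    return False


config = {"workers": 4}
within_limits = lambda c: c["workers"] <= 16
print(apply_reversible(config, {"workers": 8}, within_limits), config)
print(apply_reversible(config, {"workers": 64, "debug": True}, within_limits), config)
```

Because each change touches only a few keys, the rollback is trivial to reason about, which is the practical argument for keeping changes small.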

Observability is currently a hot topic, and collecting the right kind of observability data can improve overall operational excellence in four ways: by acting as an early warning system, by identifying issues early, by greatly improving the ability to hunt down issues, and by providing post-incident data that helps prevent similar issues in future.

Observability is not just monitoring real-time events and workloads, CPU usage and memory levels. It must be an integrated approach: using correlation IDs to track events end to end through a complex environment, writing well-formatted event logs, and monitoring process performance (business processes within the system, customer journeys, etc.).
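As a rough illustration of correlation IDs combined with well-formatted logs, the sketch below uses only the Python standard library. The logger name "orders", the step field, the log messages and the handle_request function are invented for the example; a real system would also propagate the ID across service boundaries, for instance in a request header.

```python
import contextvars
import json
import logging
import sys
import uuid

# Holds the correlation ID for the request currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the correlation ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
            "step": getattr(record, "step", None),
        })


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    """Reuse the caller's correlation ID if present, otherwise mint a new one."""
    cid = incoming_id or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        logger.info("order received", extra={"step": "intake"})
        logger.info("payment authorised", extra={"step": "payment"})
    finally:
        correlation_id.reset(token)
    return cid


if __name__ == "__main__":
    handle_request(build_logger(sys.stdout))
```

Every log line emitted while handling one request shares the same correlation_id value, so a log search on that ID reconstructs the journey end to end.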

The last topic that this article touches on is understanding operational health. It may seem a strange topic for operational excellence, but knowing the KPIs, scorecards and OKRs is important to it. If operational excellence is measured and operational health is understood, that forms the foundation for continuous improvement.
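A scorecard can be as simple as a list of KPIs with targets. The sketch below is one possible shape, assuming each KPI has a single numeric target and a direction; the KPI names and figures are made up for illustration, and the share-of-targets-met score is just one of many ways to summarise health.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    name: str
    actual: float
    target: float
    higher_is_better: bool = True

    def met(self):
        """True when the actual value is on the right side of the target."""
        if self.higher_is_better:
            return self.actual >= self.target
        return self.actual <= self.target


def health_score(kpis):
    """Fraction of KPIs currently meeting their target (0.0 to 1.0)."""
    if not kpis:
        raise ValueError("at least one KPI is required")
    return sum(kpi.met() for kpi in kpis) / len(kpis)


# Illustrative figures only.
scorecard = [
    Kpi("availability_pct", actual=99.95, target=99.9),
    Kpi("mean_time_to_restore_min", actual=45, target=30, higher_is_better=False),
    Kpi("deployments_per_week", actual=12, target=10),
    Kpi("change_failure_pct", actual=4, target=5, higher_is_better=False),
]
print(health_score(scorecard))
print([kpi.name for kpi in scorecard if not kpi.met()])
```

Tracking the same score over time, rather than reading it once, is what turns the scorecard into an input for continuous improvement.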

The Well-Architected Framework put forward by AWS and Azure is a well-thought-out set of measurements that contributes not only to cloud architecture operations but can be applied more widely to improve happiness and longevity, as a by-product of improving operational excellence. In summary: manage demand, assign ownership, enable action on risk, improve flow, adopt holistic observability, and understand operational health. Profit!