1. Introduction
This post looks at Digital Transformation from an operational excellence perspective. I propose a simplified book review of “the textbook” on reliability engineering as a way to hammer home some key principles that make digital systems efficient.
There should be no surprises with this approach, since I have been advocating
for a while in my blogs that:
- Digital transformation is about digital capability
- Customer satisfaction is first about operational excellence
- The playground has changed - the GAFA have set new standards for operational excellence, as explained by the Web Giants book.
- I was privileged to get my first exposure to availability theory as a teenager from my late father Paul Caseau, on topics such as queuing theory, Jackson networks and Markov processes. He taught me about MTBF and MTTR through practical examples, such as shoes, when I was still very young.
- I started my career as an Operations Research scientist with many years spent on scheduling, planning and routing, with the later addition of stochastic optimization.
- I then got my first exposure to real-life QoS issues as Bouygues Telecom CIO 15 years ago. This resulted in my attempts to formalize what I saw: in books, in a piece of simulation and research about SLAs called “Optimization of Application Integration”, and later in the opening of this blog. The paper “Self-adaptive Middleware: Supporting Business Processes and Service Level Agreements” is very similar to what this post will describe.
- I had the remarkable opportunity of a few in-depth conversations with key SRE people at Google including world-class stars that are quoted in the book that I will present today.
- Last, my experience as a lecturer at Ecole Polytechnique has helped me to formalize what I have learned from the trenches.
I plan to share a detailed summary of the “Google
SRE” book because most of what I have learned and know from experience may be
found in this book. On the one hand, this is a book about large-scale distributed systems and how to design and operate them
in a reliable manner. This is a complicated topic and there is a wealth of
knowledge to learn and share. On the other hand, the technology “lego box” has
changed: things that were very hard to design 10 years ago are much easier
today thanks to open source technologies such as cloud, containers, distributed
orchestration, distributed storage platforms, etc. High-availability is no
longer a “high end” feature and many small companies including startups build
high-availability architectures with QoS performance that would have been hard
to achieve for a telco 20 years ago. Many of the open source pieces that
created this “reliability revolution” come from Google components that are
described in the SRE book. Readers of this blog who know about “Autonomic Computing”
will notice that Google has done nothing less than deliver an ambition proposed by IBM over 15 years ago. Self-monitoring, self-optimization, self-provisioning and self-healing are characteristics of modern reliable systems.
This is not a linear book review, because the material
is very deep (the book is 475 pages long). I selected nine key topics that are
closely related to digital transformation and to my experience as a CIO; then I
tried to summarize the key insights from the book. Managers are not the main
audience for this book, which is foremost written for practitioners. For
technical readers this book is a treasure trove of insights, examples, and
practical advice. However, I strongly suggest that everyone open this book and read
at least the first hundred pages.
2. Google Site Reliability Engineering
The book “Site Reliability Engineering - How Google runs production systems”,
edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy,
is a collective book from Google SRE members: “This book is a series of essays written by members and alumni of
Google’s Site Reliability Engineering organization. It’s much more like
conference proceedings than it is like a standard book by an author or a small
number of authors”. This heavy book is
filled with case studies and technical details about the different tools that
are used by Google engineers. Although the 33 chapters are mostly focused on
practical issues and problems, this is also a principled book about computing: “We
apply the principles of computer science and engineering to the design and
development of computing systems: generally, large distributed ones”. True
to the principles that I will present very soon, the book is rich with failure
analysis and “anatomy of an unmanaged incident” examples.
The relationship with the theme of this blog, “biology of distributed information systems”, will become self-evident when you read this
post. As mentioned earlier, the ambition of “autonomic computing” is also deeply embedded: “The global computer … must be self-repairing to operate once it grows
past a certain size, due to the essentially statistically guaranteed large
number of failures taking place every second”.
2.1. Information System Reliability is a Strategic Imperative
The book starts with the observation that the real world of information
systems is chaotic, both because of its size and complexity, and because of the astounding number of changes that affect these systems daily. The
first chapter is written by Ben Treynor Sloss, Google’s VP for 24/7 Operations,
originator of the term SRE, who claims that reliability is the most fundamental feature of any product. Ben
Treynor is the father of SRE: “SRE is
what happens when you ask a software engineer to design an operations team.
When I joined Google in 2003 and was tasked with running a “Production Team” of
seven engineers, my entire life up to that point had been software engineering”.
SRE is the story of a quest towards reliability through automation and
simplification. He quotes C.A.R. Hoare in his Turing Award lecture: “The price of reliability is the pursuit of
the utmost simplicity”.
SRE is the brainchild of a scientist and engineer who looks at failure
and reliability with cold unemotional eyes and tries to maximize the outcome
while reducing the effort. By construction it is a “DevOps” approach where the
skills of software engineering are applied to improving operations. Obviously,
this approach draws from the scaling and size issues experienced at Google
over the years: “Ensuring that the cost of maintenance scales sublinearly
with the size of the service is key to making monitoring (and all sustaining
operations work) maintainable”.
SREs (Site Reliability Engineers) work with the complete product team, from product owners and development teams to operations specialists and partners. Following engineering practices, business and user goals are translated into explicit and measurable goals that can be “engineered”. With no surprise to readers of this blog, the key metrics used throughout the book are availability, latency and throughput: “User-facing serving systems … generally care about availability, latency, and throughput”.
As an engineer, one knows that a price must be paid for any complex
requirement, including performance and reliability. Failures and mistakes are
managed in a cold engineering approach, without exaggerating the requirements.
In many places, the book warns us against “over-safe, over-expensive” approaches and urges us to keep looking for a balance.
2.2. Distributed Systems
As mentioned in the introduction, this is a book about distributed
systems: “As SREs, we work with
large-scale, complex, distributed systems.” Distributed systems engineering
is a wonderfully exciting discipline, albeit a difficult one. It takes time,
energy and dedication to become an expert in such matters. It also takes
humility and curiosity. “Systems are complex. It’s quite likely that
there are multiple factors, each of which individually is not the cause, but
which taken jointly are causes. Real systems are also often path-dependent, so
that they must be in a specific state before a failure occurs.”
The emphasis on “distributed” here means that one must learn to step
back and look at the whole system, not the platform or the component that is
actively being built. One of my key actions as a manager is to promote the “One System” culture, which is the understanding that we (as an organization) are building ONE large, complex and distributed system. This requires observing and sharing: “There are many ways to
simplify and speed troubleshooting. Perhaps the most fundamental are: Building
observability — with both white-box metrics and structured logs — into each
component from the ground up. Designing systems with well-understood and
observable interfaces between components”.
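To make this concrete, here is a minimal sketch (my own illustration, not taken from the book) of what “building observability in from the ground up” can look like: a component that emits structured, machine-parsable logs and keeps a simple white-box counter of its own outcomes. The component name and log fields are hypothetical.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout-service")  # hypothetical component name

# White-box metric: the component counts its own outcomes.
REQUEST_COUNTS = {"ok": 0, "error": 0}

def handle_request(request_id: str) -> None:
    start = time.monotonic()
    status = "error"
    try:
        # ... real work would happen here ...
        status = "ok"
    finally:
        REQUEST_COUNTS[status] += 1
        # Structured log: one JSON event per request, easy to aggregate centrally.
        logger.info(json.dumps({
            "event": "request_handled",
            "request_id": request_id,
            "status": status,
            "latency_ms": round((time.monotonic() - start) * 1000, 2),
        }))
```

The point is not this specific snippet but the discipline: every component reports what it is doing in a form that tools, and other teams, can consume.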
The emphasis on logging and configuration management is obvious throughout the book. There are far too many practical recommendations, such as the use of different verbosity levels, to be reproduced here in this summary. Some of them I learned many years ago at Bellcore, when working with true distributed system experts such as my friend Quoc-Bao Nguyen. Being able to debug, re-parameterize and re-configure without shutting the system down has been a good practice for decades. On the other hand, distributed issues that used to be very hard to solve, such as distributed locking, are now easier because of the wealth of scalable and robust open-source solutions. Still, one must understand the complex nature of distributed systems and learn about tested protocols:
“Whenever you see leader election,
critical shared state, or distributed locking, we recommend using distributed
consensus systems that have been formally proven and tested thoroughly.
Informal approaches to solving this problem can lead to outages, and more
insidiously, to subtle and hard-to-fix data consistency problems that may
prolong outages in your system unnecessarily”.
The same remark can be made about the complexity of distributed
large-scale storage. One must understand and accept the CAP theorem and learn to live with eventual consistency (or learn to live with right-time versus real-time): “A growing number of distributed
datastore technologies provide a different set of semantics known as BASE
(Basically Available, Soft state, and Eventual consistency). Datastores that
support BASE semantics have useful applications for certain kinds of data and
can handle large volumes of data and transactions that would be much more costly,
and perhaps altogether infeasible, with datastores that support ACID semantics”.
As mentioned earlier, the book is full of examples of self-adaptive
mechanisms to make distributed systems more reliable. Techniques that are
heavily used in network management, such as exponential decay, obviously have their place at the application level. Throttling
is a good example of self-adaptation: “Client-side
throttling addresses this problem. When a client detects that a significant
portion of its recent requests have been rejected due to “out of quota” errors,
it starts self-regulating and caps the amount of outgoing traffic it generates.
Requests above the cap fail locally without even reaching the network”.
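As an illustration, here is a minimal sketch of one such client-side adaptive throttling scheme (inspired by the description above, not Google’s actual code); the sliding-window size and the multiplier k are assumptions.

```python
import random
import time
from collections import deque

class AdaptiveThrottler:
    """Client-side throttling sketch: when the backend rejects a significant
    share of recent requests, start failing some requests locally."""

    def __init__(self, k: float = 2.0, window_s: float = 120.0):
        self.k = k                  # how aggressively the client keeps probing the backend
        self.window_s = window_s    # sliding history window, in seconds
        self.history = deque()      # (timestamp, accepted_by_backend)

    def _trim(self) -> None:
        cutoff = time.monotonic() - self.window_s
        while self.history and self.history[0][0] < cutoff:
            self.history.popleft()

    def allow_request(self) -> bool:
        """False means: fail locally, without even reaching the network."""
        self._trim()
        requests = len(self.history)
        accepts = sum(1 for _, ok in self.history if ok)
        # Local rejection probability grows as the backend rejects more traffic.
        p_reject = max(0.0, (requests - self.k * accepts) / (requests + 1))
        return random.random() >= p_reject

    def record(self, accepted_by_backend: bool) -> None:
        self.history.append((time.monotonic(), accepted_by_backend))
```

With k = 2 the client allows itself roughly twice as many requests as the backend has recently accepted, so it quickly notices when the backend recovers while still capping outgoing traffic during an overload.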
2.3 Distributed Systems Reliability at Google
Google has developed a wealth of know-how on distributed systems
reliability. The goal of this book is to share some of that knowledge because
it is widely applicable in a context larger than Google or the web community.
Reliability is not a set of simple recipes; it is a discipline of redundant
protection through multiple layers: “Given the many ways data can be lost (as
described previously), there is no silver bullet that guards against the many
combinations of failure modes. Instead, you need defense in depth. Defense in
depth comprises multiple layers, with each successive layer of defense
conferring protection from progressively less common data loss scenarios”. Some
of the techniques, such as idempotent
scripts, are now well understood in the DevOps community, but this book
offers a comprehensive survey which should be useful to most of us who are not
true geeks.
The model of the SRE team is to build a small team of highly trained ops
specialists with enough software engineering expertise to understand and manage
the systems: “Ultimately, SRE’s goal is to follow a similar
course. An SRE team should be as compact as possible and operate at a high
level of abstraction, relying upon lots of backup systems as failsafes and
thoughtful APIs to communicate with the systems” … “In order to work at scale, teams must be
self-sufficient. Release engineering has developed best practices and tools
that allow our product development teams to control and run their own release
processes”. This dual DevOps capability is critical to perform “ProdTests” (tests on the production environment). An SRE team should be autonomous in its decision process, but it works together with the rest of the organization (SRE teams supplement the engineering organization; they do not replace it).
How does one develop these highly trained specialists? By doing and
mostly by learning from previous failures: “There is no better way to learn than
to document what has broken in the past. History is about learning from
everyone’s mistakes. Be thorough, be honest, but most of all, ask hard
questions”. The customer focus is critical and visible throughout the book. SREs must understand the user perspective and take extra care of customer-facing systems: “The frontend infrastructure consists of
reverse proxy and load balancing systems running close to the edge of our
network. These are the systems that, among other things, serve as one endpoint
of the connections from end users (e.g., terminate TCP from the user’s
browser). Given their critical role, we engineer these systems to deliver an
extremely high level of reliability.”
One key responsibility of SREs is to assist release engineering, that
is, how to test and release a new version of a software component or service.
There is a lot of emphasis on gradual releases, which goes hand in hand with
continuous release: “Almost all updates
to Google’s services proceed gradually, according to a defined process, with
appropriate verification steps interspersed. A new server might be installed on
a few machines in one datacenter and observed for a defined period of time. If
all looks well, the server is installed on all machines in one datacenter,
observed again, and then installed on all machines globally.”
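The release process quoted above can be sketched as a simple staged-rollout loop. This is a hedged illustration: the stages, soak times and the deploy/healthy/rollback helpers are hypothetical stand-ins for real release tooling.

```python
import time

# Hypothetical stages: a few machines, then one datacenter, then everywhere.
STAGES = [
    {"name": "canary", "fraction": 0.01, "soak_minutes": 60},
    {"name": "one-datacenter", "fraction": 0.10, "soak_minutes": 120},
    {"name": "global", "fraction": 1.00, "soak_minutes": 240},
]

def gradual_release(version, deploy, healthy, rollback) -> bool:
    """Deploy `version` stage by stage; abort and roll back on any regression."""
    for stage in STAGES:
        deploy(version, stage["fraction"])       # push to a subset of machines
        time.sleep(stage["soak_minutes"] * 60)   # observe for a defined period
        if not healthy(version):                 # verification step (SLO / error budget check)
            rollback(version)
            return False
    return True
```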
2.4 Automation and Monitoring
As the book points out, the simplest and most powerful principles behind
reliability engineering are automation and monitoring. Automation is the
natural solution to the efficiency/sublinearity requirement made earlier, and the best way to remove a large share of the human errors that are still the root cause of most system failures. “There’s an
additional benefit for systems where automation is used to resolve common
faults in a system (a frequent situation for SRE-created automation). If
automation runs regularly and successfully enough, the result is a reduced mean
time to repair (MTTR) for those common faults.” The history of SRE, as told by Ben Treynor, starts with the automation of operations by software engineers. Automation is a continuous task, not a once-and-for-all milestone. “Automation code, like unit test code, dies
when the maintaining team isn’t obsessive about keeping the code in sync with
the codebase it covers.”
Operations need to be automated and monitored. Monitoring is, as noted earlier when we talked about the necessity of observation, at the heart of reliable operations. “Whether it is at
Google or elsewhere, monitoring is an absolutely essential component of doing
the right thing in production. If you can’t monitor a service, you don’t know
what’s happening, and if you’re blind to what’s happening, you can’t be
reliable.” Monitoring and automation go hand in hand. Alerts produced by monitoring should trigger actions that are as automated as possible (back to the autonomic computing ambition); human intervention should be seen as the last-resort option. Automation should come with heavy instrumentation to feed the monitoring and ensure proper execution (automation is no guarantee against human errors at first; only continuous improvement produces fail-proof scripts). “Running a service with a team
that relies on manual intervention for both change management and event
handling becomes expensive as the service and/or traffic to the service grows,
because the size of the team necessarily scales with the load generated by the
system.”
The regular practice of logging and monitoring provides massive opportunities for using Artificial Intelligence and Machine Learning for operations. Using algorithms ranging from classical time series data mining (e.g., with Splunk) to more advanced machine learning techniques to detect correlations and patterns, one may generate early alerts when the system requires tuning to avoid entering a hazard zone in the future. Automation and monitoring act on the two components of reliability: increasing MTBF (fewer errors thanks to automation, early detection and prevention with monitoring) and reducing MTTR (better diagnosis with monitoring and faster repair through automation).
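This last point can be made concrete with the classical steady-state availability formula, availability = MTBF / (MTBF + MTTR); the numbers below are illustrative only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative numbers: with one failure a month (MTBF = 720 h), cutting MTTR
# from 2 hours to 30 minutes through automation moves availability from
# roughly 99.72% to roughly 99.93%.
print(availability(mtbf_hours=720, mttr_hours=2.0))   # ~0.99723
print(availability(mtbf_hours=720, mttr_hours=0.5))   # ~0.99931
```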
2.5. The Most Common Cause of Loss of Service Availability is Change
The SRE team at Google has found that roughly 70% of outages are due to
changes in a live system: “Most outages are caused by some kind of change
— a new configuration, a new feature launch, or a new type of user traffic —
the two teams’ goals are fundamentally in tension”. This tension between
the need for change (and for continuous delivery) and the need for reliable
operations is wonderfully covered by the great book “Accelerate:
The Science of Lean Software and DevOps: Building and Scaling High Performing
Technology Organizations”, by Nicole Forsgren, Jez Humble and Gene Kim. The only path forward is to automate change and, at the same time, treat it as seriously as possible. Companies that succeed in delivering reliable, highly available quality of service are those that perform frequent changes.
API management is a core
dimension of modern system engineering, including from an operations
perspective as is illustrated many times in this book. API modularity is
critical for keeping systems simple. API change management is equally critical and helps to understand why reliability engineering and software engineering are mutually dependent: “While
the modularity that APIs offer may seem straightforward, it is not so apparent
that the notion of modularity also extends to how changes to APIs are
introduced. Just a single change to an API can force developers to rebuild
their entire system and run the risk of introducing new bugs.”
A key component of change management is capacity planning. This topic is covered in depth in this book, with a systemic vision (not one platform at a time, but a global
perspective). “Good capacity
planning can reduce the probability that a cascading failure will occur.
Capacity planning should be coupled with performance testing to determine the
load at which the service will fail.” System capacity planning is a complex modeling
and engineering task that may leverage techniques from operations research (cf.
the Auxon solver that formulates a giant mixed-integer or linear program
based upon the optimization request received from the Configuration Language Engine): “At Google, many teams have moved to an approach we call
Intent-based Capacity Planning. The basic premise of this approach is to
programmatically encode the dependencies and parameters (intent) of a service’s
needs, and use that encoding to autogenerate an allocation plan that details
which resources go to which service, in which cluster.”
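To make the “giant mixed-integer or linear program” a little less abstract, here is a toy allocation written as a linear program with scipy (purely illustrative, not Auxon): two services with minimum demands are placed in two clusters with capacity limits, minimizing a made-up cost.

```python
import numpy as np
from scipy.optimize import linprog

# Decision variables x = [A@c1, A@c2, B@c1, B@c2]: capacity of each service in each cluster.
cost = np.array([1.0, 1.5,    # service A in cluster 1 / cluster 2 (made-up unit costs)
                 2.0, 1.2])   # service B in cluster 1 / cluster 2

# Each service must get at least its demand; each cluster has a capacity limit.
A_ub = np.array([
    [-1, -1,  0,  0],   # -(A@c1 + A@c2) <= -100  (service A demand)
    [ 0,  0, -1, -1],   # -(B@c1 + B@c2) <= -80   (service B demand)
    [ 1,  0,  1,  0],   #  A@c1 + B@c1   <= 120   (cluster 1 capacity)
    [ 0,  1,  0,  1],   #  A@c2 + B@c2   <= 150   (cluster 2 capacity)
])
b_ub = np.array([-100.0, -80.0, 120.0, 150.0])

result = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 4, method="highs")
print(result.x)  # optimal placement, here [100, 0, 0, 80]
```

The real thing handles thousands of services, priorities and failure domains, but the structure is the same: encode intent as constraints and let a solver produce the plan.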
Capacity planning needs to encapsulate reliability engineering and propose designs that can operate under the simultaneous occurrence of planned and unplanned outages. This type of thinking makes “N+2” configurations a
classical pattern to follow. Capacity planning should ensure that an acceptable
service may still be delivered when the two largest instances are unavailable.
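A minimal sketch of that N+2 rule (my own illustration, not the book’s code): check that peak load still fits after removing the two largest instances, whether they are lost to planned maintenance or to an unplanned outage.

```python
def n_plus_2_ok(instance_capacities: list[float], peak_load: float) -> bool:
    """True if the service still fits after losing its two largest instances."""
    surviving = sorted(instance_capacities)[:-2]  # drop the two largest
    return sum(surviving) >= peak_load

# Illustrative: five clusters of 40 units each, serving a peak of 110 units.
print(n_plus_2_ok([40, 40, 40, 40, 40], peak_load=110))  # True:  3 * 40 = 120 >= 110
print(n_plus_2_ok([40, 40, 40, 40, 40], peak_load=130))  # False: 120 < 130
```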
Similarly, capacity planning needs to be mixed with load balancing policies. The
book covers load balancing policies (such as Weighted Round Robin) in depth
because they are a critical part of QoS: “Avoiding
overload is a goal of load balancing policies. But no matter how efficient your
load balancing policy, eventually some part of your system will become
overloaded. Gracefully handling overload conditions is fundamental to running a
reliable serving system.” Queue management policies (FIFO, LIFO, CoDel, …) similarly have a key impact on QoS, which is precisely the topic of the OAI simulations that I ran 10 years ago (cf. “Self-adaptive and self-healing message passing strategies for process-oriented integration infrastructures”).
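For readers unfamiliar with the policy mentioned above, here is a naive Weighted Round Robin sketch (a generic textbook version, not the book’s exact algorithm): backends with a higher weight, for instance more capacity or better recent health, receive proportionally more requests.

```python
import itertools

def weighted_round_robin(weights: dict[str, int]):
    """Yield backend names in proportion to their integer weights."""
    expanded = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)

picker = weighted_round_robin({"backend-a": 3, "backend-b": 1})
print([next(picker) for _ in range(8)])
# ['backend-a', 'backend-a', 'backend-a', 'backend-b',
#  'backend-a', 'backend-a', 'backend-a', 'backend-b']
```

Production implementations interleave the picks more smoothly and adjust the weights dynamically, but the proportionality idea is the same.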
2.6 The Practice of the “Error Budget”
The concept of the “error budget” is a key contribution of Google SRE
practice, where the operations organization finds a “QoS homeostasis” by balancing its operations requirements with the current state of its services. The aversion to change and the rigor of pre-release checks are not cast in stone but adapt to the current quality of service as seen by the user. The
SRE team “balances reliability and the
pace of innovation with error budgets (see “Motivation for Error Budgets”), which
define the acceptable level of failure for a service, over some period. …As
long as the service hasn’t spent its error budget for the month through the
background rate of errors plus any downtime, the development team is free
(within reason) to launch new features, updates, and so on … If the error budget is spent, the service
freezes changes (except urgent security and bug fixes addressing any cause of
the increased errors) until either the service has earned back room in the
budget, or the month resets.”
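A back-of-the-envelope sketch of the mechanism (numbers purely illustrative): a 99.9% availability SLO over a 30-day month leaves about 43 minutes of error budget, and launches stay unblocked only while the observed downtime fits within it.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Downtime allowed by the SLO over the period, in minutes."""
    return (1.0 - slo) * days * 24 * 60

def launches_allowed(slo: float, observed_downtime_minutes: float, days: int = 30) -> bool:
    """Feature launches are frozen once the observed downtime exceeds the budget."""
    return observed_downtime_minutes <= error_budget_minutes(slo, days)

print(error_budget_minutes(0.999))                              # 43.2 minutes per 30-day month
print(launches_allowed(0.999, observed_downtime_minutes=12.0))  # True:  budget left, keep launching
print(launches_allowed(0.999, observed_downtime_minutes=50.0))  # False: freeze changes
```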
The error budget must be managed from a customer-centric perspective. The
“unavailability budget” is computed as the outage time that is seen by the end
user: “Measuring error rates and latency
at the Gmail client, rather than at the server, resulted in a substantial
reduction in our assessment of Gmail availability, and prompted changes to both
Gmail client and server code. The result was that Gmail went from about 99.0%
available to over 99.9% available in a few years.”
There is an implicit but crucial point here: the outage/failure/error is no longer a “bad thing” that must be avoided at all costs; it is an “expected part of the process of innovation”, something that must be managed rationally. Launches are necessary, but the SRE team has multiple ways of controlling the launch process: “Google
defines a launch as any new code that introduces an externally visible change
to an application. Depending on a launch’s characteristics — the combination of
attributes, the timing, the number of steps involved, and the complexity — the
launch process can vary greatly. According to this definition, Google sometimes
performs up to 70 launches per week.”
Among the variables that can be acted upon, the SRE team must schedule the planned downtimes and the disaster
recovery tests. Google spends a lot of energy on simulations of crashes and
live drills. As we shall see later, the only way to develop and certify
recovery capabilities is to test them regularly. Living with a failure is a
part of what it means to run operations. As pointed out in the book, most people’s reaction when facing a failure is to start troubleshooting instantly and spend all their energy finding a root cause as quickly as possible. The proper course of action (as with a human accident) is first to make the system work as well as possible in a degraded mode. This is the type of behavior that you can only get with practice. Once again, failures are a natural occurrence in large-scale complex systems. Minimizing the impact
of these failures is the role of reliability engineering.
2.7 Technical Debt and Recovery Capabilities
Complex systems that evolve constantly are bound to accumulate unnecessary weight and complexity, which should be assessed and removed periodically. This is the principle of “taking constant care of one’s garden” (removing weeds, pruning, digging, …), that is, removing “technical debt” in information systems parlance: “Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.” This is also why the authors warn us against “over-engineering” reliability: “that would waste opportunities to add features to the system, clean up
technical debt, or reduce its operational costs.”
When dealing with large-scale systems, data engineering becomes a critical skill. In the past 20 years I
have lived through a number of very significant outages. Throughout these (tough) experiences, I have noticed that the time to recover, move or install new data sets is always much longer than planned. In most cases, the recovery plan is executed (hence the disaster is averted) but the total recovery time is longer than what was planned initially, because unit operations take longer. Google reports similar stories in this book: “Processes and practices applied to volumes
of data measured in T (terabytes) don’t scale well to data measured in E
(exabytes). Validating, copying, and performing round-trip tests on a few
gigabytes of structured data is an interesting problem.”
A fair number of pages deals with backup and recovery. Obviously, what matters is recovery; backups are just a tool. As a subcomponent of disaster recovery,
data recovery must be tested regularly. Designing data recovery is a hard
problem. More generally, designing reliable distributed systems is hard. As
mentioned earlier, the book points out the necessity of deep system skills and
expertise: “Today, we hear a brazen
culture of “just show me the code.” A culture of “ask no questions” has grown
up around open source, where community rather than expertise is championed.
Google is a company that dared to think about the problems from first
principles, and to employ top talent with a high proportion of PhDs.”
The proposed approach is to first create SRE teams with the proper mix
of expertise: “To this end, Google always strives to staff its SRE
teams with a mix of engineers with traditional software development experience
and engineers with systems engineering experience.” Then for these teams to succeed, it is critical that the
senior management recognizes reliability and quality of service as a strategic
imperative for the company. In all companies that I know, there are many anonymous “operations heroes” who keep the systems running as well as possible. What distinguishes digital champions such as Google is the recognition that these individuals and teams play a major role in the company’s value creation and deserve proper recognition: “Another way to
get started on the path to improving reliability for your organization is to
formally recognize that work, or to find these people and foster what they do —
reward it”.
2.8 SRE : a Team with End-to-End Responsibility About Operations
SREs are hybrid production teams, which combine ops and software
engineering capabilities (they aim at making Google systems run themselves)
with a dual ambition of customer satisfaction and continuous innovation. “SRE is concerned with several aspects of a service, which are
collectively referred to as production. These aspects include the following:
System architecture and interservice dependencies; Instrumentation, metrics,
and monitoring; Emergency response, Capacity planning, Change management; Performance: availability, latency, and
efficiency”.
SRE teams follow many of the DevOps principles (software engineering and
operations skills working as a team, involvement of IT function in each phase
of a system’s design and development, heavy reliance on automation, …). SRE
teams are associated with a service (or a service domain): they own the responsibility of running this service: “In
general, an SRE team is responsible for the availability, latency, performance,
efficiency, change management, monitoring, emergency response, and capacity
planning of their service(s).”
Somewhere in the book, the authors notice that most of the concepts, principles and techniques reported here are not special but rather part of well-accepted, state-of-the-art practice. However, they also notice that a surprising number of ops teams do not take these practices, such as capacity planning, seriously, resulting in unnecessary failures.
As stated earlier, SREs make heavy use of blameless postmortems as a tool for training and evangelization. “Google’s Postmortem Philosophy: The primary
goals of writing a postmortem are to ensure that the incident is documented,
that all contributing root cause(s) are well understood, and, especially, that
effective preventive actions are put in place to reduce the likelihood and/or
impact of recurrence.” Blameless here means that the root cause analysis focuses “on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior”. Postmortems are not meant to be kept within the SRE team but to be shared as widely as possible in the company.
Another key ritual of SRE is the production meeting, where SRE team members orchestrate the necessary knowledge sharing between all concerned stakeholders (including product owners and software architects): “In general, these meetings are service-oriented; they are not directly
about the status updates of individuals. The goal is for everyone to leave the
meeting with an idea of what’s going on — the same idea. The other major goal
of production meetings is to improve our services by bringing the wisdom of
production to bear on our services”. A key role in the SRE team is the Launch Coordination Engineer, which requires “strong communication and leadership skills” to bring everyone together. In a true DevOps spirit, when the operational load is too heavy, product development teams should contribute so that SREs keep a balance between incident management and continuous improvement. System architects should attend these production meetings regularly to collect up-to-date availability, latency and throughput figures, which are absolutely necessary for reliable system engineering. Architecture
schemas that are purely functional lead to disastrous failures in operations.
2.9 The Lean and TQM roots of Google SREs
Throughout the book, the influence of Total Quality Management (e.g.,
Deming) and Lean Management is quite visible. There is a constant effort to
organize work so as to avoid overloads and work with a regular activity flow. W. Edwards Deming’s famous quote “cherish your mistakes” finds many echoes here: “Google operates under a blame-free
postmortem culture, with the goal of exposing faults and applying engineering
to fix these faults, rather than avoiding or minimizing them.”
Running complex operations, especially disaster recovery and incident management, is more efficient with a playbook (a lean “standard”): “When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”” Another lean influence is the importance of “system understanding” and making sure that indicators are picked sparingly and wisely. What
matters here is the ability for the team members to evaluate and reason
properly about a system’s health: “Ineffective troubleshooting sessions
are plagued by problems at the Triage, Examine, and Diagnose steps, often
because of a lack of deep system understanding.”
The system engineering culture has many aspects in common with Lean Six Sigma, namely the heavy use of statistics, including the use of distributions rather than mean values: “We generally
prefer to work with percentiles rather than the mean (arithmetic average) of a
set of values. Doing so makes it possible to consider the long tail of
data points, which often have significantly different (and more interesting) characteristics
than the average”. Performance and reliability engineering require taking
more than the average case into account.
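A quick numerical illustration of the point (made-up latency samples): the mean hides the slow requests that the high percentiles expose.

```python
import numpy as np

# Made-up latency samples in ms: most requests are fast, a few are very slow.
latencies = np.array([20] * 97 + [800, 1200, 2000])

print(latencies.mean())              # 59.4 ms: looks comfortable
print(np.percentile(latencies, 50))  # 20.0 ms: the typical request
print(np.percentile(latencies, 99))  # ~1208 ms: the tail that users actually feel
```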
Another lean trait is the search for constant elimination of dead weight
(muda in the lean
sense), which we mentioned earlier: “The
term “software bloat” was coined to describe the tendency of software to become
slower and bigger over time as a result of a constant stream of additional
features.” “Software bloating” or
“feature creep” is a common
plague of software systems that is a clear instance of the Tragedy of the
Commons. Adding more features always comes from a local view, where the benefit of the new feature is amplified and
the systemic cost of adding to the whole system is reduced if not ignored. To encourage
the team to adopt a “one system”
mindset forces to re-evaluate with less emphasis on the new feature and more on
the global resource consumption. Many resources such as available time (in an
ops planning), bandwidth, or memory capacity lead to a “tragedy of the commons” situation where the sum of “small local
incremental usages” translates into a global over-consumption.
3. Conclusion
Although this book review is actually quite shallow compared to the
depth of the book, it is already substantial compared to what people expect
when discussing digital transformation. My intent here is to underline the importance
of execution with any digital transformation. As stated in the introduction and
wonderfully explained in “The Web Giants”, quality of service is a crucial part
of the customer experience. A company cannot be a digital leader without
mastering the skills and the practices that are described in this book. The
illustration on the right is taken from the Twitter launch of the NATF
report on Artificial Intelligence and Machine Learning. What it says is
that the Google SRE book should be part of the skill set of any company that aims to develop a leading service based on data and AI. Engineering matters.
Site Reliability Engineering is foremost an ambition about people and
culture. Here is a very short summary of an ideal “End-to-End” operations organization:
- It is made of small autonomous teams that orchestrate, for instance through production meetings as well as postmortems, all stakeholders involved in the delivery of one (end-to-end) service.
- The team owns the responsibility for this service: availability (reliability), latency & throughput, capacity planning, change management, monitoring, incident management and recovery.
- This should be a customer-centric organization which is both measure-driven (analytics) and adaptive (flexible). Its operations guidelines should fluctuate, using an “error budget”, to deliver stable quality of service and “change throughput” at the same time.
- This team is necessarily “progress-centric” and operates under continuous improvement using postmortems and root cause analysis. The team has the diverse skill set to own and understand the complexity of the systems that it operates.
- The organization strives towards as much automation as possible. This requires both enrolling software engineering skills and letting those engineers operate on production systems.
This book is mostly about reliability engineering but, as shown in the previous section, it is also a textbook about distributed systems engineering. Here is a short set of principles for building reliable distributed systems that I have extracted from this reading and which I will comment on in a future post (the reference to “Biology of Distributed Information Systems” – the title chosen for this blog 10 years ago – is quite deep):
- Leverage APIs to develop a multi-modal cell structure with variable rate of change, when change flows from the outside towards the inside of the whole system (like a cell)
- Organize complexity in layers where advanced systems are backed up by simpler “life-support” systems, the lower the complexity the higher the availability (biomimicry)
- Develop an event-driven flow architecture to design reactive systems that are scalable, reliable and open
- Use self-monitoring and “digital twin” to provide self-adaptation, self-optimization and self-healing
- Journey towards abstraction to move to “serverless” systems (necessary to combine SaaS, cloud and on-premise operations)
- Leverage lean thinking to deliver robustness and flexibility through more available capacity
- Don’t fight CAP (theorem) and develop a reliable, high-availability, eventually consistent data architecture (think of data consistency as a movie rather than a snapshot).