1. Introduction
This post looks at Digital Transformation from an
operational excellence perspective. I propose a simplified book review of “the
textbook” on reliability engineering
as a way to drive home some key principles that make digital systems efficient.
There should be no surprises with this approach, since I have been advocating
for a while in my blogs that:
- Digital transformation is about digital capability
- Customer satisfaction is first about operational excellence
- The playground has changed - the GAFA have set new standards for operational excellence, as explained by the Web Giants book.
- I was privileged to get my first exposure to availability lectures as a teenager from my late father Paul Caseau, on topics such as queuing theory, Jackson networks and Markov processes. I was introduced to MTBF and MTTR when still very young, through practical examples such as shoes.
- I started my career as an Operations Research scientist with many years spent on scheduling, planning and routing, with the later addition of stochastic optimization.
- I then got my first exposure to real-life QoS issues as Bouygues Telecom CIO 15 years ago. This resulted in attempts to formalize what I saw: in books, in a piece of simulation and research about SLAs called “Optimization of Application Integration”, and later in the opening of this blog. The paper “Self-adaptive Middleware: Supporting Business Processes and Service Level Agreements” is very similar to what this post will describe.
- I had the remarkable opportunity of a few in-depth conversations with key SRE people at Google including world-class stars that are quoted in the book that I will present today.
- Last, my experience as a lecturer at Ecole Polytechnique has helped me formalize what I have learned from the trenches.
I plan to share a detailed summary of the “Google
SRE” book because most of what I have learned and know from experience may be
found in this book. On the one hand, this is a book about large-scale distributed systems and how to design and operate them
in a reliable manner. This is a complicated topic and there is a wealth of
knowledge to learn and share. On the other hand, the technology “lego box” has
changed : things that were very hard to design 10 years ago are much easier
today thanks to open source technologies such as cloud, containers, distributed
orchestration, distributed storage platforms, etc. High-availability is no
longer a “high end” feature and many small companies including startups build
high-availability architectures with QoS performance that would have been hard
to achieve for a telco 20 years ago. Many of the open source pieces that
created this “reliability revolution” come from Google components that are
described in the SRE book. Readers of this blog who know about “Autonomic Computing”
will notice that Google has done nothing less than deliver on an ambition
proposed by IBM over 15 years ago: self-monitoring,
self-optimization, self-provisioning and self-healing are characteristics of
modern reliable systems.
This is not a linear book review, because the material
is very deep (the book is 475 pages long). I selected nine key topics that are
closely related to digital transformation and to my experience as a CIO; then I
tried to summarize the key insights from the book. Managers are not the main
audience for this book, which is foremost written for practitioners. For
technical readers this book is a treasure trove of insights, examples, and
practical advice. However, I strongly suggest that everyone open this book and read
at least the first hundred pages.
2. Google Site Reliability Engineering
The book “Site Reliability Engineering - How Google runs production systems”,
edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy,
is a collective book from Google SRE members: “This book is a series of essays written by members and alumni of
Google’s Site Reliability Engineering organization. It’s much more like
conference proceedings than it is like a standard book by an author or a small
number of authors”. This heavy book is
filled with case stories and technical details about the different tools that
are used by Google engineers. Although the 33 chapters are mostly focused on
practical issues and problems, this is also a principled book about computing: “We
apply the principles of computer science and engineering to the design and
development of computing systems: generally, large distributed ones”. True
to the principles that I will present very soon, the book is rich with failure
analysis and “anatomy of unmanaged incident” examples.
The relationship with the theme of this blog, “biology
of distributed information systems”, will become self-evident when you read this
post. As mentioned earlier, the ambition of “autonomic computing” is also deeply embedded: “The global computer … must be self-repairing to operate once it grows
past a certain size, due to the essentially statistically guaranteed large
number of failures taking place every second”.
2.1. Information System Reliability is a Strategic Imperative
The book starts with the observation that the real world of information
systems is chaotic, both because of the size and the complexity, but also
because of the astounding number of changes that affect the systems daily. The
first chapter is written by Ben Treynor Sloss, Google’s VP for 24/7 Operations,
originator of the term SRE, who claims that reliability is the most fundamental feature of any product. Ben
Treynor is the father of SRE: “SRE is
what happens when you ask a software engineer to design an operations team.
When I joined Google in 2003 and was tasked with running a “Production Team” of
seven engineers, my entire life up to that point had been software engineering”.
SRE is the story of a quest towards reliability through automation and
simplification. He quotes C.A.R. Hoare in his Turing Award lecture: “The price of reliability is the pursuit of
the utmost simplicity”.
SRE is the brainchild of a scientist and engineer who looks at failure
and reliability with cold unemotional eyes and tries to maximize the outcome
while reducing the effort. By construction it is a “DevOps” approach where the
skills of software engineering are applied to improving operations. Obviously,
this approach draws from the scaling and size issues experienced at Google
over the years: “Ensuring that the cost of maintenance scales sublinearly
with the size of the service is key to making monitoring (and all sustaining
operations work) maintainable”.
SREs (Site Reliability Engineers) work with the complete product team,
from product owners and development teams to operations specialists and
partners. Following engineering practices, business and user goals are
translated into explicit and measurable goals that can be “engineered”. With no
surprise to readers of this blog, key metrics that are used throughout the book
are availability, latency and throughput
: “User-facing
serving systems … generally care about availability, latency, and throughput”.
As an engineer, one knows that a price must be paid for any complex
requirement, including performance and reliability. Failures and mistakes are
managed in a cold engineering approach, without exaggerating the requirements.
In many places, the book warns us against “over-safe,
over-expensive” approaches and urges us to keep looking for a balance.
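As an illustration, here is a minimal sketch (mine, not the book’s; the log entries and numbers are invented) of how these three user-facing metrics can be computed from a simple request log:

```python
# Hypothetical request log: (latency in ms, HTTP status) per request over a 1-second window.
requests = [(120, 200), (80, 200), (450, 200), (95, 500), (60, 200), (300, 200)]

# Availability: fraction of requests served without a server-side error.
availability = sum(1 for _, status in requests if status < 500) / len(requests)

# Latency: a high percentile says more than the mean (see also section 2.9).
latencies = sorted(ms for ms, _ in requests)
p95 = latencies[int(0.95 * (len(latencies) - 1))]

# Throughput: requests handled per second over the observation window.
throughput = len(requests) / 1.0

print(f"availability={availability:.3f}, p95 latency={p95} ms, throughput={throughput:.0f} req/s")
```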
2.2. Distributed Systems
As mentioned in the introduction, this is a book about distributed
systems: “As SREs, we work with
large-scale, complex, distributed systems.” Distributed systems engineering
is a wonderfully exciting discipline, albeit a difficult one. It takes time,
energy and dedication to become an expert in such matters. It also takes
humility and curiosity. “Systems are complex. It’s quite likely that
there are multiple factors, each of which individually is not the cause, but
which taken jointly are causes. Real systems are also often path-dependent, so
that they must be in a specific state before a failure occurs.”
The emphasis on “distributed” here means that one must learn to step
back and look at the whole system, not the platform or the component that is
actively being built. One of my key actions as a manager is to promote the “One
System” culture, which is the understanding that we (as an organization) are
building ONE large, complex and distributed system. This requires observing and
sharing : “There are many ways to
simplify and speed troubleshooting. Perhaps the most fundamental are: Building
observability — with both white-box metrics and structured logs — into each
component from the ground up. Designing systems with well-understood and
observable interfaces between components”.
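To make this concrete, here is a minimal sketch of what building observability into a component “from the ground up” can look like; the component name, fields and metrics are illustrative, not taken from the book:

```python
import json
import logging
import time
from collections import Counter

# White-box metrics: simple in-process counters that the component exposes itself.
METRICS = Counter()

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")  # hypothetical component name

def handle_request(request_id: str, user_id: str) -> None:
    start = time.monotonic()
    METRICS["requests_total"] += 1
    try:
        pass  # the actual business logic would go here
    except Exception as exc:
        METRICS["requests_failed"] += 1
        outcome, level = f"error: {exc}", logging.ERROR
    else:
        outcome, level = "ok", logging.INFO
    # Structured log: one machine-parsable JSON object per event.
    logger.log(level, json.dumps({
        "event": "handle_request",
        "request_id": request_id,
        "user_id": user_id,
        "outcome": outcome,
        "latency_ms": round((time.monotonic() - start) * 1000, 3),
    }))

handle_request("req-1", "user-42")
```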
The emphasis about logging and configuration management is obvious
throughout the book. There are far too many practical recommendations, such as
the use of different verbosity levels, to be reproduced here in this summary.
Some of them I learned many years ago at Bellcore,
when working with true distributed systems experts such as my friend Quoc-Bao Nguyen. Being
able to debug, re-parameterize and re-configure a system without shutting it down has
been good practice for decades. On the other hand, distributed issues that
used to be very hard to solve, such as distributed locking, are now easier because
of the wealth of scalable and robust open-source solutions. Still, one must
understand the complex nature of distributed systems and learn about tested protocols:
“Whenever you see leader election,
critical shared state, or distributed locking, we recommend using distributed
consensus systems that have been formally proven and tested thoroughly.
Informal approaches to solving this problem can lead to outages, and more
insidiously, to subtle and hard-to-fix data consistency problems that may
prolong outages in your system unnecessarily”.
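By way of illustration, here is a minimal sketch of what “delegating to a proven consensus system” can look like, using ZooKeeper through the kazoo client library (the host, path and critical section are placeholders of mine):

```python
from kazoo.client import KazooClient

def run_monthly_billing():
    """Hypothetical critical section that must not run on two workers at once."""
    print("running billing report")

# Connect to a ZooKeeper ensemble (placeholder address); ZooKeeper's consensus
# protocol provides the formally studied locking layer instead of an ad hoc scheme.
zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

lock = zk.Lock("/locks/billing-report", identifier="worker-42")
with lock:          # blocks until the lock is acquired, releases it on exit
    run_monthly_billing()

zk.stop()
```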
The same remark can be made about the complexity of distributed
large-scale storage. One must understand and accept the CAP theorem and learn to
live with eventual consistency (or learn to live with right-time versus real-time)
: “A growing number of distributed
datastore technologies provide a different set of semantics known as BASE
(Basically Available, Soft state, and Eventual consistency). Datastores that
support BASE semantics have useful applications for certain kinds of data and
can handle large volumes of data and transactions that would be much more costly,
and perhaps altogether infeasible, with datastores that support ACID semantics”.
As mentioned earlier, the book is full of examples of self-adaptive
mechanisms to make distributed systems more reliable. Techniques that are
heavily used in network management such as exponential decay have obviously
their place at the application level. Throttling
is a good example of self-adaptation: “Client-side
throttling addresses this problem. When a client detects that a significant
portion of its recent requests have been rejected due to “out of quota” errors,
it starts self-regulating and caps the amount of outgoing traffic it generates.
Requests above the cap fail locally without even reaching the network”.
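Here is a minimal sketch in the spirit of the adaptive throttling scheme that the book describes in its chapter on handling overload: the client rejects requests locally with probability max(0, (requests − K·accepts) / (requests + 1)). Everything else (no sliding window, hypothetical RPC stub) is simplified:

```python
import random

class AdaptiveThrottler:
    """Client-side throttling sketch (illustrative, not Google's code)."""

    def __init__(self, k: float = 2.0):
        self.k = k          # lower K = more aggressive local throttling
        self.requests = 0   # requests attempted by this client
        self.accepts = 0    # requests accepted by the backend

    def should_reject_locally(self) -> bool:
        # The more the backend refuses, the more traffic we cap at the client.
        p_reject = max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))
        return random.random() < p_reject

    def record(self, accepted: bool) -> None:
        self.requests += 1
        if accepted:
            self.accepts += 1

def call_backend(throttler: AdaptiveThrottler, send) -> str:
    if throttler.should_reject_locally():
        return "rejected locally"          # fail fast, without reaching the network
    accepted = send()                      # hypothetical RPC; False means "out of quota"
    throttler.record(accepted)
    return "accepted" if accepted else "rejected by backend"
```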
2.3 Distributed Systems Reliability at Google
Google has developed a wealth of know-how on distributed systems
reliability. The goal of this book is to share some of that knowledge because
it is widely applicable in a context larger than Google or the web community.
Reliability is not a set of simple recipes; it is a discipline of redundant
protection through multiple layers: “Given the many ways data can be lost (as
described previously), there is no silver bullet that guards against the many
combinations of failure modes. Instead, you need defense in depth. Defense in
depth comprises multiple layers, with each successive layer of defense
conferring protection from progressively less common data loss scenarios”. Some
of the techniques, such as idempotent
scripts, are now well understood in the DevOps community, but this book
offers a comprehensive survey which should be useful to most of us who are not
true geeks.
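To illustrate the idea for non-geeks: an idempotent script can be re-run safely after a partial failure, because each step either checks the state it is about to change or is harmless to repeat. A minimal sketch (the paths and the config line are made up):

```python
import os

def ensure_directory(path: str) -> None:
    """Idempotent step: creating an existing directory is a no-op, not an error."""
    os.makedirs(path, exist_ok=True)

def ensure_config_line(path: str, line: str) -> None:
    """Idempotent step: append a config line only if it is not already present,
    so re-running the script after a partial failure causes no duplication."""
    existing = ""
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read()
    if line not in existing:
        with open(path, "a") as f:
            f.write(line + "\n")

# Running this twice leaves the system in exactly the same state as running it once.
ensure_directory("/tmp/myservice")                          # placeholder path
ensure_config_line("/tmp/myservice/app.conf", "retries=3")  # placeholder config
```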
The model of the SRE team is to build a small team of highly trained ops
specialists with enough software engineering expertise to understand and manage
the systems: “Ultimately, SRE’s goal is to follow a similar
course. An SRE team should be as compact as possible and operate at a high
level of abstraction, relying upon lots of backup systems as failsafes and
thoughtful APIs to communicate with the systems” … “In order to work at scale, teams must be
self-sufficient. Release engineering has developed best practices and tools
that allow our product development teams to control and run their own release
processes”. This dual DevOps capability is critical to perform “ProdTests”
(tests on the production environment). An SRE team should be autonomous in its
decision process, but it works together with the rest of the organization (SRE
teams supplement the engineering organization, they do not replace it).
How does one develop these highly trained specialists? By doing and
mostly by learning from previous failures: “There is no better way to learn than
to document what has broken in the past. History is about learning from
everyone’s mistakes. Be thorough, be honest, but most of all, ask hard
questions”. The customer
focus is critical and visible throughout the book. SRE must understand the user
perspective and take extra care of customer-facing systems : “The frontend infrastructure consists of
reverse proxy and load balancing systems running close to the edge of our
network. These are the systems that, among other things, serve as one endpoint
of the connections from end users (e.g., terminate TCP from the user’s
browser). Given their critical role, we engineer these systems to deliver an
extremely high level of reliability.”
One key responsibility of SREs is to assist release engineering, that
is, how to test and release a new version of a software component or service.
There is a lot of emphasis on gradual releases, which goes hand in hand with
continuous release: “Almost all updates
to Google’s services proceed gradually, according to a defined process, with
appropriate verification steps interspersed. A new server might be installed on
a few machines in one datacenter and observed for a defined period of time. If
all looks well, the server is installed on all machines in one datacenter,
observed again, and then installed on all machines globally.”
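A minimal sketch of such a staged rollout loop; the stage names, soak time, health check and rollback hooks are all placeholders of mine:

```python
import time

STAGES = ["canary (a few machines)", "one datacenter", "all datacenters"]

def gradual_rollout(version: str, deploy, healthy, rollback, soak_seconds: int = 600) -> bool:
    """Roll out `version` stage by stage, observing each stage before widening the blast radius."""
    for stage in STAGES:
        deploy(version, stage)        # install the new server on this stage only
        time.sleep(soak_seconds)      # defined observation period
        if not healthy(stage):        # e.g., error rate and latency still within SLO
            rollback(version, stage)  # stop the rollout and revert
            return False
    return True                       # the version is now installed globally
```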
2.4 Automation and Monitoring
As the book points out, the simplest and most powerful principles behind
reliability engineering are automation and monitoring. Automation is the
natural solution to the efficiency/sublinear requirement made earlier, and the
best way to remove a large part of the human errors that are still the root of
most system failures. “There’s an
additional benefit for systems where automation is used to resolve common
faults in a system (a frequent situation for SRE-created automation). If
automation runs regularly and successfully enough, the result is a reduced mean
time to repair (MTTR) for those common faults.” The history of SRE, as told
by Ben Treynor, comes from the automation of operations by software
engineers. Automation is a continuous task, not a once-for-all milestone. “Automation code, like unit test code, dies
when the maintaining team isn’t obsessive about keeping the code in sync with
the codebase it covers.”
Operations need to be automated and monitored. Monitoring is, as noted
earlier when we talked about the necessity of observation, the heart of
reliable operations. “Whether it is at
Google or elsewhere, monitoring is an absolutely essential component of doing
the right thing in production. If you can’t monitor a service, you don’t know
what’s happening, and if you’re blind to what’s happening, you can’t be
reliable.” Monitoring and automation go hand in hand. Alerts produced by
monitoring should trigger actions which are as automated as possible – back to
the autonomic computing ambition – human intervention should be seen as the
last resort option. Automation should come with heavy instrumentation to feed
the monitoring and ensure proper execution (automation is no guarantee against
human errors at first; only continuous improvement produces fail-proof
scripts). “Running a service with a team
that relies on manual intervention for both change management and event
handling becomes expensive as the service and/or traffic to the service grows,
because the size of the team necessarily scales with the load generated by the
system.”
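A minimal sketch of that “automated action first, human as a last resort” loop; the health check, the systemd unit and the paging stub are placeholders:

```python
import subprocess
import time

def is_healthy() -> bool:
    """Placeholder white-box health check (e.g., probe a /healthz endpoint)."""
    return True

def page_oncall(message: str) -> None:
    """Placeholder for the real paging / alerting integration."""
    print("PAGE:", message)

def remediation_loop(unit: str = "myservice.service", max_restarts: int = 3) -> None:
    restarts = 0
    while True:
        if is_healthy():
            restarts = 0
        elif restarts < max_restarts:
            # Known common fault: apply the automated, well-tested remediation first...
            subprocess.run(["systemctl", "restart", unit], check=True)
            restarts += 1
        else:
            # ...and only page a human when automation has clearly failed.
            page_oncall(f"{unit} still unhealthy after {restarts} automated restarts")
            return
        time.sleep(30)
```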
The regular practice of logging and monitoring provides massive opportunities
for using Artificial Intelligence and Machine Learning in operations. Using
algorithms ranging from classical time series data mining (e.g., with Splunk) to more advanced machine learning
techniques to detect correlations and patterns, one may generate advanced
alerts when the system requires tuning to avoid entering a hazard zone in the
future. Automation and monitoring act on the two components of reliability:
increasing MTBF (fewer errors thanks to automation, early detection and prevention thanks to
monitoring) and reducing MTTR (better diagnosis through monitoring and faster
repair through automation).
2.5. The Most Common Cause of Loss of Service Availability is Change
The SRE team at Google has found that roughly 70% of outages are due to
changes in a live system: “Most outages are caused by some kind of change
— a new configuration, a new feature launch, or a new type of user traffic —
the two teams’ goals are fundamentally in tension”. This tension between
the need for change (and for continuous delivery) and the need for reliable
operations is wonderfully covered by the great book “Accelerate:
The Science of Lean Software and DevOps: Building and Scaling High Performing
Technology Organizations”, by Nicole Forsgren, Jez Humble
and Gene Kim. The only path forward is to automate change and treat it as
seriously as possible at the same time. Companies that succeed in delivering
highly available, reliable quality of service are those that perform
frequent changes.
API management is a core
dimension of modern system engineering, including from an operations
perspective as is illustrated many times in this book. API modularity is
critical for keeping systems simple. API change management is equally critical and helps us understand why reliability engineering and software
engineering are mutually dependent: “While
the modularity that APIs offer may seem straightforward, it is not so apparent
that the notion of modularity also extends to how changes to APIs are
introduced. Just a single change to an API can force developers to rebuild
their entire system and run the risk of introducing new bugs.”
A key component of change management
is Capacity planning. This topic is covered in depth in this book,
with a systemic vision (not one platform at a time, but with a global
perspective). “Good capacity
planning can reduce the probability that a cascading failure will occur.
Capacity planning should be coupled with performance testing to determine the
load at which the service will fail.” System capacity planning is a complex modeling
and engineering task that may leverage techniques from operations research (cf.
the Auxon solver, which formulates a giant mixed-integer or linear program
based upon the optimization requests received from the Configuration Language
Engine): “At Google, many teams have moved to an approach we call
Intent-based Capacity Planning. The basic premise of this approach is to
programmatically encode the dependencies and parameters (intent) of a service’s
needs, and use that encoding to autogenerate an allocation plan that details
which resources go to which service, in which cluster.”
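To give a flavor of what “intent-based” means, here is a toy allocation problem expressed as a linear program (the services, clusters, demands and costs are invented, and Auxon itself is of course far richer):

```python
from scipy.optimize import linprog

# Decision variables: capacity units allocated per (service, cluster), flattened as
# [A@cluster1, A@cluster2, B@cluster1, B@cluster2].
cost = [1.0, 1.2, 1.0, 1.2]   # assume cluster2 is more expensive per unit

A_ub = [
    [-1, -1,  0,  0],   # intent: service A needs at least 300 units in total
    [ 0,  0, -1, -1],   # intent: service B needs at least 200 units in total
    [ 1,  0,  1,  0],   # cluster1 offers at most 350 units
    [ 0,  1,  0,  1],   # cluster2 offers at most 400 units
]
b_ub = [-300, -200, 350, 400]

plan = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 4, method="highs")
print(plan.x)   # auto-generated allocation: which resources go to which service, in which cluster
```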
Capacity planning needs to encapsulate reliability engineering and
propose designs that can operate under the simultaneous occurrence of planned
and unplanned outages. This type of thinking makes “N+2” configurations a
classical pattern to follow. Capacity planning should ensure that an acceptable
service may still be delivered when the two largest instances are unavailable.
Similarly, capacity planning needs to be mixed with load balancing policies. The
book covers load balancing policies (such as Weighted Round Robin) in depth
because they are a critical part of QoS : “Avoiding
overload is a goal of load balancing policies. But no matter how efficient your
load balancing policy, eventually some part of your system will become
overloaded. Gracefully handling overload conditions is fundamental to running a
reliable serving system.” Queue management policies (FIFO, LIFO, CoDel, …) similarly have a key impact on QoS,
which is precisely the topic of the OAI simulations that I ran 10 years ago (cf.
“Self-adaptive and self-healing message passing strategies for process-oriented integration
infrastructures”).
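As a tiny illustration of the load balancing side, here is a naive Weighted Round Robin picker (the weights and backend names are invented; real implementations add health, capacity and subsetting concerns):

```python
import itertools

BACKENDS = {"backend-a": 5, "backend-b": 3, "backend-c": 1}  # weight roughly proportional to capacity

def weighted_round_robin(weights: dict[str, int]):
    """Yield backends in proportion to their weights (simple, non-smooth variant)."""
    ring = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(ring)

picker = weighted_round_robin(BACKENDS)
print([next(picker) for _ in range(9)])  # 5:3:1 mix of backend-a / backend-b / backend-c per cycle
```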
2.6 The Practice of the “Error Budget”
The concept of the “error budget” is a key contribution of Google SRE
practice, where the operations organization finds a “QoS homeostasis” by
balancing its operations requirement with the current state of its services. The
aversion to change and the rigorous check-up of all necessary tests is not cast
in stone but adapts to the current quality of service as seen by the user. The
SRE team “balances reliability and the
pace of innovation with error budgets (see “Motivation for Error Budgets”), which
define the acceptable level of failure for a service, over some period. …As
long as the service hasn’t spent its error budget for the month through the
background rate of errors plus any downtime, the development team is free
(within reason) to launch new features, updates, and so on … If the error budget is spent, the service
freezes changes (except urgent security and bug fixes addressing any cause of
the increased errors) until either the service has earned back room in the
budget, or the month resets.”
The error budget must be managed from a customer-centric perspective. The
“unavailability budget” is computed as the outage time that is seen by the end
user: “Measuring error rates and latency
at the Gmail client, rather than at the server, resulted in a substantial
reduction in our assessment of Gmail availability, and prompted changes to both
Gmail client and server code. The result was that Gmail went from about 99.0%
available to over 99.9% available in a few years.”
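The arithmetic behind an error budget is simple; a small sketch with made-up SLOs and traffic figures:

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed minutes of full unavailability per period for a given availability SLO."""
    return (1.0 - slo) * period_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.1f} minutes of downtime per 30 days")

# Request-based view, measured from the client side: how much of the budget is spent?
bad_requests, total_requests, slo = 12_000, 50_000_000, 0.999
budget_consumed = (bad_requests / total_requests) / (1 - slo)
print(f"error budget consumed this period: {budget_consumed:.0%}")  # 24%: launches may proceed
```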
There is an implicit but crucial point here: the outage/failure/error is
no longer a “bad thing” that must be avoided at all costs, it is an “expected
part of the process of innovation”, something that must be managed rationally. Launches are necessary, but the SRE team has multiple
ways of controlling the launch process : “Google
defines a launch as any new code that introduces an externally visible change
to an application. Depending on a launch’s characteristics — the combination of
attributes, the timing, the number of steps involved, and the complexity — the
launch process can vary greatly. According to this definition, Google sometimes
performs up to 70 launches per week.”
Among the variables that can be acted upon, the SRE team must schedule the
planned downtimes and the disaster
recovery tests. Google spends a lot of energy on simulations of crashes and
live drills. As we shall see later, the only way to develop and certify
recovery capabilities is to test them regularly. Living with a failure is a
part of what it means to run operations. As pointed out in the book, most
people’s reaction when facing a failure is to start troubleshooting instantly and
spend all their energy finding a root cause as quickly as possible. The proper
course of action (as with a human accident) is first to make the system work as
well as possible in a degraded mode. This is the type of behavior that you can
only get with practice. Once again, failures
are a natural occurrence in large-scale complex systems. Minimizing the impact
of these failures is the role of reliability engineering.
2.7 Technical Debt and Recovery Capabilities
Complex systems that evolve constantly are bound to accumulate unnecessary
weight and complexity that should be assessed and removed periodically. This is
the principle of “taking constant care of one’s garden” (removing weeds, pruning,
digging, …), that is, removing
“technical debt” in information systems terms: “Without constant engineering, operations load increases and teams will
need more people just to keep pace with the workload.” This is also why the authors warn us against “over-engineering”
reliability : “that would waste opportunities to add features to the system, clean up
technical debt, or reduce its operational costs.”
When dealing with large-scale systems, data engineering becomes a critical skill. In the past 20 years I
have lived through a number of very significant outages. Throughout these
(tough) experiences, I have noticed that the time to recover, to move or to
install new data sets is always much longer than planned. In most cases, the
recovery plan is executed (hence the disaster is averted) but the total
recovery time is longer than what was planned initially, because unit
operations take longer than expected. Google reports similar stories in this book: “Processes and practices applied to volumes
of data measured in T (terabytes) don’t scale well to data measured in E
(exabytes). Validating, copying, and performing round-trip tests on a few
gigabytes of structured data is an interesting problem.”
A fair number of pages deal with backup and recovery. Obviously, what matters
is recovery; backups are just a tool. As a subcomponent of disaster recovery,
data recovery must be tested regularly. Designing data recovery is a hard
problem. More generally, designing reliable distributed systems is hard. As
mentioned earlier, the book points out the necessity of deep system skills and
expertise: “Today, we hear a brazen
culture of “just show me the code.” A culture of “ask no questions” has grown
up around open source, where community rather than expertise is championed.
Google is a company that dared to think about the problems from first
principles, and to employ top talent with a high proportion of PhDs.”
The proposed approach is to first create SRE teams with the proper mix
of expertise: “To this end, Google always strives to staff its SRE
teams with a mix of engineers with traditional software development experience
and engineers with systems engineering experience.” Then for these teams to succeed, it is critical that the
senior management recognizes reliability and quality of service as a strategic
imperative for the company. In all the companies that I know, there are many anonymous
“operations heroes” who keep the systems running as well as possible. What distinguishes
the digital champions such as Google is the recognition that these individuals
and teams play a major role in the value creation of the company and deserve a
just recognition: “Another way to
get started on the path to improving reliability for your organization is to
formally recognize that work, or to find these people and foster what they do —
reward it”.
2.8 SRE: a Team with End-to-End Responsibility for Operations
SREs are hybrid production teams, which combine ops and software
engineering capabilities (they aim at making Google systems run themselves)
with a dual ambition of customer satisfaction and continuous innovation. “SRE is concerned with several aspects of a service, which are
collectively referred to as production. These aspects include the following:
System architecture and interservice dependencies; Instrumentation, metrics,
and monitoring; Emergency response, Capacity planning, Change management; Performance: availability, latency, and
efficiency”.
SRE teams follow many of the DevOps principles (software engineering and
operations skills working as a team, involvement of IT function in each phase
of a system’s design and development, heavy reliance on automation, …). SRE
teams are associated with a service (or a service domain): they own the
responsibility of running this service: “In
general, an SRE team is responsible for the availability, latency, performance,
efficiency, change management, monitoring, emergency response, and capacity
planning of their service(s).”
Somewhere
in the book, the authors note that most of the concepts, principles and techniques
reported here are not special, but rather part of well-accepted
state-of-the-art practice. However, they also note that a surprising number of
ops teams do not take these practices (such as capacity planning) seriously, resulting
in unnecessary failures.
As stated earlier, SREs make heavy use of blameless postmortems as a tool for training and evangelization. “Google’s Postmortem Philosophy: The primary
goals of writing a postmortem are to ensure that the incident is documented,
that all contributing root cause(s) are well understood, and, especially, that
effective preventive actions are put in place to reduce the likelihood and/or
impact of recurrence.” Blameless here means that the root cause analysis
focuses “on identifying the contributing
causes of the incident without indicting any individual or team for bad or
inappropriate behavior”. Postmortems
are not meant to be kept within the SRE team but to be shared as extensively as
possible in the company.
Another key SRE ritual is the production meeting, where SRE
team members orchestrate the necessary knowledge sharing between all concerned
stakeholders (including product owners and software architects) : “In general, these meetings are service-oriented; they are not directly
about the status updates of individuals. The goal is for everyone to leave the
meeting with an idea of what’s going on — the same idea. The other major goal
of production meetings is to improve our services by bringing the wisdom of
production to bear on our services”. A key role in the SRE team is the Launch Coordination
Engineer, which requires “strong
communication and leadership skills” to bring everyone together. In a true
DevOps spirit, when the operational load is too heavy, product development
teams should contribute so that SREs keep a balance between incident management
and continuous improvement. System architects should attend these production
meetings regularly to collect up-to-date availability, latency and throughput
figures, which are absolutely necessary for reliable system engineering. Architecture
schemas that are purely functional lead to disastrous failures in operations.
2.9 The Lean and TQM roots of Google SREs
Throughout the book, the influence of Total Quality Management (e.g.,
Deming) and Lean Management is quite visible. There is a constant effort to
organize work so as to avoid overloads and work with a regular activity flow. W. Edwards
Deming’s famous quote “cherish your mistakes” finds many echoes here: “Google operates under a blame-free
postmortem culture, with the goal of exposing faults and applying engineering
to fix these faults, rather than avoiding or minimizing them.”
Running complex operations, especially disaster recovery and incident
management, is more efficient with a playbook (lean “standard”) : “When humans are necessary, we have found
that thinking through and recording the best practices ahead of time in a
“playbook” produces roughly a 3x improvement in MTTR as compared to the
strategy of “winging it”.” Another lean influence is the importance of “system
understanding” and making sure that indicators are picked sparingly and wisely. What
matters here is the ability for the team members to evaluate and reason
properly about a system’s health: “Ineffective troubleshooting sessions
are plagued by problems at the Triage, Examine, and Diagnose steps, often
because of a lack of deep system understanding.”
The system engineering culture has many aspects in common with Lean
Six Sigma, namely the heavy use of statistics, including the use of distributions
rather than mean values: “We generally
prefer to work with percentiles rather than the mean (arithmetic average) of a
set of values. Doing so makes it possible to consider the long tail of
data points, which often have significantly different (and more interesting) characteristics
than the average”. Performance and reliability engineering require taking
more than the average case into account.
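A tiny illustration with a synthetic, long-tailed latency sample (the numbers are invented):

```python
import numpy as np

# Synthetic latency sample (ms): 99% of requests are fast, 1% hit a slow path.
rng = np.random.default_rng(0)
latencies = np.concatenate([rng.normal(100, 10, 9_900), rng.normal(2_000, 300, 100)])

print(f"mean : {latencies.mean():7.1f} ms")                # hides the shape of the tail
print(f"p50  : {np.percentile(latencies, 50):7.1f} ms")
print(f"p99  : {np.percentile(latencies, 99):7.1f} ms")
print(f"p99.9: {np.percentile(latencies, 99.9):7.1f} ms")  # what the unluckiest users experience
```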
Another lean trait is the search for constant elimination of dead weight
(muda in the lean
sense) which we mentioned earlier : “The
term “software bloat” was coined to describe the tendency of software to become
slower and bigger over time as a result of a constant stream of additional
features.” “Software bloating” or
“feature creep” is a common
plague of software systems that is a clear instance of the Tragedy of the
Commons. Adding more features always comes from a local view, where the benefit of the new feature is amplified and
the systemic cost of adding it to the whole system is downplayed if not ignored. Encouraging
the team to adopt a “one system”
mindset forces a re-evaluation with less emphasis on the new feature and more on
the global resource consumption. Many resources such as available time (in an
ops planning), bandwidth, or memory capacity lead to a “tragedy of the commons” situation where the sum of “small local
incremental usages” translates into a global over-consumption.
3. Conclusion
Although this book review is actually quite shallow compared to the
depth of the book, it is already substantial compared to what people expect
when discussing digital transformation. My intent here is to underline the importance
of execution in any digital transformation. As stated in the introduction and
wonderfully explained in “The Web Giants”, quality of service is a crucial part
of the customer experience. A company cannot be a digital leader without
mastering the skills and the practices that are described in this book. The
illustration on the right is taken from the Twitter launch of the NATF
report on Artificial Intelligence and Machine Learning. What it says is
that the Google SRE book should be part of the skill set of any company that
aims to develop a leading service based on data and AI. Engineering matters.
Site Reliability Engineering is foremost an ambition about people and
culture. Here is a very short summary of an ideal “End to End” operations
organization:
- It is made of small autonomous teams that orchestrate, for instance through production meetings as well as postmortems, all the stakeholders involved in the delivery of one (end-to-end) service.
- The team owns the responsibility of this service: availability (reliability), latency & throughput, capacity planning, change management, monitoring, incident management and recovery.
- This should be a customer-centric organization which is both measure-driven (analytics) and adaptive (flexible). Its operations guidelines should fluctuate, using an “error budget”, to deliver a stable quality of service and “change throughput” at the same time.
- This team is necessarily “progress-centric” and operates under continuous improvement using postmortems and root cause analysis. The team has the diverse skill set needed to own and understand the complexity of the systems that it operates.
- The organization strives towards as much automation as possible. This requires both enrolling software engineering skills and letting those engineers operate on production systems.
This book is mostly about reliability engineering but, as shown in
the previous section, it is also a textbook about distributed systems
engineering. Here is a short set of principles for building reliable
distributed systems that I have extracted from this reading and which I will
comment on in a future post (the reference to “Biology of Distributed Information Systems” – the
title chosen for this blog 10 years ago – is quite deep):
- Leverage APIs to develop a multi-modal cell structure with variable rate of change, when change flows from the outside towards the inside of the whole system (like a cell)
- Organize complexity in layers where advanced systems are backed up by simpler “life-support” systems, the lower the complexity the higher the availability (biomimicry)
- Develop an event-driven flow architecture to design reactive systems that are scalable, reliable and open
- Use self-monitoring and “digital twin” to provide self-adaptation, self-optimization and self-healing
- Journey towards abstraction to move to “serverless” systems (necessary to combine SaaS, cloud and on-premise operations)
- Leverage lean thinking to deliver robustness and flexibility through more available capacity
- Don’t fight CAP (theorem) and develop a reliable, high-availability, eventually consistent data architecture (think of data consistency as a movie rather than a snapshot).