Wednesday, October 31, 2018

From Digital Transformation to Service Architecture and Reliability Engineering

1. Introduction

This post talks about Digital Transformation from an operational excellence perspective. I propose a simplified book review of “the textbook” on reliability engineering as a path to hammer some key principles that make digital systems efficient. There should be no surprises with this approach, since I have been advocating for a while in my blogs that:
Systems’ quality of service (QoS) is a topic that is very close to my heart. I do not consider myself a true expert, since I have not spent decades running systems full time, day in and day out (which is what it takes to be a true Ops expert), but I have long experience thinking about and observing firsthand what makes systems reliable:
  • I was privileged to get my first exposure to availability as a teenager, through lectures from my late father Paul Caseau on topics such as queuing theory, Jackson networks and Markov processes. I was taught about MTBF and MTTR through practical examples such as shoes when I was still very young.
  • I started my career as an Operations Research scientist with many years spent on scheduling, planning and routing, with the later addition of stochastic optimization.
  • I then got my first exposure to real-life QoS issues as Bouygues Telecom CIO 15 years ago. This resulted in my attempt to formalize what I saw in books, a new piece of simulation and research about SLA called “Optimization of Application Integration” and later the opening of this blog. The paper “Self-adaptive Middleware: Supporting Business Processes and Service Level Agreements” is very similar to what this post will describe.
  • I had the remarkable opportunity of a few in-depth conversations with key SRE people at Google including world-class stars that are quoted in the book that I will present today.
  • Last, my experience as a lecturer at Ecole Polytechnique has helped me to formalize what I have learned from the trenches.

I plan to share a detailed summary of the “Google SRE” book because most of what I have learned and know from experience may be found in this book. On the one hand, this is a book about large-scale distributed systems and how to design and operate them in a reliable manner. This is a complicated topic and there is a wealth of knowledge to learn and share. On the other hand, the technology “lego box” has changed: things that were very hard to design 10 years ago are much easier today thanks to open source technologies such as cloud, containers, distributed orchestration, distributed storage platforms, etc. High availability is no longer a “high end” feature, and many small companies, including startups, build high-availability architectures with QoS performance that would have been hard to achieve for a telco 20 years ago. Many of the open source pieces that created this “reliability revolution” come from Google components that are described in the SRE book. Readers of this blog who know about “Autonomic Computing” will notice that Google has done nothing less than deliver on an ambition proposed by IBM over 15 years ago. Self-monitoring, self-optimization, self-provisioning and self-healing are characteristics of modern reliable systems.

This is not a linear book review, because the material is very deep (the book is 475 pages long). I selected nine key topics that are closely related to digital transformation and to my experience as a CIO; then I tried to summarize the key insights from the book. Managers are not the main audience for this book, which is foremost written for practitioners. For technical readers this book is a treasure trove of insights, examples, and practical advice. However, I strongly suggest that anyone open this book and read at least the first hundred pages.

2. Google Site Reliability Engineering

The book “Site Reliability Engineering - How Google runs production systems”, edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy, is a collective book from Google SRE members: “This book is a series of essays written by members and alumni of Google’s Site Reliability Engineering organization. It’s much more like conference proceedings than it is like a standard book by an author or a small number of authors”. This heavy book is filled with case stories and technical details about the different tools that are used by Google engineers. Although the 33 chapters are mostly focused on practical issues and problems, this is also a principled book about computing: “We apply the principles of computer science and engineering to the design and development of computing systems: generally, large distributed ones”. True to the principles that I will present very soon, the book is rich with failure analysis and “anatomy of unmanaged incident” examples.

The relationship with the theme of this blog, “biology of distributed information systems” will become self-evident when you read this post. As mentioned earlier, the ambition of “autonomic computing” is also deeply embedded: “The global computer … must be self-repairing to operate once it grows past a certain size, due to the essentially statistically guaranteed large number of failures taking place every second”. 

2.1. Information System Reliability is a Strategic Imperative

The book starts with the observation that the real world of information systems is chaotic, both because of the size and the complexity, but also because of the astounding number of changes that affect the systems daily. The first chapter is written by Ben Treynor Sloss, Google’s VP for 24/7 Operations, originator of the term SRE, who claims that reliability is the most fundamental feature of any product. Ben Treynor is the father of SRE: “SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a “Production Team” of seven engineers, my entire life up to that point had been software engineering”. SRE is the story of a quest towards reliability through automation and simplification. He quotes C.A.R. Hoare in his Turing Award lecture: “The price of reliability is the pursuit of the utmost simplicity”.

SRE is the brainchild of a scientist and engineer who looks at failure and reliability with cold, unemotional eyes and tries to maximize the outcome while reducing the effort. By construction it is a “DevOps” approach where the skills of software engineering are applied to improving operations. Obviously, this approach draws from the scaling and size issues experienced at Google over the years: “Ensuring that the cost of maintenance scales sublinearly with the size of the service is key to making monitoring (and all sustaining operations work) maintainable”.

SREs (Site Reliability Engineers) work with the complete product team, from product owners and development teams to operations specialists and partners. Following engineering practices, business and user goals are translated into explicit and measurable goals that can be “engineered”. With no surprise to readers of this blog, key metrics that are used throughout the book are availability, latency and throughput: “User-facing serving systems … generally care about availability, latency, and throughput”. As an engineer, one knows that a price must be paid for any complex requirement, including performance and reliability. Failures and mistakes are managed in a cold engineering approach, without exaggerating the requirements. In many places, the book warns us against “over-safe, over-expensive” approaches and urges us to keep looking for a balance.

2.2. Distributed Systems

As mentioned in the introduction, this is a book about distributed systems: “As SREs, we work with large-scale, complex, distributed systems.” Distributed systems engineering is a wonderfully exciting discipline, albeit a difficult one. It takes time, energy and dedication to become an expert in such matters. It also takes humility and curiosity: “Systems are complex. It’s quite likely that there are multiple factors, each of which individually is not the cause, but which taken jointly are causes. Real systems are also often path-dependent, so that they must be in a specific state before a failure occurs.”

The emphasis on “distributed” here means that one must learn to step back and look at the whole system, not the platform or the component that is actively being built. One of my key actions as a manager is to promote the “One System” culture, which is the understanding that we (as an organization) are building ONE large, complex and distributed system. This requires observing and sharing: “There are many ways to simplify and speed troubleshooting. Perhaps the most fundamental are: Building observability — with both white-box metrics and structured logs — into each component from the ground up. Designing systems with well-understood and observable interfaces between components”.

The emphasis on logging and configuration management is obvious throughout the book. There are far too many practical recommendations, such as the use of different verbosity levels, to be reproduced here in this summary. Some of them I learned many years ago at Bellcore, when working with true distributed system experts such as my friend Quoc-Bao Nguyen. Being able to debug, re-parameterize and re-configure without shutting the system down has been a good practice for decades. On the other hand, distributed issues that used to be very hard to solve, such as distributed locking, are now easier because of the wealth of scalable and robust open-source solutions. Still, one must understand the complex nature of distributed systems and learn about tested protocols: “Whenever you see leader election, critical shared state, or distributed locking, we recommend using distributed consensus systems that have been formally proven and tested thoroughly. Informal approaches to solving this problem can lead to outages, and more insidiously, to subtle and hard-to-fix data consistency problems that may prolong outages in your system unnecessarily”.
The same remark can be made about the complexity of distributed large-scale storage. One must understand and accept the CAP theorem and learn to live with eventual consistency (or learn to live with right-time versus real-time) : “A growing number of distributed datastore technologies provide a different set of semantics known as BASE (Basically Available, Soft state, and Eventual consistency). Datastores that support BASE semantics have useful applications for certain kinds of data and can handle large volumes of data and transactions that would be much more costly, and perhaps altogether infeasible, with datastores that support ACID semantics”.

As mentioned earlier, the book is full of examples of self-adaptive mechanisms to make distributed systems more reliable. Techniques that are heavily used in network management, such as exponential decay, obviously have their place at the application level. Throttling is a good example of self-adaptation: “Client-side throttling addresses this problem. When a client detects that a significant portion of its recent requests have been rejected due to “out of quota” errors, it starts self-regulating and caps the amount of outgoing traffic it generates. Requests above the cap fail locally without even reaching the network.”
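
The client-side throttling behaviour quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ClientThrottle`, the sliding window of recent answers, the multiplier `k` and the rejection formula `max(0, (requests - k * accepts) / (requests + 1))` are choices made for this sketch, not a description of Google's implementation. Requests that fail locally are not recorded, which keeps the example short.

```python
import random
from collections import deque


class ClientThrottle:
    """Illustrative client-side throttle (hypothetical names and parameters)."""

    def __init__(self, k=2.0, window=100, rng=random.random):
        self.k = k                            # assumed tolerance multiplier
        self.history = deque(maxlen=window)   # True = backend accepted, False = rejected
        self.rng = rng                        # injectable for deterministic tests

    def reject_probability(self):
        """Probability of failing the next request locally."""
        requests = len(self.history)
        accepts = sum(self.history)
        return max(0.0, (requests - self.k * accepts) / (requests + 1))

    def allow(self):
        """Return True if the next request may be sent over the network."""
        return self.rng() >= self.reject_probability()

    def record(self, accepted):
        """Record the backend's answer for a request that was actually sent."""
        self.history.append(accepted)


throttle = ClientThrottle(window=10)
for _ in range(10):
    throttle.record(False)                    # backend keeps answering "out of quota"
print(round(throttle.reject_probability(), 2))  # 0.91
```

As long as the backend accepts most requests the rejection probability stays at zero; it only rises once rejections dominate the recent window, and it falls back by itself when the backend recovers.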

2.3 Distributed Systems Reliability at Google

Google has developed a wealth of know-how on distributed systems reliability. The goal of this book is to share some of that knowledge because it is widely applicable in a context larger than Google or the web community. Reliability is not a set of simple recipes; it is a discipline of redundant protection through multiple layers: “Given the many ways data can be lost (as described previously), there is no silver bullet that guards against the many combinations of failure modes. Instead, you need defense in depth. Defense in depth comprises multiple layers, with each successive layer of defense conferring protection from progressively less common data loss scenarios”. Some of the techniques, such as idempotent scripts, are now well understood in the DevOps community, but this book offers a comprehensive survey which should be useful to most of us who are not true geeks.

The model of the SRE team is to build a small team of highly trained ops specialists with enough software engineering expertise to understand and manage the systems: “Ultimately, SRE’s goal is to follow a similar course. An SRE team should be as compact as possible and operate at a high level of abstraction, relying upon lots of backup systems as failsafes and thoughtful APIs to communicate with the systems” … “In order to work at scale, teams must be self-sufficient. Release engineering has developed best practices and tools that allow our product development teams to control and run their own release processes”. This dual DevOps capability is critical to perform “ProdTests” (tests on the production environment). An SRE team should be autonomous in its decision process, but it works together with the rest of the organization (SRE teams supplement the engineering organization; they do not replace it).

How does one develop these highly trained specialists? By doing and mostly by learning from previous failures: “There is no better way to learn than to document what has broken in the past. History is about learning from everyone’s mistakes. Be thorough, be honest, but most of all, ask hard questions”. The customer focus is critical and visible throughout the book. SRE must understand the user perspective and take extra care of customer-facing systems: “The frontend infrastructure consists of reverse proxy and load balancing systems running close to the edge of our network. These are the systems that, among other things, serve as one endpoint of the connections from end users (e.g., terminate TCP from the user’s browser). Given their critical role, we engineer these systems to deliver an extremely high level of reliability.”

One key responsibility of SREs is to assist release engineering, that is, how to test and release a new version of a software component or service. There is a lot of emphasis on gradual releases, which goes hand in hand with continuous release: “Almost all updates to Google’s services proceed gradually, according to a defined process, with appropriate verification steps interspersed. A new server might be installed on a few machines in one datacenter and observed for a defined period of time. If all looks well, the server is installed on all machines in one datacenter, observed again, and then installed on all machines globally.”
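
The gradual release process quoted above (a few machines, then one datacenter, then everywhere) can be sketched as a small staged loop. This is a minimal sketch under assumptions: the function `staged_rollout`, the `install` and `healthy` callbacks and the `canary_size` parameter are hypothetical names, and the "observed for a defined period of time" step is reduced to a single health check per stage.

```python
def staged_rollout(machines_by_datacenter, install, healthy, canary_size=2):
    """Install on a few canaries, then one datacenter, then all remaining machines.

    Returns the list of machines updated; stops after the first stage that
    leaves any updated machine unhealthy.
    """
    datacenters = list(machines_by_datacenter)
    first = datacenters[0]
    canaries = machines_by_datacenter[first][:canary_size]
    rest_of_first = machines_by_datacenter[first][canary_size:]
    everything_else = [m for dc in datacenters[1:] for m in machines_by_datacenter[dc]]

    updated = []
    for stage in (canaries, rest_of_first, everything_else):
        for machine in stage:
            install(machine)
            updated.append(machine)
        # Verification step between stages (stands in for the observation period).
        if not all(healthy(m) for m in updated):
            break  # halt here; a real process would also roll back
    return updated


fleet = {"dc1": ["a", "b", "c"], "dc2": ["d", "e"]}
print(staged_rollout(fleet, install=lambda m: None, healthy=lambda m: m != "b"))
# ['a', 'b']
```

The point of the staging is that a bad release is caught while only the canaries are affected, which limits the blast radius of each change.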

2.4 Automation and Monitoring

As the book points out, the simplest and most powerful principles behind reliability engineering are automation and monitoring. Automation is the natural solution to the efficiency/sublinear requirement made earlier, and the best way to remove a large share of the human errors that are still the root of most system failures. “There’s an additional benefit for systems where automation is used to resolve common faults in a system (a frequent situation for SRE-created automation). If automation runs regularly and successfully enough, the result is a reduced mean time to repair (MTTR) for those common faults.” The history of SRE, as told by Ben Treynor, comes from the automation of operations by software engineers. Automation is a continuous task, not a once-and-for-all milestone. “Automation code, like unit test code, dies when the maintaining team isn’t obsessive about keeping the code in sync with the codebase it covers.”

Operations need to be automated and monitored. Monitoring is, as told earlier when we talked about the necessity of observation, the heart of reliable operations. “Whether it is at Google or elsewhere, monitoring is an absolutely essential component of doing the right thing in production. If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, you can’t be reliable.” Monitoring and automation go hand in hand. Alerts produced by monitoring should trigger actions that are as automated as possible (back to the autonomic computing ambition); human intervention should be seen as the last resort. Automation should come with heavy instrumentation to feed the monitoring and ensure proper execution (automation is no guarantee against human error at first; only continuous improvement produces fail-proof scripts). “Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.”

The regular practice of logging and monitoring provides massive opportunities for using Artificial Intelligence and Machine Learning for operations. Using algorithms ranging from classical time series data mining (e.g., with Splunk) to more advanced machine learning techniques to detect correlations and patterns, one may generate advanced alerts when the system requires tuning to avoid entering a hazard zone in the future. Automation and monitoring act on the two components of reliability: increasing MTBF (fewer errors with automation, early detection and prevention with monitoring) and reducing MTTR (better diagnosis with monitoring and faster repair through automation).
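
To make the role of the two components concrete, here is a small sketch based on the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The formula is textbook reliability theory rather than something taken from the SRE book, and the figures used (500 hours between failures, 2 hours to repair) are made up for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(mtbf_hours=500, mttr_hours=2)       # hypothetical service
fewer_failures = availability(mtbf_hours=1000, mttr_hours=2)  # failures half as frequent
faster_repair = availability(mtbf_hours=500, mttr_hours=1)    # repairs twice as fast

print(f"{baseline:.4%} {fewer_failures:.4%} {faster_repair:.4%}")
```

Availability only depends on the ratio MTTR / MTBF, so halving the repair time buys exactly as much as doubling the time between failures. This is one reason why investing in faster diagnosis and automated repair is often as valuable as preventing failures in the first place.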

2.5. The Most Common Cause of Loss of Service Availability is Change

The SRE team at Google has found that roughly 70% of outages are due to changes in a live system: “Most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension”. This tension between the need for change (and for continuous delivery) and the need for reliable operations is wonderfully covered by the great book “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”, by Nicole Forsgren, Jez Humble and Gene Kim. The only path forward is to automate change and treat it as seriously as possible at the same time. Companies that succeed in delivering highly available, reliable quality of service are those that perform frequent changes.

API management is a core dimension of modern system engineering, including from an operations perspective, as is illustrated many times in this book. API modularity is critical for keeping systems simple. API change management is equally critical and helps to understand why reliability engineering and software engineering are mutually dependent: “While the modularity that APIs offer may seem straightforward, it is not so apparent that the notion of modularity also extends to how changes to APIs are introduced. Just a single change to an API can force developers to rebuild their entire system and run the risk of introducing new bugs.”

A key component of change management is capacity planning. This topic is covered in depth in this book, with a systemic vision (not only one platform at a time, but with a global perspective). “Good capacity planning can reduce the probability that a cascading failure will occur. Capacity planning should be coupled with performance testing to determine the load at which the service will fail.” System capacity planning is a complex modeling and engineering task that may leverage techniques from operations research (cf. the Auxon solver that formulates a giant mixed-integer or linear program based upon the optimization request received from the Configuration Language Engine): “At Google, many teams have moved to an approach we call Intent-based Capacity Planning. The basic premise of this approach is to programmatically encode the dependencies and parameters (intent) of a service’s needs, and use that encoding to autogenerate an allocation plan that details which resources go to which service, in which cluster.”

Capacity planning needs to encapsulate reliability engineering and propose designs that can operate under the simultaneous occurrence of planned and unplanned outages. This type of thinking makes “N+2” configurations a classical pattern to follow. Capacity planning should ensure that an acceptable service may still be delivered when the two largest instances are unavailable. Similarly, capacity planning needs to be mixed with load balancing policies. The book covers load balancing policies (such as Weighted Round Robin) in depth because they are a critical part of QoS: “Avoiding overload is a goal of load balancing policies. But no matter how efficient your load balancing policy, eventually some part of your system will become overloaded. Gracefully handling overload conditions is fundamental to running a reliable serving system.” Queue management policies (FIFO, LIFO, CoDel, …) similarly have a key impact on QoS, which is precisely the topic of the OAI simulations that I ran 10 years ago (cf. “Self-adaptive and self-healing message passing strategies for process-oriented integration infrastructures”).
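
The “N+2” rule described above translates directly into a check that can run as part of capacity planning. The sketch below is illustrative: the function name `survives_n_plus_2`, the notion of a single scalar capacity per instance and the sample numbers are assumptions made for the example.

```python
def survives_n_plus_2(instance_capacities, peak_load, spare=2):
    """True if peak_load can still be served after losing the `spare` largest instances."""
    ordered = sorted(instance_capacities)
    remaining = ordered[:-spare] if spare else ordered
    return sum(remaining) >= peak_load


# Five instances of 100 requests/s each, with a peak load of 300 requests/s:
print(survives_n_plus_2([100, 100, 100, 100, 100], peak_load=300))  # True
# Same total capacity, but concentrated in two large instances:
print(survives_n_plus_2([200, 200, 50, 50], peak_load=300))         # False
```

Removing the largest instances first is the pessimistic case: one planned outage plus one unplanned failure may well hit the biggest machines, so a design that passes this check passes any other pair of losses as well.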

2.6 The Practice of the “Error Budget”

The concept of the “error budget” is a key contribution of Google SRE practice, where the operations organization finds a “QoS homeostasis” by balancing its operations requirement with the current state of its services. The aversion to change and the rigorous check-up of all necessary tests are not cast in stone but adapt to the current quality of service as seen by the user. The SRE team “balances reliability and the pace of innovation with error budgets (see “Motivation for Error Budgets”), which define the acceptable level of failure for a service, over some period. … As long as the service hasn’t spent its error budget for the month through the background rate of errors plus any downtime, the development team is free (within reason) to launch new features, updates, and so on. If the error budget is spent, the service freezes changes (except urgent security and bug fixes addressing any cause of the increased errors) until either the service has earned back room in the budget, or the month resets.”
The error budget must be managed from a customer-centric perspective. The “unavailability budget” is computed as the outage time that is seen by the end user: “Measuring error rates and latency at the Gmail client, rather than at the server, resulted in a substantial reduction in our assessment of Gmail availability, and prompted changes to both Gmail client and server code. The result was that Gmail went from about 99.0% available to over 99.9% available in a few years.”
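
The arithmetic behind an error budget is simple enough to sketch. The example below assumes an availability objective measured as downtime over a 30-day period; the function names `error_budget_minutes` and `can_launch` are hypothetical, and the exception for urgent security and bug fixes mentioned in the quote is left out.

```python
def error_budget_minutes(slo, period_days=30):
    """Allowed unavailability over the period for a given availability objective."""
    return (1.0 - slo) * period_days * 24 * 60


def can_launch(slo, downtime_minutes, period_days=30):
    """Launches stay open while the budget for the period is not spent."""
    return downtime_minutes < error_budget_minutes(slo, period_days)


print(round(error_budget_minutes(0.99), 1))    # 432.0 minutes per 30 days
print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes per 30 days
print(can_launch(0.999, downtime_minutes=30))  # True: budget not yet spent
print(can_launch(0.999, downtime_minutes=50))  # False: changes are frozen
```

Moving from 99.0% to 99.9%, as in the Gmail example, divides the budget by ten, from a little over seven hours to about 43 minutes per month, which shows how quickly each additional “nine” constrains the pace of change.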

There is an implicit but crucial point here: the outage/failure/error is no longer a “bad thing” that must be avoided at all costs, it is an “expected part of the process of innovation”, something that must be managed rationally.  Launches are necessary, but the SRE team has multiple ways of controlling the launch process : “Google defines a launch as any new code that introduces an externally visible change to an application. Depending on a launch’s characteristics — the combination of attributes, the timing, the number of steps involved, and the complexity — the launch process can vary greatly. According to this definition, Google sometimes performs up to 70 launches per week.”

Among the variables that can be acted upon, the SRE team must schedule the planned downtimes and the disaster recovery tests. Google spends a lot of energy on simulations of crashes and live drills. As we shall see later, the only way to develop and certify recovery capabilities is to test them regularly. Living with failure is part of what it means to run operations. As pointed out in the book, most people’s reaction when facing a failure is to start troubleshooting instantly and spend all their energy on finding a root cause as quickly as possible. The proper course of action (as with a human accident) is to first make the system work as well as possible in a degraded mode. This is the type of behavior that you can only get with practice. Once again, failures are a natural occurrence in large-scale complex systems. Minimizing the impact of these failures is the role of reliability engineering.

2.7 Technical Debt and Recovery Capabilities

Complex systems that evolve constantly are poised to produce unnecessary weight and complexity that should be assessed and removed periodically. This is the principle of “taking constant care of one’s garden” (removing weeds, pruning, digging, …), that is, removing “technical debt” in information systems parlance: “Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.” This is also why the authors warn us against “over-engineering” reliability: “that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.”

When dealing with large-scale systems, data engineering becomes a critical skill. In the past 20 years I have lived through a number of very significant outages. Throughout these (tough) experiences, I have noticed that the time to recover, to move or to install new data sets is always much longer than planned. In most cases, the recovery plan is executed (hence the disaster is averted) but the total recovery time is longer than what was planned initially, because unit operations take longer. Google reports similar stories in this book: “Processes and practices applied to volumes of data measured in T (terabytes) don’t scale well to data measured in E (exabytes). Validating, copying, and performing round-trip tests on a few gigabytes of structured data is an interesting problem.”

A fair number of pages deal with backup and recovery. Obviously, what matters is recovery; backups are just a tool. As a subcomponent of disaster recovery, data recovery must be tested regularly. Designing data recovery is a hard problem. More generally, designing reliable distributed systems is hard. As mentioned earlier, the book points out the necessity of deep system skills and expertise: “Today, we hear a brazen culture of “just show me the code.” A culture of “ask no questions” has grown up around open source, where community rather than expertise is championed. Google is a company that dared to think about the problems from first principles, and to employ top talent with a high proportion of PhDs.”

The proposed approach is to first create SRE teams with the proper mix of expertise: “To this end, Google always strives to staff its SRE teams with a mix of engineers with traditional software development experience and engineers with systems engineering experience.” Then, for these teams to succeed, it is critical that senior management recognizes reliability and quality of service as a strategic imperative for the company. In all companies that I know, there are many “operations anonymous heroes” that keep the systems running as well as possible. What distinguishes digital champions such as Google is the recognition that these individuals and teams play a major role in the value creation of the company and deserve just recognition: “Another way to get started on the path to improving reliability for your organization is to formally recognize that work, or to find these people and foster what they do — reward it”.

2.8 SRE: a Team with End-to-End Responsibility for Operations

SREs are hybrid production teams, which combine ops and software engineering capabilities (they aim at making Google systems run themselves) with a dual ambition of customer satisfaction and continuous innovation. “SRE is concerned with several aspects of a service, which are collectively referred to as production. These aspects include the following: System architecture and interservice dependencies; Instrumentation, metrics, and monitoring; Emergency response, Capacity planning, Change management; Performance: availability, latency, and efficiency”.

SRE teams follow many of the DevOps principles (software engineering and operations skills working as a team, involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation, …). SRE teams are associated with a service (or a service domain): they own the responsibility of the run for this service: “In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).”

Somewhere in the book, the authors note that most of the concepts, principles and techniques reported here are not special but rather part of a well-accepted state-of-the-art practice. However, they also note that a surprising number of ops teams do not take practices such as capacity planning seriously, resulting in unnecessary failures.

As stated earlier, SREs make heavy use of blameless postmortems as a tool for training and evangelization. “Google’s Postmortem Philosophy: The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.” Blameless here means that the root cause analysis focuses “on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior”. Postmortems are not meant to be kept within the SRE team but to be shared as extensively as possible in the company.

Another key ritual of the SRE team is the production meeting, where SRE team members orchestrate the necessary knowledge sharing between all concerned stakeholders (including product owners and software architects): “In general, these meetings are service-oriented; they are not directly about the status updates of individuals. The goal is for everyone to leave the meeting with an idea of what’s going on — the same idea. The other major goal of production meetings is to improve our services by bringing the wisdom of production to bear on our services”. A key role in the SRE team is the Launch Coordination Engineer, which requires “strong communication and leadership skills” to bring everyone together. In a true DevOps spirit, when the operational load is too heavy, product development teams should contribute so that SREs keep a balance between incident management and continuous improvement. System architects should attend these production meetings regularly to collect up-to-date availability, latency and throughput figures, which are absolutely necessary for reliable system engineering. Architecture schemas that are purely functional lead to disastrous failures in operations.

2.9 The Lean and TQM roots of Google SREs

Throughout the book, the influence of Total Quality Management (e.g., Deming) and Lean Management is quite visible. There is a constant effort to organize work so as to avoid overloads and work with a regular activity flow. W. Edwards Deming’s famous quote “cherish your mistakes” finds many echoes here: “Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.”

Running complex operations, especially disaster recovery and incident management, is more efficient with a playbook (lean “standard”): “When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.”” Another lean influence is the importance of “system understanding” and making sure that indicators are picked sparingly and wisely. What matters here is the ability for the team members to evaluate and reason properly about a system’s health: “Ineffective troubleshooting sessions are plagued by problems at the Triage, Examine, and Diagnose steps, often because of a lack of deep system understanding.”

The system engineering culture has many aspects in common with Lean Six Sigma, namely the heavy use of statistics, including the use of distributions rather than mean values: “We generally prefer to work with percentiles rather than the mean (arithmetic average) of a set of values. Doing so makes it possible to consider the long tail of data points, which often have significantly different (and more interesting) characteristics than the average”. Performance and reliability engineering require taking more than the average case into account.
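To make the point about percentiles concrete, here is a minimal sketch in Python. The latency values are invented for illustration (they do not come from the book); the only purpose is to show how a mean can hide a long tail that the high percentiles reveal.

```python
import statistics

# Hypothetical request latencies in milliseconds (illustrative values only):
# most requests are fast, a few are slow, one is very slow.
latencies_ms = [20] * 95 + [400] * 4 + [2000]

mean_ms = statistics.mean(latencies_ms)

# quantiles(n=100) returns the 99 cut points that separate the percentiles;
# index 49 is the 50th percentile, index 98 the 99th.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50_ms, p95_ms, p99_ms = cuts[49], cuts[94], cuts[98]

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p95={p95_ms} ms, p99={p99_ms} ms")
```

With such a distribution, the median describes the typical request, the mean describes no actual request, and only the 99th percentile shows what the unluckiest users experience.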

Another lean trait is the constant search for the elimination of dead weight (muda in the lean sense), which we mentioned earlier: “The term “software bloat” was coined to describe the tendency of software to become slower and bigger over time as a result of a constant stream of additional features.” “Software bloating” or “feature creep” is a common plague of software systems and a clear instance of the Tragedy of the Commons. Adding more features always comes from a local view, where the benefit of the new feature is amplified and the systemic cost of adding to the whole system is reduced if not ignored. Encouraging the team to adopt a “one system” mindset forces a re-evaluation that puts less emphasis on the new feature and more on the global resource consumption. Many resources such as available time (in an ops planning), bandwidth, or memory capacity lead to a “tragedy of the commons” situation where the sum of “small local incremental usages” translates into a global over-consumption.

3. Conclusion

Although this book review is actually quite shallow compared to the depth of the book, it is already substantial compared to what people expect when discussing digital transformation. My intent here is to underline the importance of execution in any digital transformation. As stated in the introduction and wonderfully explained in “The Web Giants”, quality of service is a crucial part of the customer experience. A company cannot be a digital leader without mastering the skills and the practices that are described in this book. The illustration on the right is taken from the Twitter launch of the NATF report on Artificial Intelligence and Machine Learning. What it says is that the Google SRE book should be part of the skill set of any company that aims to develop a leading service based on data and AI. Engineering matters.
Site Reliability Engineering is foremost an ambition about people and culture. Here is a very short summary about an ideal “End to End” operations organization:
  • It is made of small autonomous teams that orchestrate, for instance through production meetings as well as postmortems, all stakeholders involved with the delivery of one (end-to-end) service.
  • The team owns the responsibility of this service: availability (reliability), latency & throughput, capacity planning, change management, monitoring, incident management and recovery.
  • This should be a customer-centric organization which is both measure-driven (analytics) and adaptive (flexible). Its operations guideline should fluctuate using an “error budget” to deliver stable quality of service and “change throughput” at the same time.
  • This team is necessarily “progress-centric” and operates under continuous improvement using postmortems and root cause analysis. The team has the diverse skill set to own and understand the complexity of the systems that it operates.
  • The organization strives towards as much automation as possible. This requires both enrolling people with software engineering skills and letting them operate on production systems.
This book is mostly about reliability engineering but, as shown in the previous section, it is also a textbook about distributed systems engineering. Here is a short set of principles for building reliable distributed systems that I have extracted from this reading and which I will comment on in a future post (the reference to “Biology of Distributed Information Systems” – the title chosen for this blog 10 years ago – is quite deep):
  1. Leverage APIs to develop a multi-modal cell structure with variable rate of change, when change flows from the outside towards the inside of the whole system (like a cell)
  2. Organize complexity in layers where advanced systems are backed up by simpler “life-support” systems, the lower the complexity the higher the availability (biomimicry)
  3. Develop an event-driven flow architecture to design reactive systems that are scalable, reliable and open
  4. Use self-monitoring and “digital twin” to provide self-adaptation, self-optimization and self-healing
  5. Journey towards abstraction to move to “serverless” systems (necessary to combine SaaS, cloud and on-premise operations)
  6. Leverage lean thinking to deliver robustness and flexibility through more available capacity
  7. Don’t fight CAP (theorem) and develop a reliable, high-availability, eventually consistent data architecture (think of data consistency as a movie rather than a snapshot).

Sunday, July 22, 2018

Managing Complexity and Technical Debt : A Model for Quantifying Systemic Analysis

1. Introduction

Today’s post is a summer recreation/musing about how to model the effect of complexity and technical debt, in the same spirit as the previous “Sustainable IT Budget in an equation” post. I used the word “recreation” to make it clear that this is proposed as “food for thought”, because I have worked hard to make the underlying model really simple, without the intent of accuracy. I have worked on IT cost modelling since 1998 and have removed layers of complexity and precision over the years to make this model a communication tool. You may see the result, for instance, in my lecture at Polytechnique about IT costs.

This post is a companion to the previous “Sustainable Information Systems and Technical Debt” with a few (simple) equations that may be used to explain the concepts of sustainable development and complexity management to an audience who loves figures and quantified reasoning. It proposes a very crude model that should be used to explain the systemic behavior of IT costs. It is not meant to be used for forecasting.
What is new in this model and in this post is the introduction of change and the cumulated effect of complexity. The previous models looked at the dynamics of IT costs assuming an environment that was more or less constant, which is very far from the truth, especially in a digital world. Therefore, I have extended this simple model of IT costs in two directions:
  • I have introduced a concept of « decay » to represent the imperative for change. I am assuming that the value delivered by an IT system decreases over time following a classical law of “exponential decay”. This is a simple way to represent the need for « homeostasis », which is the need to keep in sync with the enterprise’s environment. The parameter for decay can be adjusted to represent the speed at which the external environment evolves.
  • I have also factored the effect of cumulated complexity into the model, as a first and simple attempt to model technical debt. This requires modelling the natural and incremental rise of complexity when IS evolves to adapt to its environment (cf. previous point) – as noticed before, there is a deep link between the rate of change and the importance of technical debt – as well as the representation of the effort to clean this technical debt.

This model makes use of Euclidean Scalar Complexity (ESC), a metric developed with Daniel Krob and Sylvain Peyronnet to assess integration complexity from enterprise architecture schemas. ESC is easy to compute and is (more or less) independent from scale. The technical contribution of this new model and blog post is to propose a link between ESC and technical debt. Although this link is somewhat naïve, it supports “business case reasoning” (i.e., quantifying common sense) to show that managing technical debt has a positive long-term return on investment.

This post is organized as follows. Section 2 gives a short summary of asset management applied to information systems, in other words, how to manage information systems through replacement and renovation rates. This is the first systemic lesson: investments produce assets that accumulate and running costs follow this accumulation. The proposed consequence is to keep the growth and the age of the assets under control. Section 3 introduces the effect of computing (hardware) resource management on IT costs. We make a crude distinction between legacy systems, where hardware is tied to the application in a way that requires a complete application re-engineering to benefit from better generations of computing resources, and modern virtualized architectures (such as cloud computing) that support the constant and (almost) painless improvement of these resources. This section is a simplified application of the hosting cost model proposed in my second book, which yields similar results, namely that Moore’s law benefits only show in IT costs if the applicative architecture makes it possible. Section 4 introduces the need for change and the concept of exponential decay. It shows why managing IT today should be heavily focused on average application age, refresh rate and software asset inertia. It also provides quantitative support for concepts such as multi-modal Information Systems and Exponential Information Systems. Section 5 is more technical since we introduce Information System complexity and the management of technical debt from a quantified perspective. The main contribution is to propose a crude model of how complexity increases iteratively as the information system evolves and how this complexity may be constrained through refactoring efforts at the enterprise architecture level.
Although this is a naïve model, at a macro scale, of technical debt, it supports quantified reasoning which is illustrated throughout this blog post with charts produced with a spreadsheet.

2. Asset Management and Information Systems

We will introduce the model step by step, starting with the simple foundation of asset management. At its simplest, an information system is a set of software assets that are maintained and renewed. The software process here is very straightforward: a software asset is acquired and integrated into the information system. Then it generates value and costs: hosting costs for the required computing assets, support costs and licensing costs. The previously mentioned book gives much more detail about the software assets TCO; this structure is also found in Volle or Keen. The software lifecycle is made of acquisition, maintenance, renewal or kill.

Since the main object of the model is the set of software assets, the key metric is the size of this asset portfolio. In this example we use the Discounted Acquisition Costs model, that is the sum of the payments made for the software assets that are in use (that have not been killed) across the information system’s history.  The IS budget has a straightforward structure : we separate the build costs (related to the changes in the asset portfolio : acquisition, maintenance, etc.) and the run costs (hosting, support and licensing). Run costs are expressed directly as the product of resource units (i.e. the size of the IS / software assets) and unit costs (these unit costs are easy to find and to benchmark).

The model for the Build costs is more elaborate since it reflects the IS strategy and the software product lifecycle.  To keep things as simple as possible we postulate that the information system strategy in this model is described as an asset management policy, with the following (input) parameters to the model:
  • The total IT budget
  • The percentage of Build budget that is spent on acquiring new assets (N%)
  • The percentage of Build budget that is spent on retiring old assets (K%)
  • The percentage of Build budget that is spent on renewals (R%)

With these four parameters, the computation of each yearly iteration of the IT budget is straightforward. The run costs are first obtained from the IS size of the previous year. The build budget is the difference between total IT budget and run costs. This build budget is separated into four categories: acquiring new assets, killing existing (old) assets, replacing old assets with newer ones (we add to the model an efficiency factor which means that renewals are slightly more efficient than adding a brand-new piece of software) and “maintenance”. Maintenance (for functional or technical reasons) is known to produce incremental growth (more-or-less, it depends on the type of software development methodology) which we can also model with a parameter.

The model is, therefore, defined through two key equations which tell how the asset portfolio changes every year in size (S) and age (A). Here is a simplified description (S’ is the new value of “assets size”, S is the value for the previous year, and G% is the growth parameter associated with maintenance):
  1. Growth of the assets:  S’ = S – Build x K% + Build x N% + Build x (1 – K% – N% – R%) x G%
  2. Ageing of the assets:  A’ = (A + 1) x (S – Build x (1 – N% – R% – K%)) / S’
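The yearly iteration for the size of the portfolio can be sketched in a few lines of Python. This is only an illustration of the growth equation under stated assumptions: the function and field names are mine, the unit cost (0.12) and the maintenance growth parameter (10%) are hypothetical values, and the ageing equation is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    new_pct: float     # N%: share of Build spent on acquiring new assets
    kill_pct: float    # K%: share of Build spent on retiring old assets
    renew_pct: float   # R%: share of Build spent on renewals
    growth_pct: float  # G%: growth produced by maintenance (hypothetical value)


def step_assets(size, it_budget, run_unit_cost, strategy):
    """Return (new_size, build, run) for one yearly iteration."""
    # Run costs are obtained from the size of the previous year.
    run = size * run_unit_cost
    # Build is what remains of the total IT budget.
    build = it_budget - run
    # Whatever is not spent on new, kill or renewal goes to maintenance.
    maintenance = build * (
        1 - strategy.kill_pct - strategy.new_pct - strategy.renew_pct
    )
    new_size = (
        size
        - build * strategy.kill_pct
        + build * strategy.new_pct
        + maintenance * strategy.growth_pct
    )
    return new_size, build, run


# "Standard" strategy from the text, with an assumed G% of 10%.
standard = Strategy(new_pct=0.05, kill_pct=0.03, renew_pct=0.50, growth_pct=0.10)
size = 1000.0
for year in range(1, 6):
    size, build, run = step_assets(size, 200.0, 0.12, standard)
    print(f"year {year}: size={size:.1f} build={build:.1f} run={run:.1f}")
```

Even in this toy run, the portfolio grows a little every year while the budget stays flat, so the run costs slowly eat into the build budget.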

Measuring software assets with Discounted Acquisition Costs has benefits and drawbacks. The obvious benefit is that it is applicable to almost all companies. The value that is used for discounting with age has a small effect on the overall simulation (and what will be said in the rest of the post). Typical values are between -3% and -10%. The drawback is that money is a poor measure of complexity and richness of the software assets. A better alternative is to use function points, but this requires a fair amount of effort, over a long period of time. When I was CIO of Bouygues Telecom, I was a strong proponent of function points but I found it very hard to make sure that measurement was kept simple (at a macro scale) to avoid all the pitfalls of tedious and arguable accounting. What I have found over the years is that it is almost impossible to use function points without strong biases. However, as soon as you have a reasonable history, it works very well for year-by-year comparisons. Used at a macro scale, it also gives good benchmarking “orders of magnitude”. There are many other alternatives (counting apps, databases, UI screens, …) that share the same benefits and drawbacks. Since I aim to propose something generic here, and because discussing with CFOs is a critical goal of such a model, using DAC makes the most sense.

The following chart gives an example of simulating the same information system, with the same constant budget, under three strategies:
  • The “standard strategy” is defined by N% = 5%, K% = 3%, R% = 50%. With 50% of the project budget spent on renewal, this is already a strong effort to keep the assets under control. The imbalance between what needs to be added (5% of the budget to meet new needs and a small effort for decommissioning) is quite typical.
  • The “clean strategy” is defined by N% = 5%, K% = 8%, R% = 60%. Here we see the effect of trying to clean up the legacy (more renewal effort, and more decommissioning).
  • The “careless” scenario is defined by N% = 5%, K% = 3% and R% = 30%. This is not a large change compared to the standard one, less effort is made on renewing assets so that more money can be spent on adapting the existing assets to the current needs.

These three figures illustrate some of the key lessons that have been explained before, so I will keep them short:
  • Changes that may seem small in the strategy (the values from the three scenarios are not all that different) produce significant differences over the years. Especially the “Build/Run” ratio changes significantly here according to the strategy that is picked.
  • Beware of accumulation, both in size and ageing, because once the problem accumulates it becomes hard to solve without a significant IT budget increase.
  • The effects of an aggressive cleanup asset management strategy are visible, and pay for themselves (i.e., they help free build money to seize new opportunities), but only after a significant period of time, because of the multiplicative effects. Therefore, there is no short-term business case for such a strategy (which is true for most asset management strategies, from real estate to wine cellars).

As I will explain in the conclusion, showing three examples does not do justice to the modelling effort. It is not meant to produce static trajectories, but rather to be played with in a “what if” game. Besides, there are many hypotheses that are specific to this example:
  • The ratio between the Build budget and the value of the asset portfolio is critical. This reflects the history of the company and differentiates between companies that have kept their software assets under control and those where the portfolio is too large compared to the yearly budget, because accumulation has already happened.
  • This example is built with a flat budget, but the difference gets bigger if there are budget constraints (additional project money is “free money” whereas “less project money” may translate to forced budget increase because renewing the SW assets is mandatory for regulatory reasons … or because of the competition).
  • The constants that I have used for unit costs are “plausible orders of magnitudes” gathered from previous instances of running such models with real data, but each company is different, and the numbers do change the overall story.
  • We do not have enough information here to discuss the “business case” or the “return on investment” for the “cleanup strategy”. It really depends on the value that is generated by IT and the “refresh rate” constraints of the environment, which is what we will address in Section 4.

3. Managing Computing Infrastructures to Benefit from Moore’s Law

In the previous model I have assumed that there is a constant (cf. the unit cost principle) cost of hosting a unit of software asset on computing resources. This is an oversimplification since hosting costs obviously depend on many factors such as quality of service, performance requirements and software complexity. However, when averaged over one or many data centers, these unit costs tend to be pretty “regular” and effective to forecast the evolution of hosting. There is one major difference that is worth modelling: the ability to change, or not, the computing & storage hardware without changing the software asset. As explained in my second book, “Moore’s Law only shows in IT costs if it is leveraged”, that is if you take advantage of the constant improvement of hosting costs. To keep with the spirit of a minimal model, we are going to distinguish two types of software:
  • Legacy: when the storage/computing configuration is linked to the application and is not usually changed until a renewal (cf. previous section) occurs. In this case, unit costs tend to show a slow decline over the years, due to automation and better maturity in operation, which is very small compared to the cost decrease of computing hardware. This was the most common case 20 years ago and is still quite frequent in most large companies.
  • Virtualized: when a “virtualization layer” (from VM, virtual storage to containers) allows a separation between software and computing assets. This is more common nowadays with cloud (public or private) architecture. Because it is easier to shift computing loads from one machine to another, hosting costs are declining more rapidly. This decline is much slower than Moore’s law because companies (or cloud providers) need to amortize their previous investments.

This split is easy to add to our asset model. It first requires tracking software assets with two lines instead of one (legacy and virtualized), and creating two sets of unit costs. These reflect the faster reduction of virtualized hosting costs, as well as the current difference, which is significant for most companies since virtualized loads tend to run on “commodity hardware” whereas legacy software often runs on specialized and expensive hardware (the most obvious example being the mainframe).

The software asset lifecycles are coupled as follows. First the kill ratio K% must be supplemented with a new parameter that says how much of the effort is made on the legacy portfolio compared to the virtualized one. Second, we assume here – for simplification – that all applications that are renewed are ported to a newer architecture leveraging virtualized computing resources. Hence the renewal parameter (R%) will be the main driver to “modernize” the asset portfolio.

The following curves show the difference between three strategies, similarly to the previous section. We have taken a generic example where the portfolio is balanced at first between legacy and virtualized computing architecture. In this run, the new ratio (N%) is raised to 10% to reflect a situation where more change is required. The difference is mostly about the renewal rate (respectively 40%, 50% and 30%) and the kill rate (respectively 3%, 5% and 3%). The figure shows the size of the two software asset portfolios (Virtualized and Legacy) as well as the average age.

The model was run here with conservative cost estimates on purpose. Readers are encouraged to run this type of computation with the real values of unit cost that they may observe in their own company. Even with this conservative setting, the differences are quite significant:
  • A sustained renewal effort translates into a more modern portfolio (the difference in average age is quite significant after 8 years) and lower run costs. It is easy to model the true benefits of switching to commodity hardware and leveraging virtualization.
  • The difference in the Build/Run ratio is quite high, because the run costs are much lower once a newer architecture is in place.
  • These efforts take time. Once again, the key “strategic agility” ratio is “build/run”. When the asset portfolio has become too large, the “margin for action” is quite low (the share of the build budget that can be freely assigned to cleanup or renewal once the most urgent business needs have been served).

To make these simulations easier to read and to understand, we have picked a “flat IT budget scenario”. It is quite interesting to run simulations where the “amount of build project” has a minimal value that is bound to business necessities (regulation or competition, as is often the case in the telecom market). With these simulations, the worst strategy translates into a higher total IT cost, and the gap increases over the years because of the compounded effects.

4. Managing Information Systems Refresh Rate to Achieve Homeostasis

I should apologize for this section title, which is a mouthful bordering on pretentious … however I have often used the concept of “homeostasis” in the context of digital transformation because it is quite relevant. Homeostasis refers to the effort of a complex system (for instance a living organism) to maintain an equilibrium with its environment. In our world of constant changes pushed by rapid technological evolution, homeostasis refers to the constant change of the information system (including all digital assets and platforms) to adapt to the changing needs of customers and partners. From my own experience as an IT professional for the past 30 years, this is the main change that has occurred in the past decade: the “refresh rate” of information systems has increased dramatically. This is an idea that I have developed in many posts from this blog.

This is also a key idea from Salim Ismail’s best-selling book “Exponential Organizations”, which I have used to coin the expression “Exponential Information Systems”. From our IT cost model perspective, what this says is that there is another reason – other than reducing costs and optimizing TCO – to increase the refresh rate. I can see two reasons that we can capture: first, the value provided by a software asset declines over time because the environment changes. Second, no software asset is isolated any more: each is part of a service (or micro-service) orchestration pattern that comes with its own refresh rate constraints. I have found that the best way to express both principles is to postulate that the value provided by a software asset declines over time following a classical exponential decay law: 
                Value = V0 x exp(- lambda x time)

This formula is easy to introduce in our model, and it adds a penalty to ageing software that is independent from the previously mentioned TCO effects on licensing, maintenance, or hosting.  The decay parameter (lambda) is a way to tell the model that “software must change regularly” to cope with the constant changes of its environment.
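As a small illustration of this decay law, the following Python sketch computes the residual value of an asset and the corresponding half-life. The function names and the decay rates are hypothetical choices for the example; as discussed below, real values of lambda are hard to calibrate.

```python
import math


def asset_value(v0, decay_rate, age_years):
    """Value = V0 x exp(-lambda x time), as in the formula above."""
    return v0 * math.exp(-decay_rate * age_years)


def half_life(decay_rate):
    """Age at which an asset has lost half of its initial value."""
    return math.log(2) / decay_rate


# Two hypothetical classes of assets: a fast-moving front end and a stable core.
for label, decay_rate in [("front end", 0.35), ("core", 0.07)]:
    print(
        f"{label}: half-life {half_life(decay_rate):.1f} years, "
        f"value after 5 years {asset_value(100.0, decay_rate, 5):.1f}"
    )
```

Expressing lambda as a half-life is often easier to discuss with business stakeholders: it reads as "after how many years is this application worth half of what it delivered at launch?".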

I will not offer any simulation or charts here, because it is actually quite difficult to find data about this “decay” (need for change) and it makes little sense to invent one for a generic simulation. Also, not all software assets are equal with respect to exponential decay. Software-as-a-service (SaaS) is precisely an answer to this need for constant adaptation. Therefore, should we plan to run simulations seriously, we should use different decay constants for different classes of assets. There are plenty of software costs benchmarking materials that may be used to calibrate or evaluate the unit costs for the two previous models but evaluating the “rate of decay” is much more difficult and very specific to each industry.

On the other hand, exponential decay / forced rate of change is a really useful concept when planning the long-term future of information systems. When one plays with the previous model augmented with decay, it becomes clear that the desired rate of change, which is a business imperative, is not compatible with the inertia that is exhibited by the model. This should be clear in the example shown in the previous section: the weight of the legacy portfolio is too high to obtain a homogeneous high refresh rate. In most large and established companies, the Build/Run ratio is worse than what is proposed here, which means that the ability to increase the refresh rate is even worse.

This situation leads naturally to the concept of “multi-modal IT”, including the famous “bimodal IT” pattern. A multi-modal information system is organized in an “onion structure” with a core and layers towards the periphery. Using APIs (Application Programming Interfaces), the onion may be organized so that the refresh rate is higher in the outside layers than in the core. This is a direct application from biology and a pattern that is used in many organization theories such as betacodex. This is also the foundation for Exponential Information Systems: use Enterprise Architecture and modularity to support different refresh rates for different parts of the information system. The core/shared/pivot/common data model, which describes the key business objects that are used throughout the multiple layers (multi-modal component), is a critical component of the Enterprise Architecture since it defines the agility through the API. A modular multimodal architecture is one that leaves most of the changes in the layers that were designed to accommodate high refresh rates. This means that modularity is a dynamic property of system architecture much more than a static one.

The lesson from this short section is that one should think really hard about the required refresh rates of each section of the information system. For each functional domain, a target refresh rate (which is equivalent to a target average age for software assets) should be set. This target is both a business and technical imperative which should be based on requirements from customers and partners, as well as constraints related to software ecosystems. Many software environments, such as developing a mobile application, come with their own refresh rate constraints because the key features provided by the ecosystem change constantly … while upward compatibility is approximate at best. Similarly, the ambition to leverage artificial intelligence and machine learning for a given business domain should translate into setting a high refresh rate target. Keep in mind that adapting to the outside environment is a business imperative: if the lack of architecture modularity and the inertia of the accumulated weight of the asset portfolio prevent upgrading the portfolio fast enough, the obvious solution is to grow this portfolio, resulting in added complexity and IT costs.

5. Information Systems Complexity and Technical Debt

This last section introduces complexity and its effect on TCO in our IT cost model. We have seen throughout this post that the accumulation of weight is a burden, we shall now see that the accumulation of complexity is also a handicap that translates into additional costs, namely integration, testing and support costs. Obviously, the complexity of information systems reflects the complexity of their mission and the complexity of their environment. Somehow, this complexity is where part of the business value and differentiation is created. What we want to focus on here is the excess of complexity, which is precisely the definition of technical debt.

To introduce complexity in our model, we need a way to define and measure it. We shall use “Euclidean Scalar Complexity” because it is easy to understand and has a set of unique properties that makes it the best candidate for managing complexity at the Enterprise Architecture scale. The principle is quite simple. Assume that you have an architecture schema of your information systems, with boxes and links. Assume that you have a weight w(b) for each of the boxes, and that the existence of a link on your schema represents an interaction between the two subsystems represented by the linked boxes. The weight could be the DAC measure of Section 2, the function points, or any additive metric that you like (i.e. w(a+b) = w(a) + w(b)). The Euclidean Scalar Complexity (ESC) of your architecture (abstraction of your system) is the square root of the sum of w(x) x w(y) for all pairs of boxes x, y that are either identical or joined with a link on the architecture schema. Normalized ESC means dividing the ESC value by the weight of the information systems, which yields a complexity ratio that we shall use in our model, a number between 0 and 1 (1 means that everything is connected and a value close to zero means that every piece is independent).
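The definition above fits in a few lines of Python. This is a minimal sketch, not the reference implementation of the metric: the box names and weights are invented, and I assume that a link counts for both ordered pairs (x, y) and (y, x), which is what makes a fully connected schema normalize to exactly 1.

```python
import math


def esc(weights, links):
    """Euclidean Scalar Complexity of a schema.

    weights: dict mapping each box to its (additive) weight.
    links: iterable of (x, y) pairs, treated as undirected links.
    """
    # Pairs of identical boxes (x, x) contribute w(x)^2.
    total = sum(w * w for w in weights.values())
    # Deduplicate links regardless of direction and ignore self-links.
    unique_links = {frozenset(link) for link in links}
    for link in unique_links:
        if len(link) == 2:
            x, y = link
            # Assumption: a link counts for both (x, y) and (y, x).
            total += 2 * weights[x] * weights[y]
    return math.sqrt(total)


def normalized_esc(weights, links):
    """ESC divided by the total weight: a ratio between 0 and 1."""
    return esc(weights, links) / sum(weights.values())


# Hypothetical schema: four applications of equal weight.
weights = {"crm": 1.0, "billing": 1.0, "portal": 1.0, "bus": 1.0}
point_to_point = [("crm", "billing"), ("crm", "portal"), ("billing", "portal"),
                  ("crm", "bus"), ("billing", "bus"), ("portal", "bus")]
hub_and_spoke = [("crm", "bus"), ("billing", "bus"), ("portal", "bus")]

print(normalized_esc(weights, point_to_point))  # everything is connected
print(normalized_esc(weights, hub_and_spoke))   # integration through the bus
print(normalized_esc(weights, []))              # independent pieces
```

The comparison between the point-to-point and the hub-and-spoke schema shows why the metric rewards integration patterns that reduce the number of direct interactions.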

ESC is one of the few metrics that is “scale invariant”, which is the first requirement for working at the whole IS schema (each architecture schema is a crude abstraction) – see the 2007 paper “Complexité des systèmes d’information: une famille de mesures de la complexité scalaire d’un schéma d’architecture” by CASEAU Y., KROB D., PEYRONNET S. Being scale invariant means that if a change of scale is applied to the representation (more boxes and more links to represent the same information at a different scale), the complexity does not change. Another short introduction to ESC is provided in my Polytechnique lecture. There is no need to understand ESC in detail to see how it can be used to extend our IT cost model to manage technical debt, but the key insight that can help is to see that ESC is foremost a (static) modularity measure. ESC works well because it captures the benefits of Enterprise Architecture patterns such as gateways, integration buses, API encapsulation, etc. ESC is a great metric to capture the complexity of micro-service architectures.

As explained earlier, we do not suppose that the ideal information system has no complexity, but rather that for a given set of assets, there exist two complexity ratios, Cmin and Cmax, that represent on the one hand the minimal achievable ESC (i.e. rebuilding the IS from zero using the maximal possible modularity) and on the other hand, the complexity that one would get by randomly adding and integrating software assets incrementally. We use the normalized ESC, so these two numbers are between 0 and 1. Not all architecture problems are equal: some information systems are easy to build, which translates into Cmin and Cmax being close. On the other hand, for many systems, the difference between Cmin, Cmax and C (the actual ESC ratio) is quite high. These two anchors make it possible to define technical debt as: w(S) x (C – Cmin) / (Cmax – Cmin)
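As a quick worked example of this definition, here is a minimal Python sketch. The function name and the numeric values are hypothetical; the formula is the one given above.

```python
def technical_debt(weight, c, c_min, c_max):
    """Technical debt = w(S) x (C - Cmin) / (Cmax - Cmin)."""
    if not c_min < c_max:
        raise ValueError("c_min must be strictly lower than c_max")
    return weight * (c - c_min) / (c_max - c_min)


# Hypothetical portfolio worth 1000 (in DAC units) with C = 0.5,
# between Cmin = 0.3 and Cmax = 0.7: half of the "avoidable" complexity
# is present, so the debt is about half of the portfolio weight.
print(technical_debt(1000.0, c=0.5, c_min=0.3, c_max=0.7))
```

Because the debt is expressed in the same unit as the asset weight, it can be discussed in the same terms as the rest of the budget.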

Introducing technical debt into our cost model means two things:
  • Measuring the impact of this additional complexity on costs.
  • Defining and measuring what “reducing the technical debt” (i.e., managing complexity) may mean, in terms of effect and cost.

The first part is easy, considering the simplicity of our model and the fact that we are just looking for simple orders of magnitude. In the same way that we have taken the effect of aging into account for licensing, we upgrade the following formulas:
  • Integration costs are proportional to the ESC : for a project with no complexity at all it would be zero (negligible compared to the cost of building / buying the new component) whereas for a truly complex (C = 1.0) system, integration (and testing) costs are equal to the acquisition costs. This is very debatable (because it is so simple and because integration costs may be even worse) but it gives a first plausible value.
  • Support costs are also impacted: a fraction of support costs is proportional to C. This fraction is different for each information system and depends on the nature of support activities (for instance, level 1 support is less impacted than level 3 support). Thus, this fraction is left as a model parameter.
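The two adjustments above can be sketched in Python as follows. This is only one plausible reading: the text does not say how the complexity-sensitive share of support is normalized, so the sketch assumes that this share scales linearly with C and reaches its nominal value at C = 1. The function and parameter names are hypothetical.

```python
def integration_cost(acquisition_cost, c):
    """Integration (and testing) cost, proportional to the ESC ratio C.

    Zero when C = 0, equal to the acquisition cost when C = 1.
    """
    return c * acquisition_cost


def support_cost(base_support, sensitive_fraction, c):
    """Support cost with a share that is proportional to C.

    base_support:       nominal support cost.
    sensitive_fraction: share of support affected by complexity
                        (a model parameter between 0 and 1).
    Assumption: the affected share is scaled by C, the rest is fixed.
    """
    fixed = (1.0 - sensitive_fraction) * base_support
    variable = sensitive_fraction * base_support * c
    return fixed + variable


# Hypothetical component: acquisition 100, nominal support 200,
# 40% of support sensitive to complexity, current ratio C = 0.5.
print(integration_cost(100, 0.5))
print(support_cost(200, 0.4, 0.5))
```

With these illustrative numbers, integration adds 50 to the acquisition cost and support comes to about 160. Setting sensitive_fraction to 0 recovers a model in which support ignores complexity.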

The second part is the true extension of the model and what makes this post a new contribution to IT cost modelling.  
  • Complexity without any refactoring effort evolves naturally towards Cmax as an asymptotic value (which is the definition of Cmax). The speed of evolution depends on how much of the information system is renewed each year, so the resulting complexity is a weighted average of the previous one and Cmax, where the weights are respectively the size of what is untouched in the portfolio and the size of what is added.
  • Refactoring is seen as a fraction of the renewal budget applied to reducing the complexity. The effect is described by a power law (diminishing returns on the money invested). The maximal effect (the asymptotic value) is getting the complexity to Cmin (also by construction). I have used a power law with degree 4, which seems to reproduce the empirical observation that refactoring efforts follow a law of strongly diminishing returns (the last 60% benefits cost 80% of the effort).
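The two dynamics above can be sketched in Python. The drift step follows the weighted average described in the first bullet. The refactoring step is an assumption: the exact power law is not given here, so the sketch uses 1 − (1 − effort)^4 as one possible degree-4 curve with diminishing returns, with effort normalized between 0 and 1. It is not necessarily the formula used in the spreadsheet, and all names are hypothetical.

```python
def drift(c, c_max, untouched, added):
    """Natural evolution of complexity without refactoring.

    Weighted average of the current ratio and Cmax, weighted by the
    size left untouched and the size added to the portfolio.
    """
    return (untouched * c + added * c_max) / (untouched + added)


def refactor(c, c_min, effort, degree=4):
    """Effect of refactoring on the complexity ratio.

    effort: normalized refactoring effort between 0 and 1.
    Assumption: the share of the gap (C - Cmin) that is closed is
    1 - (1 - effort) ** degree, a curve with diminishing returns.
    """
    if not 0.0 <= effort <= 1.0:
        raise ValueError("effort must be between 0 and 1")
    gap_closed = 1.0 - (1.0 - effort) ** degree
    return c - gap_closed * (c - c_min)


# One hypothetical year: 5% of the portfolio is added, then a small
# normalized refactoring effort is applied.
c = drift(0.5, c_max=0.6, untouched=95, added=5)
c = refactor(c, c_min=0.3, effort=0.02)
print(c)
```

Under this assumed curve the first units of effort close most of the gap, and complexity approaches Cmin only as the normalized effort approaches 1. Calibrating effort against an actual refactoring budget is left open, as it depends on your own data.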

The following illustration compares three IT strategies while taking technical debt into account. In this example we assume Cmin = 0.3 and Cmax = 0.6, with C = 0.5 at the present time. The new parameters compared to Section 1 are the amount of refactoring (F%) and how much of the kill effort is targeted towards legacy (small effect since the kill effort is small here).
  1. The “standard strategy” is defined by N% = 5%, K% = 3%, R% = 40% and F%=15%. Here 40% of the project budget is spent on renewal, and 15% of that money is reserved for refactoring. 70% of the kill effort is targeted towards legacy. The ESC ratio evolves from 0.5 to 0.489.
  2. The “intense strategy” is defined by N% = 5%, K% = 5%, R% = 50% and F% = 30%, with 80% of the kill effort put on legacy. This is a more sustained effort (since money applied to refactoring is not applied to adding new software assets). The complexity ratio evolves from 0.5 to 0.464.
  3. The “careless” scenario is defined by N% = 5%, K% = 3%, R% = 30% and clean = 5%. These values are quite representative of most IT strategies (cf. what we said earlier: 30% of renewal is already the sign of a managed portfolio, and many companies suffer from worse accumulation). Here the complexity moves from 0.5 to 0.509.

These simulations are only proposed as an illustration. What can be learned through repeated simulations is similar to what we said in Section 2:
  • Efforts to clean up the portfolio, to keep the age and the complexity under control, are long-term efforts but they have lasting effects.
  • Accumulation in complexity, as well as accumulation in weight, has the dual effect of increasing the run costs and reducing the agility (the effort to change the information system to add new capacities becomes higher).
  • Contrary to Sections 2 and 3, where the models may be calibrated with well-known unit costs, measuring the negative effect of complexity and the positive effect of refactoring is hard. What is proposed here is a simple and conservative model that has the benefit of showcasing the effects of both, but finding the right constants/parameters to match this model to your reality requires effort and will lead to some “guesstimates”.
  • The key benefit of keeping weight and complexity under control is to allow for a higher refresh rate, which is itself a business imperative that can be modelled through “value decay” (Section 4). In order to simulate the value created by an aggressive “keep the complexity under control” strategy, you need to work under a strong need for refresh (high decay). The weight of technical debt becomes a huge burden as soon as the environment requires constant updates to the information system.

6. Conclusion

The title for this blog post is: “Managing Complexity and Technical Debt: A Model for Quantifying Systemic Analysis”. There are two key ideas here: the first one is “quantifying” and the second one is “systemic”. Both are related to the fact that this proposed model is foremost a communication tool. I decided to build a quantified model because this is the best way to communicate with business managers. It does not mean that this model should be used for forecasting future IT costs; it is far too simplistic for that. Using models and simulation is a great way to communicate with CFOs; they are even more effective if they rely on simple parameters/KPIs such as unit costs that can be evaluated through benchmarking (cf. the Gartner paper on how CIOs can work with CFOs). To make this communication tool work for your case requires using your own data, as was said repeatedly.

The second benefit of working with a model is the benefit of simulation, which is a free benefit once you have built your own spreadsheet. Being able to play “what if” scenarios is critical to help your stakeholders understand the systemic nature of the IT costs:
  • Delays: good decisions may have long delays before showing their positive consequences
  • Accumulation: this is the main difficulty of managing information systems; once the cost problems are visible, they are much harder to solve.
  • Amplification (reinforcement): bad decisions produce waste that adds to the complexity and the overall weight, producing further problems along the way.

Although the topic is very different from the topics that I have covered using GTES (Game Theoretical Evolutionary Simulation), one can find the same pattern: a simple model is used to create simulations that are organized into a serious game from which systemic insights may be gained. As was stated earlier, the goal here is not to define the “best IT asset management strategy” using a crude model, but to gain a systemic understanding of why long-term complexity-constrained asset management policies must be implemented.

Here are the conclusions that I can draw from playing with this generic model (more detailed conclusions would require working on a specific instance):
  1. Keep the age of your software assets under control
  2. Keep the scope as small as possible (but not smaller, to paraphrase Einstein)
  3. Constantly refactor your information systems, at different scales.
