Biology of Distributed Information Systems: 2007

Saturday, October 20, 2007

Sustainable Enterprise Architecture

After five years spent as a CIO, I have developed the following conviction:
[1] Major architecture re-engineering projects are fuelled by pain (one could say : no pain, no gain :)).

Enterprise Architecture (EA) projects (I should rather say programs, as in a family of projects) require an unusual amount of effort and alignment throughout a long period of time. The alignment is even more difficult to achieve than the sustained level of effort, especially with a large IT organization. Alignment here means that a large group of people decide to make choices that are sub-optimal from their own viewpoint, in order to achieve a larger-scope goal. It may be a strange way to look at alignment but it makes sense: if the logical choice for each actor were to move in the same direction, there would be nothing to talk about :)

Introducing an EA scheme (what we French call "urbaniser" the information system) usually occurs because a blazing limitation of the IS has been found. It is too slow, not flexible or agile enough, not reliable enough and, most often, too expensive, etc. The level of pain is necessary to break through a "decision/action" threshold, since there are obvious risks. From a technology perspective, an EA relies on an integration infrastructure (ESB, EAI, ETL, and so on). From a culture perspective, new concepts and a new vocabulary are introduced.

What happens if the program is successful? After a while (it could be a long while :)) the pain recedes. As a consequence the alignment starts to weaken. This is not simply an internal/organizational issue for the IT department. Deploying an EA approach is a corporate endeavour, which requires from all business divisions a common wish to build a global system. The alignment here means that each division is ready to relinquish some of its interests for the common good.

Because of this weakening of the common resolve, the master plan becomes too heavy to carry and one returns to (some of) the previous faulty behaviors that caused the pain in the first place… I have been thinking about this for the last three years and I have come to build a second conviction:

[2] Service Oriented Architecture is the sustainable approach to develop a "well-architectured" information system over time.

This statement probably sounds lame and dull to anyone familiar with the SOA (Service Oriented Architecture) concept. All possible "claims to fame" have already been made where SOA is concerned :). I should first state a few caveats:

I do not mean SOA as a technical architecture framework (Web Services, ESB, …) but as a governance method for building reusable information system assets. I do not want to dwell on this today, I'll come back to it some other time (one may look at the discussion about the SOBA acronym to grasp this ambiguity).
I do not mean a unique, shared, common architecture. As I explained in my first book, I am a strong believer in diversity. Actually, one of the first papers to talk about "sustainable enterprise architecture", from Marten Schoenherr and Stephan Aier, precisely considered sustainability a benefit of a distributed approach. I refer my customary readers to page 98 of my book where I also develop this idea (e.g., the main benefit of SOA compared to earlier approaches, such as EAI, is the ability to decentralize the EA program).
I really push the analogy with "sustainable development" to the core: a sustainable EA approach is one that produces benefits without requiring so much effort from the culture, the people and the organization that it stops whenever the actors have the freedom to do so. This is really about people, and especially the relationship between business process owners and their IT providers.
I am discussing a large-scale enterprise and its information system as a whole. I consider the problem of "SOA at a departmental scale" solved (this is illustrated by the existence of so many successful implementations …). The sustainable alignment of a medium-size information system is not such a formidable task :)

Hopefully my second book will be available soon to English-speaking readers (since I have finished the translation). They will see that one of my central themes is that a "well designed IS" is a corporate responsibility, not something that may be left to the CIO. The CIO may take the leadership for a "special re-engineering" program/effort, but this cannot last. Eventually it is a matter of management culture (unless the CIO wants to become a "dictator" but he or she usually gets fired quickly if this temptation is too strong :)).

Pierre Bonnet recently opened a web site about this very topic: http://www.sustainableitarchitecture.com/home. His book about the same topic will be out next month. The web site is really interesting, together with the companion site about the Praxeme method. If you go and read through it (which I encourage you to do :)), you may think that it develops a similar line of ideas (obviously, with more details and more thought-through principles). I actually agree with everything … but I do not think that it truly reflects what sustainable development is really about: people. I am personally a big fan of the ACMS approach (Agility Chain Management System). Unfortunately (or fortunately, since not so many people may understand what it is :)), this is not where I see the issue for deploying a SOA Enterprise Architecture in a sustainable way. There is (rightfully so) a lot of talk about SOA governance nowadays. Unfortunately it remains complex and abstract, whereas the issue is the appropriation by all stakeholders in the company. I will return to this topic in further postings, since I believe that this (SOA governance) is the key to agility. I fear for those who will promise agility from the sole technical merits of a SOA architecture.

It turns out that there is a totally different meaning for "sustainable IT architecture" ! If one looks at the electric consumption of a data center, it is rising dangerously over the years (with respect to the double issue of energy price increase and greenhouse gas emissions). Electric consumption here includes both the powering of the computers and their cooling. Both tend to be proportional to the square of the processor frequency (one way to look at it, although the resistance decreases with the smaller scale designs). Both tend equally to be proportional to the amount of computation that is made, which is clearly growing fast in most companies.

This is why Google sees energy consumption as a key issue. For instance, read this newspaper article to find out about Urs Hoelzle's approach to reducing the server electricity footprint (more technically-savvy readers may look at this :)). Since then, Hoelzle has said Google is looking into neutralizing its carbon emissions by the end of the year.

The link with architecture is as follows. The simplest way to increase the computing power without increasing the consumption is to use massive parallelism. I don't have time to go into details today. One may look at the StorageMojo web site to get a lot of interesting stuff.
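To make the argument concrete, here is a small back-of-the-envelope sketch. It takes the rule of thumb stated above (power roughly proportional to the square of the frequency) at face value and assumes, optimistically, that the workload parallelizes perfectly; the function names and the normalized units are illustrative only.

```python
def relative_power(cores: int, frequency: float) -> float:
    """Power drawn by `cores` identical processors, assuming power per core ~ frequency**2."""
    return cores * frequency ** 2


def relative_throughput(cores: int, frequency: float) -> float:
    """Idealized throughput, assuming perfectly parallel work with no coordination overhead."""
    return cores * frequency


# One fast processor versus four processors running at a quarter of the frequency.
# Each pair is (throughput, power), both normalized to the single fast processor.
single = (relative_throughput(1, 1.0), relative_power(1, 1.0))
quad = (relative_throughput(4, 0.25), relative_power(4, 0.25))
```

Under these assumptions the four slow processors deliver the same throughput for a quarter of the power. Real workloads rarely parallelize perfectly, so this is an upper bound on the gain.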

I have just finished Ray Kurzweil's book "The Singularity Is Near" (2005). As usual, this is a fascinating book, especially from this "sustainable development of IT" perspective. From a general perspective, it is a refreshing view from a "technology optimist" which offers a clear break from the prophets of doom. As far as computing is concerned, Ray Kurzweil offers hope for software designers to be able to use much, much faster hardware (although, if you read the book, you'll see that the name may no longer be appropriate :)), something that I am dreaming of each time I run one of my "game theory simulations" :)

Ray Kurzweil's optimism does not cancel the validity of Google's concerns (different time scale). One might say, then, that a sustainable architecture needs to run on a grid-like structure (or any other form of massively parallel system architecture).

Thursday, October 11, 2007

Lean Information Systems

Lean Manufacturing is a powerful concept, which is often misunderstood. It was made popular by Toyota’s implementation and the vision of Taiichi Ohno (one of Toyota’s charismatic leaders). A very simple way to explain what it is would be to compare two production shops:

one shop is organized so that each machine is run at optimal capacity, in its best operating conditions. Buffers are introduced and the transport between machines is a little longer (so as to set up each machine optimally)

the second shop is organized so that the flow is shortened as much as possible. Buffers are reduced (and eliminated as much as possible) and the transport is optimized. The consequence is that each machine is no longer working optimally. Some are underutilized and others are working in operating modes that do not yield the best productivity.

What does Lean Manufacturing (and experience) say ? Obviously the first shop costs less to operate (cost for producing one unit) on paper, but unless it operates in an ideal world with no variations at all, it actually costs more in real life. The second approach costs less from an inventory perspective, but mostly it is more flexible (with respect to priority changes) and more robust (with respect to load variations).

Let us now consider two information systems, from our scope of large-scale, distributed information systems (many parallel nodes running business processes) :

The first one has been designed so that each node is running close to its optimal capacity. A node here may be a group (cluster, farm, blade) of servers that run services which are the elementary components of the business processes. The computing power of the node is designed so that the node is running at 85% capacity when the load is full (i.e. when the business processes are running at their maximal expected load).

The second one has been designed to speed up the process running and to avoid "queuing waiting time". Hence the computing power of the node is adjusted so that the average utilization ratio is closer to 50%

Here also, the first data center is clearly cheaper to build than the second one. The second one has a few advantages: a better SLA (service level agreement) may be promised to the customer (tighter = faster guaranteed response time), and the upgrade process (when the company grows) may be planned in a more regular way ... but let us assume that these are not compelling advantages. That is, let us suppose that the customer accepts the two different SLAs:

in the first case, the SLA is such that the target response time will be obtained 98% of the time under regular business conditions.

in the second case, the SLA is also such that the target response time will be obtained 98% of the time (the target itself being a smaller number than in the first case).

I ran some interesting computing experiments last month to see how these two data centers would behave when "a little stress occurs". Stress here may come from one node being unavailable, from a process overload, or from a higher-than-usual variation in the processing load. Anyone who has any experience with operations will recognize that these are the common issues of day-to-day production life.

These experiments were reported in a talk that I gave at the "Colloque d'Automne du LIX", from which I have extracted the last slide:

You may find the complete presentation on the CAL web site. To keep things simple, the curves describe the behavior of the systems (1) and (2), under different stress scenarios. The different curves correspond to different strategies of "adaptive middleware" (recall that I have this interest for autonomic computing :)). What matters here is that the lower curve reproduces the strategy that ALL existing systems use today (first-come, first-served). What you may see is a tremendous difference:

The lean IS (on the left) actually does very well under stress. Only the loss of a node creates a real problem (and it is not major, the SLA drops to 75%)
The loose IS (on the right) is definitely not robust. The stress conditions cause a significant drop of the SLA (down to 20% !).

There is another way to say it: if your IS is run in such a way that message queues are often full of pending requests, setting up a proper SLA is a very difficult job, because predicting the behaviour (response time) of an overloaded queueing system is hard science. It is not enough to add reasonable margins (such as promising a 10-minute response time because the average processing time is 1 minute).
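The sensitivity to utilization can be illustrated with the simplest textbook model. The sketch below assumes a single-server M/M/1 queue (Poisson arrivals, exponential service times, first-come first-served); this is a deliberate simplification and not the simulation model used in the experiments reported here. In that model the mean response time is S / (1 - rho), where S is the mean service time and rho the utilization, and the response time is exponentially distributed.

```python
import math


def mean_response_time(service_time: float, utilization: float) -> float:
    """Mean time spent in an M/M/1 system (waiting + service): S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilization)


def response_time_quantile(service_time: float, utilization: float, q: float) -> float:
    """Response time that is not exceeded with probability q (M/M/1, first-come first-served)."""
    return -math.log(1.0 - q) * mean_response_time(service_time, utilization)


if __name__ == "__main__":
    # Nominal load, then the same node with 10% more traffic.
    for rho in (0.50, 0.55, 0.85, 0.935):
        mean = mean_response_time(1.0, rho)
        p98 = response_time_quantile(1.0, rho, 0.98)
        print(f"utilization {rho:.3f}: mean {mean:6.2f} min, 98% quantile {p98:6.2f} min")
```

With a 1-minute average service time and 85% utilization, the 98% quantile in this model is roughly 26 minutes, so a 10-minute promise would be missed far more often than 2% of the time. A 10% increase in load barely moves the node running at 50%, but more than doubles the mean response time of the node running at 85%.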

There is nothing new here. This experiment confirms what experience or intuition shows. What is interesting (and what surprised me) is the HUGE difference that the computing experiment reveals.

I plan to do similar experiments within the (global) enterprise context. I need a model that links the behavior of the IS with that of the company itself. Fortunately, I can rely on the great work (and models) just released by the CEISAR.

The CEISAR is a French initiative, under the patronage of the Ecole Centrale, to create a repository of models and practical knowledge about Enterprise Architecture. A first gem is their global model (follow "main concepts" then "Core Business System" on their web site), an attempt to define Enterprise Architecture with 10 key concepts. Another extremely useful piece that is part of the first release is a document about entity modeling. In one of my books I complained that this type of knowledge was not accessible (and could only be obtained from experience). It is nice to see real-life-experts, such as Jean-René Lyon, share their knowledge about such topics.

I definitely plan to adhere to CEISAR's terminology and framework for my own future work about IS architecture. One of the most pressing issues (as I have already testified on this blog) is to build a framework/model to explain, discuss and simulate data distribution and synchronization protocols. The only way to make this a relevant topic is to keep a very broad perspective, one that includes a model of the coupling between IS and business. The nice conclusion is that this type of work falls neatly between my two topics of interest (cf. my other blog) : IS efficiency and Enterprise efficiency.

Sunday, June 24, 2007

Long term Research Agenda

I have published my research agenda on my web site.

I will resume my work on OAI this summer to prepare a lecture that I need to give in October. However, my first priority is to translate my book into English.

I am currently finishing my computational experiments on Social Networks. Contrary to what I posted in the first message of this blog, there is a fair amount of common ground between my two research topics. Mostly, the importance of information propagation ... how is it relevant to business performance, and how does information technology help to achieve it.

The idea that fast is better, that reactivity is a crucial quality ... is everywhere. However, as the study of LeanSixSigma from an operations research's viewpoint shows, this is not so obvious. Lean-ness, Speed, Reactivity come at a price. It is easy to be convinced that the price is small compared to the benefits, but this argument is rarely heard.

This is a thread of thought that I will follow ... It is related both to my old job about Information Systems Efficiency and to my new job of VP in charge of business process optimization and total quality management.

Saturday, May 5, 2007

Complex Systems and Autonomy

This blog is “moving” very slowly because I am currently much more involved with the theme “Information flow and Enterprise architecture”. I am currently working on Social Networks, and it turns out that there are a number of themes which are equally relevant to my two research interests.
For instance:

I will soon post a review of Robert Axelrod’s book “The Complexity of Cooperation”, which could appear in both my blogs. This book introduces a few key concepts, such as the use of game theory and genetic algorithms to study patterns of (stable) cooperation.
I attended a few sessions of the conference on “Complex Systems”, which I thought was appropriate considering that DIS (Distributed Information Systems) are indeed complex systems, and it turns out that social networks, including those that represent the information flows within a company, are the focus of “complex system approaches” as well. The 7th PCRD has made “socially intelligent IT” a hot priority.

My program nowadays is mostly to continue my education and read books. I will give a lecture in October for which I would like to extend the OAI research I have already talked about. More precisely, here are the two directions that I will pursue this summer when I resume my experimentation work:

Explore different scales and designs of systems to see what kind of influence they have on the behaviour of “smart” routing rules for process control (through message passing). Most of the experiments that I have run so far try to reproduce business processes from Bouygues Telecom. Since I am using different “Enterprise Simulation Scenarios” in my SIFOA experiments, I plan to reuse this modelling effort. By scale & design, I mostly mean the number of processes, the number of tasks and systems involved in a process, and the interaction topology induced by the SOA architecture.
Try to model the handling of exceptions, that is, to model the alternate processing path which is used once a component is unavailable to deliver a business process. The difficulty here is that it is often an ad hoc approach (cf. my papers and my books). On the other hand, with the current trend of virtualization, more systematic alternate approaches will become available. I attended an interesting lecture on Autonomic computing from IBM which made the obvious-but-profound point that there is no autonomy without choice, and that choice comes (mostly, in the world of IT) from virtualization. There is little room for improving the handling of exceptional situations and failures with autonomic computing if a traditional architecture (hard links between dedicated resources) is used. On the other hand, in a virtualized world, there are many interesting options (hence a choice) when a server becomes unavailable.

This means that I am currently focusing on the points (1) and (2) of my previous list.
Following the suggestion from Cedric Nicolas, I just read “The Age of Spiritual Machines” by Ray Kurzweil. It is a fascinating book, with very insightful thoughts, especially about consciousness and intelligence in a machine (or a network of machines). I am less enthusiastic about his predictions of the future (2009-2019-2099), but they are only a fourth of the book and the first three quarters are enlightening. I will return to the topic of consciousness when I post a review of Kevin Kelly’s book. I also finished a book by Jean Claude Ameisen, “La sculpture du vivant” (following a suggestion from Pierre Haren), which is totally different (a biology book about programmed self-destruction of cells) but equally fascinating. This type of biology book makes it even more convincing that large-scale man-made systems must draw their inspiration from living organisms. There are a number of mechanisms designed to yield both stability and flexibility/adaptability (somewhat antagonistic properties) which are worth reproducing.

Sunday, February 11, 2007

Five Challenges for Enterprise Architectures

I have previously defined the subject of this blog in a very general manner, trying to look at the “big picture” (almost from an epistemological point of view). The “big question” is : how much autonomy and “intelligence” need to be fed into an information system in order to achieve the properties that are expected - resilience, performance, reactivity … and so forth. The reasoning is as follows. Large-scale information systems are, truly, complex systems which exhibit all the classical properties of such systems : fractal/recursive structure, non-linear behaviour when they receive an incoming flow of events (cf. the interesting amplification loops of acknowledge/resend messages due to fault-tolerant mechanisms), “emergence” of large-scale properties that are different from those at the component scales, etc. This is why I will use this blog as a forum to relate my (slow and progressive) journey into the world of complex systems and their theory.

Today I will look at the topic from the other side, wearing my former “CIO hat”. I will describe five architectural/design challenges that face a modern (large) enterprise. The following is a list of the issues that I struggled with during my five-year tenure, and which, I believe, are relevant to many companies with similar scale (size of IS) and scope (role of IS within the business processes). This is not to say that these are “open problems”. Indeed, we found solutions to each of those, as did other corporations. On the other hand, no definite answer has been proposed yet :

these are ad hoc solutions and they leave some questions un-answered,
the state-of-the-art, as defined in books or architectural frameworks that are sold by consultants, is surprisingly shy on these topics,
the academic research is only touching the “surface” of these problems.

These are very practical questions; one of the goals of my research is to build a link with the more forward-looking / science fiction / philosophical discussion on autonomic and intelligent systems. As a matter of fact, one of my long-term goals is to address these issues in my computational experiments, such as those I made for the OAI research :

Self-adaptive middleware: Supporting business process priorities and service level agreements
Advanced Engineering Informatics, Volume 19, Issue 3, July 2005, Pages 199-211

Here is a short description of each of them, I will return with a more detailed description on some other occasion.

(1) How does one guarantee the Quality of Service, defined at the business process level, from the system-level properties of the individual components? This is a key challenge for any IS architecture framework, including service-oriented architecture (SOA). I have coined the OAI term to describe this issue and a simple description may be found on my web site (http://claire3.free.fr/yves_research.htm), while a more detailed presentation is included in the afore-mentioned AEI paper. This is clearly an open-ended issue; my preliminary work has led to more questions than answers.
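A first-order illustration of why this is hard: the sketch below assumes a process made of sequential steps whose SLA compliance is statistically independent. Both are simplifying assumptions, and the function names are illustrative. It computes the probability that every step meets its own target, which is a pessimistic bound on the process-level SLA when the process target is the sum of the step targets.

```python
from math import prod
from typing import Sequence


def process_sla(component_slas: Sequence[float]) -> float:
    """Probability that every step meets its target, assuming independent sequential steps."""
    return prod(component_slas)


def required_component_sla(process_target: float, steps: int) -> float:
    """Per-step SLA needed so that `steps` identical, independent steps reach the process target."""
    return process_target ** (1.0 / steps)


# A five-step process where each component honours its own target 98% of the time.
five_steps = process_sla([0.98] * 5)

# What each component would have to guarantee for the process itself to reach 98%.
per_step = required_component_sla(0.98, 5)
```

Five components at 98% yield only about 90% at the process level by this bound, and reaching 98% for the process would require each component to be above 99.5%. Real components are not independent (they share load peaks and failures), which is part of what makes the question open-ended.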

(2) What is the proper approach to achieve resilience (lowest possible impact of system-scale failures of one or many components) ? This is where the biology comes in (cf. my previous message) : the mechanical view of robustness through redundancy (multiple copies and spare parts) shows its limits in the real world, and, most often, real-life crises are resolved through alternate scenarios (an “organic approach”). It turns out that a number of alternate approaches usually already exist: an older system that has not been disconnected yet, a simplification of the business process that is still acceptable, a different component that may render a simplified version of the service, a raw computer utility/patch/batch that fixes 80% of the problem (measured in dollars) with a fraction of the effort (measured in function points), etc.
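The “alternate scenario” idea can be sketched as an ordered chain of providers for the same business service, from the nominal component down to a degraded substitute. Everything in the example (the provider names, the `ServiceUnavailable` exception, the shape of the request) is hypothetical and only meant to show the pattern.

```python
from typing import Callable, Dict, Sequence, Tuple


class ServiceUnavailable(Exception):
    """Raised by a provider that cannot currently deliver the service."""


Provider = Tuple[str, Callable[[Dict], Dict]]


def deliver(request: Dict, providers: Sequence[Provider]) -> Tuple[str, Dict]:
    """Try each provider in order of preference; return who answered and the result."""
    failures = []
    for name, handler in providers:
        try:
            return name, handler(request)
        except ServiceUnavailable as exc:
            failures.append(f"{name}: {exc}")
    raise ServiceUnavailable("; ".join(failures))


# Hypothetical providers for an invoicing step.
def new_billing_system(request: Dict) -> Dict:
    raise ServiceUnavailable("node down")  # simulate the failure of the nominal component


def legacy_billing_system(request: Dict) -> Dict:
    return {"order": request["order"], "mode": "legacy"}  # older system, not yet disconnected


def nightly_batch(request: Dict) -> Dict:
    return {"order": request["order"], "mode": "deferred"}  # raw batch that fixes most cases later


chain = [
    ("new system", new_billing_system),
    ("legacy system", legacy_billing_system),
    ("nightly batch", nightly_batch),
]
used, result = deliver({"order": 42}, chain)
```

The interesting design question is not the loop itself but who decides the order of the chain and which degraded results the business process owner is willing to accept.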

(3) What kind of data architecture is best suited to distribute business objects in a coherent and efficient way ? Business objects need to be distributed in a large-scale system for performance reasons, but they participate in business process executions, which require some kind of coherence. This leaves roughly three options:

Assume a separate mechanism that will ensure the coherence of the distributed objects (precisely, a distributed database management system - DDBMS :)). On a small-scale system, or with a homogeneous system, this is obviously the most logical approach. We sub-contract this coherence issue to another system, and run the business processes assuming that distributed objects are coherent. It turns out to be difficult (the so-called “snapshot problem” of DDBMS cannot be solved in its full generality) and quickly impractical once one builds an information system out of heterogeneous COTS (commercial, off-the-shelf) software components.

Take responsibility for “business object distribution & coherence” as part of the business process management. In other words, ensure that the business process flow pushes all relevant updates so that it guarantees that, as far as the execution of the process is concerned, the objects always are in a coherent state. The synchronization of business process events and object management events is, indeed, a tricky issue.

Define an acceptable level of “chaos”, that is accept that complete coherence is not necessary ! This is actually a generalization of the previous approach, and is closer to reality (real large-scale systems survive with a fair amount of non-synchronization).

Finding a proper approach that is robust to errors, (i.e., contains a proper long running transaction mechanism) is, as far as I can see, both a truly practical question (this problem is everywhere, although many fail to recognize it) and a difficult one.
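The second and third options can be contrasted with a toy model. It assumes that each system keeps versioned copies of business objects, that a process step pushes updates only to the systems it involves, and that “chaos” is measured as a version gap between copies. These are illustrative assumptions, and the system names are hypothetical; no error handling or long-running transaction is modelled.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Tuple


@dataclass
class Replica:
    """Copies of business objects held by one system: object id -> (version, data)."""
    name: str
    objects: Dict[str, Tuple[int, Any]] = field(default_factory=dict)

    def apply(self, object_id: str, version: int, data: Any) -> None:
        # Ignore stale or duplicate updates so that pushes can be safely repeated.
        if version > self.version_of(object_id):
            self.objects[object_id] = (version, data)

    def version_of(self, object_id: str) -> int:
        return self.objects.get(object_id, (0, None))[0]


def push_update(object_id: str, version: int, data: Any, replicas: Iterable[Replica]) -> None:
    """Option 2: the process step pushes the update to every system it involves."""
    for replica in replicas:
        replica.apply(object_id, version, data)


def version_lag(object_id: str, replicas: List[Replica]) -> int:
    """Option 3: quantify the 'chaos' as the gap between the freshest and the stalest copy."""
    versions = [replica.version_of(object_id) for replica in replicas]
    return max(versions) - min(versions)


crm, billing, provisioning = Replica("crm"), Replica("billing"), Replica("provisioning")
everyone = [crm, billing, provisioning]
push_update("customer-7", 1, {"plan": "basic"}, everyone)
# A plan-change process only involves CRM and billing, so only they are updated.
push_update("customer-7", 2, {"plan": "premium"}, [crm, billing])

lag_in_process = version_lag("customer-7", [crm, billing])
lag_overall = version_lag("customer-7", everyone)
```

Here the copies are coherent from the point of view of the running process (lag 0) while the system as a whole tolerates one version of drift, which would then be bounded by whatever level of “chaos” is deemed acceptable.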

(4) What is the nature of service-level contracts which will yield the ability to evolve in a flexible and distributed manner? This is the quest for “upper compatibility” when designing service interfaces (for instance in a SOA approach), which should enable a truly distributed evolution of the IS as a whole. My experience is that this is a difficult topic, and that large IS projects are often necessary when one component is upgraded from one version to another – a lot of interface work and a lot of non-regression testing.
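One common way to obtain some upward compatibility is the “tolerant reader” style: a consumer requires only the fields it actually uses, supplies defaults for optional ones, and ignores anything it does not know. The sketch below illustrates that style with a hypothetical customer message; it is one possible technique, not a full answer to the contract question.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Customer:
    customer_id: str
    name: str
    segment: str = "standard"  # optional in the contract, so a default is provided


def read_customer(message: Dict[str, Any]) -> Customer:
    """Tolerant reader: take the required fields, default the optional one, ignore the rest."""
    return Customer(
        customer_id=message["customer_id"],
        name=message["name"],
        segment=message.get("segment", "standard"),
    )


# A message from an older provider, and one from an upgraded provider with extra fields.
old_message = {"customer_id": "C1", "name": "Alice"}
new_message = {"customer_id": "C1", "name": "Alice", "segment": "premium", "loyalty_points": 120}

from_old = read_customer(old_message)
from_new = read_customer(new_message)
```

With this style the provider can add fields without forcing every consumer to upgrade at the same time; removing or renaming a required field still breaks the contract and needs explicit versioning.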

(5) How does one achieve a modular architecture that still takes advantage of the opportunity to share and reuse services? In the world of SOA, this translates into the definition of a common, global directory of shared services. There is an obvious (and classical) tension between distribution and centralization. A common shared service is the opportunity to reduce cost, reduce size (hence complexity) and ensure more coherence. On the other hand, it may render the evolution of the system more difficult, and certainly requires mechanisms to operate the community of stakeholders (who use the service), to govern the roadmap of the shared component. It is also essential (as for large-scale software) to introduce some level of abstraction and encapsulation, which prevents a unique, common and shared directory of all services. The structure that defines a distributed directory of services with its proper governance mechanism (with visibility and decision rules) is yet to be defined to implement an enterprise-wide service-oriented architecture.
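As a purely illustrative data structure (not a proposal for the governance mechanism itself), a distributed directory with a simple visibility rule could look like the sketch below: each domain owns its directory, and only services explicitly marked as shared are visible to other domains through a common parent. The class, service names and endpoints are all hypothetical.

```python
from typing import Dict, Optional, Tuple


class Directory:
    """Directory of services for one domain, optionally attached to a parent directory."""

    def __init__(self, name: str, parent: Optional["Directory"] = None) -> None:
        self.name = name
        self.parent = parent
        self._entries: Dict[str, Tuple[str, str]] = {}  # service -> (owner, endpoint)

    def _register(self, service: str, owner: str, endpoint: str) -> None:
        self._entries[service] = (owner, endpoint)

    def publish(self, service: str, endpoint: str, shared: bool = False) -> None:
        """Register a service locally; shared services are also exposed in the parent."""
        self._register(service, self.name, endpoint)
        if shared and self.parent is not None:
            self.parent._register(service, self.name, endpoint)

    def lookup(self, service: str) -> Optional[Tuple[str, str]]:
        """Visibility rule: look in the local directory first, then up the parent chain."""
        directory = self
        while directory is not None:
            if service in directory._entries:
                return directory._entries[service]
            directory = directory.parent
        return None


enterprise = Directory("enterprise")
sales = Directory("sales", parent=enterprise)
billing = Directory("billing", parent=enterprise)

billing.publish("GetInvoice", "billing://invoice", shared=True)  # reusable across domains
billing.publish("RecomputeTax", "billing://tax")                 # encapsulated, domain-private
```

In this sketch `sales.lookup("GetInvoice")` resolves to the billing-owned service while `sales.lookup("RecomputeTax")` returns `None`. The decision rules (who may mark a service as shared, who governs its roadmap) are exactly the part that remains to be defined.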

Since one of my current goals is to draw as much relevant information as possible from the literature on complex systems, I will regularly post my findings and try to explain how they relate to these five questions.
