Saturday, October 20, 2007

Sustainable Enterprise Architecture

After five years spent as a CIO, I have developed the following conviction:
[1] Major architecture re-engineering projects are fuelled by pain (one could say: no pain, no gain :)).

Enterprise Architecture (EA) projects (I should rather say programs, as in a family of projects) require an unusual amount of effort and alignment over a long period of time. Alignment is even more difficult to achieve than a sustained level of effort, especially in a large IT organization. Alignment here means that a large group of people decide to make choices that are sub-optimal from their own viewpoint in order to achieve a larger-scope goal. It may be a strange way to look at alignment, but it makes sense: if the logical choice for each actor were to move in the same direction, there would be nothing to talk about :)

Introducing an EA scheme (what we French call "urbaniser" the information system) usually happens because a glaring limitation of the IS has been found: it is too slow, not flexible or agile enough, not reliable enough and, most often, too expensive. That level of pain is necessary to cross the "decision/action" threshold, since such a program carries obvious risks. From a technology perspective, an EA relies on an integration infrastructure (ESB, EAI, ETL, and so on). From a culture perspective, new concepts and a new vocabulary are introduced.

What happens if the program is successful? After a while (it could be a long while :)) the pain recedes, and as a consequence the alignment starts to weaken. This is not simply an internal, organizational issue for the IT department. Deploying an EA approach is a corporate endeavour, which requires a common wish from all business divisions to build a global system. Alignment here means that each division is ready to relinquish some of its interests for the common good.

Because the common resolve weakens, the master plan becomes too heavy to carry and one returns to (some of) the faulty behaviors that caused the pain in the first place… I have been thinking about this for the last three years and have come to a second conviction:

[2] Service Oriented Architecture is the sustainable approach to develop a "well-architectured" information system over time.

This statement probably sounds lame and dull to anyone familiar with the SOA (Service Oriented Architecture) concept. Every possible "claim to fame" has already been made where SOA is concerned :). I should first state a few caveats:

  • I do not mean SOA as a technical architecture framework (Web Services, ESB, …) but as a governance method for building reusable information system assets. I do not want to dwell on this today; I'll come back to it some other time (one may look at the discussion about the SOBA acronym to grasp this ambiguity).
  • I do not mean a unique, shared, common architecture. As I explained in my first book, I am a strong believer in diversity. Actually, one of the first papers to talk about "sustainable enterprise architecture", by Marten Schoenherr and Stephan Aier, precisely considered sustainability a benefit of a distributed approach. I refer my regular readers to page 98 of my book, where I also develop this idea (i.e., the main benefit of SOA compared to earlier approaches such as EAI is the ability to decentralize the EA program).
  • I really push the analogy with "sustainable development" to the core: a sustainable EA approach is one that produces benefits without demanding so much effort from the culture, the people and the organization that it stops as soon as the actors are free to stop it. This is really about people, and especially about the relationship between business process owners and their IT providers.
  • I am discussing a large-scale enterprise and its information system as a whole. I consider the problem of "SOA at a departmental scale" solved (as illustrated by so many successful implementations …). Sustaining alignment for a medium-size information system is not such a formidable task :)

Hopefully my second book will soon be available to English-speaking readers (I have finished the translation). They will see that one of my central themes is that a "well designed IS" is a corporate responsibility, not something that may be left to the CIO. The CIO may take the lead in a "special re-engineering" program or effort, but this cannot last. Eventually it is a matter of management culture (unless the CIO wants to become a "dictator", but he or she usually gets fired quickly if this temptation is too strong :)).

Pierre Bonnet recently opened a web site about this very topic, and his book on the same subject will be out next month. The web site is really interesting, together with the companion site about the Praxeme method. If you go and read through it (which I encourage you to do :)), you may think that it develops a similar line of ideas (obviously with more details and more thought-through principles). I actually agree with everything … but I do not think that it truly reflects what sustainable development is really about: people. I am personally a big fan of the ACMS approach (Agility Chain Management System). Unfortunately (or fortunately, since not so many people may understand what it is :)), this is not where I see the issue for deploying an SOA Enterprise Architecture in a sustainable way. There is (rightfully so) a lot of talk about SOA governance nowadays. Unfortunately it remains complex and abstract, whereas the issue is appropriation by all stakeholders in the company. I will return to this topic in later postings, since I believe that this (SOA governance) is the key to agility. I fear for those who promise agility on the sole technical merits of an SOA architecture.

It turns out that there is a totally different meaning for "sustainable IT architecture"! The electric consumption of data centers has been rising dangerously over the years, which matters given both rising energy prices and greenhouse gas emissions. Electric consumption here includes both powering the computers and cooling them. Both tend to be proportional to the square of the processor frequency (one way to look at it, although the resistance decreases with smaller-scale designs). Both are also proportional to the amount of computation performed, which is clearly growing fast in most companies.

This is why Google sees energy consumption as a key issue. For instance, read this newspaper article to find out about Urs Hoelzle's approach to reducing the electricity footprint of servers (more technically savvy readers may look at this :)). Since then, Hoelzle has said that Google is looking into neutralizing its carbon emissions by the end of the year.

The link with architecture is as follows: the simplest way to increase computing power without increasing consumption is to use massive parallelism. I don't have time to go into details today; the StorageMojo web site has a lot of interesting material on the subject.
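
To make the parallelism argument concrete, here is a minimal Python sketch that takes the rule of thumb above at face value. It assumes (this is a simplification, not a physical law) that a processor's throughput is proportional to its frequency f and that its power draw is k·f². The function name `cluster_power` and the constant `k` are hypothetical. Under that assumption, delivering the same work with N slower processors divides the power by N.

```python
def cluster_power(total_throughput, cores, k=1.0):
    """Power needed by `cores` identical processors to deliver `total_throughput`.

    Rule-of-thumb model (an assumption, not a physical law): a core running at
    frequency f delivers throughput f and draws power k * f**2.
    """
    if cores < 1:
        raise ValueError("need at least one core")
    f = total_throughput / cores      # frequency each core must run at
    return cores * k * f ** 2         # simplifies to k * total_throughput**2 / cores


if __name__ == "__main__":
    print(cluster_power(1.0, 1))  # baseline: one core              -> 1.0
    print(cluster_power(2.0, 1))  # twice the work on one fast core  -> 4.0
    print(cluster_power(2.0, 4))  # twice the work on four half-speed cores -> 1.0
```

In this toy model, four half-speed processors do twice the work of the baseline for the same power, while a single processor clocked twice as fast needs four times the power. It ignores leakage (static) power, cooling overhead and the cost of coordinating parallel work, and the real exponent depends on how far supply voltage can be lowered along with frequency (dynamic CMOS power scales roughly as C·V²·f); only the direction of the effect should be taken from it.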

I have just finished Ray Kurzweil's book "The Singularity Is Near" (2005). As usual, it is a fascinating book, especially from this "sustainable development of IT" perspective. More generally, it is a refreshing view from a "technology optimist", which offers a clear break from the prophets of doom. As far as computing is concerned, Ray Kurzweil offers software designers the hope of using much, much faster hardware (although, if you read the book, you'll see that the name may no longer be appropriate :)), something I dream of each time I run one of my "game theory simulations" :)

Ray Kurzweil's optimism does not cancel the validity of Google's concerns (the time scales are different). One might say, then, that a sustainable architecture needs to run on a grid-like structure (or any other form of massively parallel system architecture).

Thursday, October 11, 2007

Lean Information Systems

Lean Manufacturing is a powerful concept, which is often misunderstood. It was made popular by Toyota’s implementation and by the vision of Taiichi Ohno (one of Toyota’s charismatic leaders). A very simple way to explain it is to compare two production shops:

  • one shop is organized so that each machine runs at optimal capacity, in its best operating conditions. Buffers are introduced and transport between machines is a little longer (so that each machine can be set up optimally).

  • the second shop is organized so that the flow is as short as possible. Buffers are reduced (eliminated as much as possible) and transport is optimized. As a consequence, machines no longer work optimally: some are underutilized and others run in operating modes that do not yield the best productivity.

What does Lean Manufacturing (and experience) say? On paper, the first shop obviously costs less to operate (cost to produce one unit), but unless it operates in an ideal world with no variation at all, it actually costs more in real life. The second approach costs less from an inventory perspective, but above all it is more flexible (with respect to priority changes) and more robust (with respect to load variations).
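
The manufacturing intuition has a classic quantitative counterpart in queueing theory, Kingman's approximation for the mean waiting time in front of a single station: E[W] ≈ (ρ / (1 - ρ)) · ((ca² + cs²) / 2) · τ, where ρ is the utilization, ca and cs are the coefficients of variation of inter-arrival and processing times, and τ is the mean processing time. The Python sketch below (the function name `kingman_wait` is hypothetical) shows why the first shop only wins in a world with no variation: with ca = cs = 0 the wait is zero at any utilization, whereas with random arrivals and processing (ca = cs = 1) it grows as ρ/(1 - ρ), i.e. about 1, 5.7 and 19 processing times at 50%, 85% and 95% utilization.

```python
def kingman_wait(utilization, ca, cs, service_time):
    """Kingman's approximation of the mean waiting time in queue at one station.

    utilization:  fraction of time the station is busy (0 <= rho < 1)
    ca, cs:       coefficients of variation of inter-arrival and processing times
    service_time: mean processing time per unit
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return (utilization / (1 - utilization)) * ((ca ** 2 + cs ** 2) / 2) * service_time


if __name__ == "__main__":
    for rho in (0.50, 0.85, 0.95):
        ideal = kingman_wait(rho, 0.0, 0.0, 1.0)     # no variation at all
        variable = kingman_wait(rho, 1.0, 1.0, 1.0)  # random arrivals and processing
        print(f"utilization {rho:.0%}: ideal wait {ideal:.1f}, real-life wait {variable:.1f}")
```

The formula is an approximation for general arrival and processing distributions; for Poisson arrivals and exponential processing (ca = cs = 1) it coincides with the exact M/M/1 result.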

Let us now consider two information systems, within our scope of large-scale, distributed information systems (many parallel nodes running business processes):

  • The first one has been designed so that each node runs close to its optimal capacity. A node here may be a group (cluster, farm, blade) of servers that run services, which are the elementary components of the business processes. The computing power of each node is sized so that it runs at 85% capacity under full load (i.e., when the business processes are running at their maximal expected load).

  • The second one has been designed to speed up process execution and to avoid "queuing waiting time". Hence the computing power of each node is sized so that the average utilization ratio is closer to 50%.

Here also, the first data center is clearly cheaper to build than the second one. The second one has a few advantages: a better SLA (service level agreement) may be promised to the customer (tighter, i.e., faster guaranteed response times), and the upgrade process (when the company grows) may be planned more regularly... but let us assume that these are not compelling advantages. That is, let us suppose that the customer accepts two different SLAs:

  • in the first case, the SLA is such that the target response time will be met 98% of the time under regular business conditions.

  • in the second case, the SLA also guarantees the target response time 98% of the time, but that target is a smaller number than in the first case.

Last month I ran some computing experiments to see how these two data centers would behave when "a little stress occurs". Stress here may come from the unavailability of one node, from a process overload, or from a higher-than-usual variation in the processing load. Anyone with experience in operations will recognize these as the common issues of day-to-day production life.

These experiments were reported in a talk that I gave at the "Colloque d'Automne du LIX", from which I have extracted the last slide:

You may find the complete presentation on the CAL web site. To keep things simple, the curves describe the behavior of systems (1) and (2) under different stress scenarios. The different curves correspond to different "adaptive middleware" strategies (recall my interest in autonomic computing :)). What matters here is that the lower curve reproduces the strategy that ALL existing systems use today (first-come, first-served). What you may see is a tremendous difference:
  • The lean IS (on the left) actually does very well under stress. Only the loss of a node creates a real problem, and even then it is not major (the SLA drops to 75%).
  • The loose IS (on the right) is definitely not robust. The stress conditions cause a significant drop of the SLA (down to 20%!).
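
The qualitative gap can be reproduced with a far simpler model than the one used for the talk. The Python sketch below is not the simulation behind the slide (which compared adaptive middleware strategies on a distributed system). It assumes a single first-come, first-served server with Poisson arrivals and exponential processing times of mean 1, calibrates the SLA target so that 98% of requests meet it under nominal load, then applies a hypothetical 15% load surge. The function names `sla_target` and `simulate_fifo` are illustrative. The steady-state M/M/1 formula predicts that, under the surge, about 96% of requests still meet their target in the system sized for 50% utilization, against about 44% in the system sized for 85%; the simulation should land near those figures, with more noise at the higher load.

```python
import math
import random


def sla_target(utilization, quantile=0.98):
    """Response time met by `quantile` of requests in an M/M/1 queue (mean service = 1).

    In a stable M/M/1 queue the response time is exponential with rate (1 - rho),
    so its q-quantile is -ln(1 - q) / (1 - rho).
    """
    return -math.log(1 - quantile) / (1 - utilization)


def simulate_fifo(utilization, surge, target, n=500_000, seed=1):
    """Fraction of n requests answered within `target` by one first-come, first-served server.

    Hypothetical model: Poisson arrivals, exponential service times with mean 1,
    and the arrival rate multiplied by (1 + surge) to mimic a load spike.
    Successive waiting times follow Lindley's recursion.
    """
    rng = random.Random(seed)
    arrival_rate = utilization * (1 + surge)  # service rate is 1, so this is the stressed load
    wait = 0.0
    met = 0
    for _ in range(n):
        service = rng.expovariate(1.0)
        if wait + service <= target:           # response time = waiting time + service time
            met += 1
        gap = rng.expovariate(arrival_rate)    # time until the next request arrives
        wait = max(0.0, wait + service - gap)  # Lindley: W[k+1] = max(0, W[k] + S[k] - A[k+1])
    return met / n


if __name__ == "__main__":
    for rho in (0.50, 0.85):
        target = sla_target(rho)  # SLA calibrated for 98% under nominal load
        nominal = simulate_fifo(rho, 0.0, target)
        stressed = simulate_fifo(rho, 0.15, target)
        print(f"utilization {rho:.0%}: target {target:.1f}, "
              f"nominal {nominal:.1%}, +15% load {stressed:.1%}")
```

A slightly larger surge (roughly 18% on the 85% system) pushes utilization past 100%, at which point the queue never drains and the SLA collapses entirely.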

There is another way to say it: if your IS is run in such a way that message queues are often full of pending requests, setting up a proper SLA is very difficult, because predicting the behavior (response time) of an overloaded queueing system is hard science. It is not enough to add reasonable margins (such as promising a 10-minute response time because the average processing time is 1 minute).
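
The margin argument can be checked under a simple textbook assumption: a single M/M/1 queue (Poisson arrivals, exponentially distributed processing times, first-come, first-served). In that model the response time is exponentially distributed with rate (1 - ρ)/τ, so the share of requests answered within a target t is 1 - exp(-(1 - ρ)·t/τ). The function name below is hypothetical. With τ = 1 minute and a 10-minute promise, the formula gives roughly 99.3% at 50% utilization, 77.7% at 85% and only 39.3% at 95%: the same tenfold margin is either comfortable or useless depending on the load.

```python
import math


def mm1_sla_fraction(utilization, mean_service, target):
    """Share of requests answered within `target` in an M/M/1 FIFO queue.

    The response time (waiting + processing) is exponential with
    rate (1 - utilization) / mean_service when utilization < 1.
    """
    if utilization >= 1:
        # Unstable queue: the backlog grows without bound, so in the long run
        # almost no request meets a fixed target.
        return 0.0
    rate = (1 - utilization) / mean_service
    return 1 - math.exp(-rate * target)


if __name__ == "__main__":
    for rho in (0.50, 0.85, 0.95):
        share = mm1_sla_fraction(rho, mean_service=1.0, target=10.0)
        print(f"utilization {rho:.0%}: {share:.1%} of requests within 10 minutes")
```

Real systems have several stages and less regular distributions, so these numbers are only indicative; the point is that the achievable SLA depends far more on utilization than on the size of the margin.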

There is nothing new here. This experiment confirms what experience or intuition shows. What is interesting (and what surprised me) is the HUGE difference that the computing experiment reveals.

I plan to run similar experiments in the (global) enterprise context. I need a model that links the behavior of the IS with that of the company itself. Fortunately, I can rely on the great work (and models) just released by the CEISAR.

The CEISAR is a French initiative, under the patronage of the Ecole Centrale, to create a repository of models and practical knowledge about Enterprise Architecture. A first gem is their global model (follow "main concepts" then "Core Business System" on their web site), an attempt to define Enterprise Architecture with 10 key concepts. Another extremely useful piece in this first release is a document about entity modeling. In one of my books I complained that this type of knowledge was not accessible (and could only be obtained from experience). It is nice to see real-life experts, such as Jean-René Lyon, share their knowledge about such topics.

I definitely plan to adhere to CEISAR's terminology and framework in my own future work on IS architecture. One of the most pressing issues (as I have already said on this blog) is to build a framework/model to explain, discuss and simulate data distribution and synchronization protocols. The only way to make this a relevant topic is to keep a very broad perspective, one that includes a model of the coupling between IS and business. The nice conclusion is that this type of work falls neatly between my two topics of interest (cf. my other blog): IS efficiency and Enterprise efficiency.
