Sunday, February 11, 2007

Five Challenges for Enterprise Architectures

I have previously defined the subject of this blog in a very general manner, trying to look at the “big picture” (almost from an epistemological point of view). The “big question” is: how much autonomy and “intelligence” need to be fed into an information system to achieve the properties that are expected of it - resilience, performance, reactivity, and so forth? The reasoning is as follows. Large-scale information systems are, truly, complex systems, and they exhibit all the classical properties of such systems: fractal/recursive structure, non-linear behaviour under an incoming flow of events (cf. the interesting amplification loops of acknowledge/resend messages caused by fault-tolerance mechanisms), “emergence” of large-scale properties that differ from those at the component scale, etc. This is why I will use this blog as a forum to relate my (slow and progressive) journey into the world of complex systems and their theory.
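
To make this non-linearity concrete, here is a toy simulation of such an acknowledge/resend loop - a sketch of mine with invented parameters, not a model of any real system. Each request that times out is resent, so retries feed back into the offered load:

```python
# Toy model of an acknowledge/resend amplification loop: when the offered
# load exceeds capacity, some requests time out and are retried, which adds
# to the load and worsens the overload. All figures are invented.

def offered_load(base_rate: float, capacity: float,
                 rounds: int = 20, retries: int = 2) -> float:
    """Effective request rate after `rounds` cycles of retry amplification."""
    rate = base_rate
    for _ in range(rounds):
        excess = max(0.0, rate - capacity)   # requests that time out
        rate = base_rate + retries * excess  # timed-out requests are resent
    return rate

for base in (80.0, 95.0, 105.0):             # against a capacity of 100 req/s
    print(f"base {base:.0f} req/s -> effective {offered_load(base, 100.0):.0f} req/s")
```

Below the capacity threshold the effective load stays flat (80 and 95 req/s); just above it (105 req/s), the loop diverges - a threshold effect that no linear model of the components would predict.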

Today I will look at the topic from the other side, putting on my former “CIO hat”. I will describe five architecture/design challenges that face a modern (large) enterprise. The following is a list of issues that I struggled with during my five-year tenure, and which, I believe, are relevant to many companies of similar scale (size of the IS) and scope (role of the IS within the business processes). This is not to say that these are “open problems”: indeed, we found solutions to each of them, as did other corporations. On the other hand, no definitive answer has been proposed yet:

  • these are ad hoc solutions that leave some questions unanswered,
  • the state of the art, as defined in books or in the architecture frameworks sold by consultants, is surprisingly quiet on these topics,
  • academic research only touches the “surface” of these problems.

These are very practical questions; one of the goals of my research is to link them with the more forward-looking / science-fiction / philosophical discussion of autonomic and intelligent systems. As a matter of fact, one of my long-term goals is to address these issues in my computational experiments, such as those I ran for the OAI research:

Self-adaptive middleware: Supporting business process priorities and service level agreements
Advanced Engineering Informatics, Volume 19, Issue 3, July 2005, Pages 199-211

Here is a short description of each of them; I will return with a more detailed discussion on another occasion.

(1) How does one guarantee Quality of Service, defined at the business-process level, from the system-level properties of the individual components? This is a key challenge for any IS architecture framework, including service-oriented architecture (SOA). I coined the term OAI to describe this issue; a short description may be found on my web site (http://claire3.free.fr/yves_research.htm), and a more detailed presentation appears in the afore-mentioned AEI paper. This is clearly an open-ended issue: my preliminary work has led to more questions than answers.
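
As a first, deliberately naive illustration (my own sketch, not the OAI model itself), one can compose component-level SLA figures along a sequential process; the invented numbers below show how quickly end-to-end QoS degrades even in the easy case of independent failures and strictly sequential calls:

```python
# Naive composition of component SLAs along a sequential business process.
# Component names and figures are invented for illustration.

from math import prod

steps = [  # (component, availability, mean latency in ms)
    ("order-entry",  0.999, 120),
    ("credit-check", 0.995, 300),
    ("provisioning", 0.990, 800),
    ("billing",      0.998, 200),
]

availability = prod(a for _, a, _ in steps)  # assumes independent failures
latency = sum(t for _, _, t in steps)        # assumes a strict sequence

print(f"process availability ~ {availability:.2%}")  # ~98.21%
print(f"process mean latency  ~ {latency} ms")       # 1420 ms
```

Four components, each above 99% availability, already yield a process below 98.3%; and this model ignores precisely what makes the question hard - shared infrastructure, correlated failures, retries and queuing.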

(2) What is the proper approach to achieve resilience (the lowest possible system-scale impact when one or many components fail)? This is where biology comes in (cf. my previous message): the mechanical view of robustness through redundancy (multiple copies and spare parts) shows its limits in the real world, and, most often, real-life crises are resolved through alternate scenarios (an “organic” approach). It turns out that a number of such alternatives usually already exist: an older system that has not been disconnected yet, a simplification of the business process that is still acceptable, a different component that can render a simplified version of the service, a raw utility/patch/batch that fixes 80% of the problem (measured in dollars) for a fraction of the effort (measured in function points), etc.
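
One way to make this “organic” idea operational is to declare the alternate scenarios explicitly, as an ordered chain of degraded services. Here is a minimal sketch (all service names and behaviours are invented):

```python
# Resilience through alternate scenarios: instead of a single redundant
# copy, a service declares an ordered list of degraded fallbacks.

from typing import Callable, Sequence

def call_with_fallbacks(alternatives: Sequence[tuple[str, Callable[[], str]]]) -> str:
    """Try each alternative in order; return the first answer obtained."""
    for name, service in alternatives:
        try:
            return f"{name}: {service()}"
        except Exception:
            continue  # degrade to the next scenario
    raise RuntimeError("all alternate scenarios exhausted")

def new_rating_engine() -> str:
    raise TimeoutError("component failure")          # simulate an outage

def legacy_rating_engine() -> str:
    return "rated with yesterday's tariff table"     # older system, still connected

def flat_rate_batch() -> str:
    return "flat rate, reconciled by nightly batch"  # the "80% fix"

print(call_with_fallbacks([
    ("new",    new_rating_engine),
    ("legacy", legacy_rating_engine),
    ("batch",  flat_rate_batch),
]))
# -> legacy: rated with yesterday's tariff table
```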

(3) What kind of data architecture is best suited to distributing business objects in a coherent and efficient way? Business objects need to be distributed in a large-scale system for performance reasons, but they participate in business process executions, which require some form of coherence. This leaves roughly three options:

  • Assume a separate mechanism that ensures the coherence of the distributed objects (namely, a distributed database management system - DDBMS). On a small-scale or homogeneous system, this is obviously the most logical approach: we subcontract the coherence issue to another system and run the business processes assuming that distributed objects are coherent. It turns out to be difficult (the so-called “snapshot problem” of DDBMS cannot be solved in its full generality) and quickly impractical once one builds an information system out of heterogeneous COTS (commercial, off-the-shelf) software components.
  • Take responsibility for “business object distribution & coherence” as part of business process management. In other words, ensure that the business process flow pushes all relevant updates, so as to guarantee that, as far as the execution of the process is concerned, the objects are always in a coherent state. The synchronization of business-process events and object-management events is, indeed, a tricky issue.
  • Define an acceptable level of “chaos”, that is, accept that complete coherence is not necessary! This is actually a generalization of the previous approach, and it is closer to reality (real large-scale systems survive with a fair amount of non-synchronization).

Finding a proper approach that is robust to errors (i.e., that contains a proper long-running transaction mechanism) is, as far as I can see, both a truly practical question (the problem is everywhere, although many fail to recognize it) and a difficult one.
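
As a hint of what such a long-running transaction mechanism looks like, here is a minimal compensation-based sketch (step names are invented; a real mechanism would also have to persist its state and cope with failing compensations):

```python
# Minimal long-running transaction: each completed step registers a
# compensating action; on failure, the compensations run in reverse order.

def run_process(steps):
    """steps: list of (name, action, compensation) triples."""
    done = []
    try:
        for name, action, compensation in steps:
            action()
            done.append((name, compensation))
    except Exception as exc:
        print(f"'{name}' failed ({exc}); compensating in reverse order:")
        for _, compensation in reversed(done):
            compensation()

def fail(msg):
    raise IOError(msg)

run_process([
    ("reserve stock", lambda: None, lambda: print("  stock released")),
    ("charge card",   lambda: None, lambda: print("  card refunded")),
    ("ship order",    lambda: fail("carrier is down"), lambda: None),
])
# -> 'ship order' failed (carrier is down); compensating in reverse order:
#      card refunded
#      stock released
```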

(4) What is the nature of the service-level contracts that would let the system evolve in a flexible and distributed manner? This is the quest for “upward compatibility” when designing service interfaces (for instance in an SOA approach), which should enable a truly distributed evolution of the IS as a whole. My experience is that this is a difficult topic, and that large IS projects are often necessary when one component is upgraded from one version to another - a lot of interface work and a lot of non-regression testing.
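
One concrete (if partial) recipe for such upward compatibility is the tolerant-reader rule: a consumer reads only the fields it needs, supplies defaults for fields that are missing, and ignores fields it does not know, so that provider versions n and n+1 can coexist. A minimal sketch, with an invented message format:

```python
# Tolerant-reader sketch: the consumer extracts only what it depends on,
# so the provider can add fields without breaking it. The format is invented.

def read_customer(message: dict) -> dict:
    return {
        "id":      message["id"],                      # required in all versions
        "name":    message.get("name", "<unknown>"),   # default if absent
        "segment": message.get("segment", "standard"), # added in a later version
    }

v1 = {"id": 42, "name": "ACME"}
v2 = {"id": 42, "name": "ACME", "segment": "premium", "loyalty_points": 8}

print(read_customer(v1))  # works without the newer field
print(read_customer(v2))  # silently ignores 'loyalty_points'
```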

(5) How does one achieve a modular architecture that still takes advantage of opportunities to share and reuse services? In the world of SOA, this translates into the definition of a common, global directory of shared services. There is an obvious (and classical) tension between distribution and centralization. A common shared service is an opportunity to reduce cost, reduce size (hence complexity) and ensure more coherence. On the other hand, it may make the evolution of the system more difficult, and it certainly requires mechanisms to operate the community of stakeholders (who use the service) and to govern the roadmap of the shared component. It is also essential (as with any large-scale software) to introduce some level of abstraction and encapsulation, which rules out a unique, common, shared directory of all services. The structure of a distributed directory of services, with its proper governance mechanisms (visibility and decision rules), is yet to be defined before an enterprise-wide service-oriented architecture can be implemented.
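
To suggest what such a structure might look like, here is a minimal sketch of a directory with visibility scopes instead of a single flat catalogue (domains, services and rules are invented; real governance is of course much richer):

```python
# Sketch of a non-flat service directory: services are published into
# domains, and visibility rules state which domains a caller may see.

class ScopedDirectory:
    def __init__(self, visibility: dict[str, set[str]]):
        self.visibility = visibility        # domain -> domains it may see
        self.services: dict[str, str] = {}  # "domain/service" -> endpoint

    def publish(self, domain: str, service: str, endpoint: str) -> None:
        self.services[f"{domain}/{service}"] = endpoint

    def lookup(self, caller: str, domain: str, service: str) -> str:
        if domain not in self.visibility.get(caller, set()):
            raise PermissionError(f"'{caller}' may not see domain '{domain}'")
        return self.services[f"{domain}/{service}"]

d = ScopedDirectory({"crm": {"crm", "billing"}, "billing": {"billing"}})
d.publish("billing", "rating", "http://billing/rating")
print(d.lookup("crm", "billing", "rating"))  # allowed: a shared service
try:
    d.lookup("billing", "crm", "customer-360")
except PermissionError as e:
    print("refused:", e)                     # encapsulation enforced by the directory
```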

Since one of my current goals is to draw as much relevant information as possible from the literature on complex systems, I will post my findings regularly and try to explain how they relate to these five questions.
