Sunday, December 3, 2006

Autonomic Computing

Since "Autonomic Computing" is a key concept related to the topic of this blog, I have translated an extract from my book ("Urbanisation et BPM"). It is not the best reference on the topic :) but it will give an overview for readers who are not from the IT community. This extract was written in 2004 (therefore, it is no longer precisely up to date ...)


Autonomic computing (AC) is the name given to a large-scale research initiative by IBM, based on the following premise: mastering the ever-increasing complexity of IT systems requires making them « autonomous ». More precisely, « autonomic » is defined as the conjunction of four properties:

(1) self-configuring: the system adapts itself automatically and dynamically to changes that occur in its environment. This requires a « declarative » configuration, with a statement of goals rather than a description of means. A good example of such a declarative statement is the use of SLAs as parameters: when something changes, the system tunes its internal parameters to keep satisfying its SLA policy (a minimal sketch of this idea appears after this list).

(2) self-healing: most incidents are handled automatically. The detection, diagnosis and repair of an incident are carried out by the system itself, which presupposes a capacity to reason about itself. Therefore, such a system holds a model of its own behaviour, as well as reasoning tools similar to so-called “expert systems”. This is often seen as the comeback of Artificial Intelligence, although with a rich model, simple and proven techniques are enough to go from incident detection to an action plan.

(3) self-optimizing : the system continuously monitors the state of its resources and optimizes their usage. One may see this as the generalization of load balancing mechanisms to the whole IT system. This requires a complex performance model, which may be used both in a reactive (balancing) and proactive (capacity planning) manner (cf. Chapter 8).

(4) self-protecting: the system protects itself from various attacks, both in a defensive manner, by controlling and checking accesses, and in a proactive manner, through a constant search for intrusions.

(note: for an excellent introduction to « autonomic computing », one should read the article « The dawning of the autonomic computing era » by A.G. Ganek and T.A. Corbi, which may easily be found on the Web)
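To make the « self-configuring » property more concrete, here is a minimal sketch in Python. All names (ServicePool, worker counts, the latency figures) are invented for illustration: the system is only given a declarative goal, a maximum response time taken from an SLA, and tunes its own parameters, here the number of workers, to keep meeting it.

    # Minimal sketch of declarative, SLA-driven self-configuration.
    # All names and numbers are hypothetical.

    class ServicePool:
        """A pool of worker instances whose size is tuned automatically."""

        def __init__(self, sla_max_response_ms, min_workers=1, max_workers=32):
            # The SLA is the *goal*; how to reach it is left to the system.
            self.sla_max_response_ms = sla_max_response_ms
            self.min_workers = min_workers
            self.max_workers = max_workers
            self.workers = min_workers

        def observe(self, measured_response_ms):
            """Self-configuring loop: react to changes to keep the SLA."""
            if measured_response_ms > self.sla_max_response_ms and self.workers < self.max_workers:
                self.workers += 1    # scale up: the SLA is at risk
            elif measured_response_ms < 0.5 * self.sla_max_response_ms and self.workers > self.min_workers:
                self.workers -= 1    # scale down: resources are wasted

    pool = ServicePool(sla_max_response_ms=200.0)
    for latency in [150, 220, 260, 240, 180, 90]:   # simulated measurements
        pool.observe(latency)
        print(f"latency={latency}ms -> workers={pool.workers}")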

The interest of these properties and their relevance to the management of IT systems are self-evident. One may wonder, however, why a dedicated research initiative is called for, instead of leveraging the constant progress of the associated fields of computer science. The foundation of the Autonomic Computing initiative is the belief that we have reached a complexity barrier: the management cost of IT infrastructure (installation, configuration, maintenance, …) has grown relentlessly with its size and power, and now represents more than half of the total cost. The next generations of infrastructure will be even more complex and powerful. IBM’s thesis is that managing them will only be possible if management becomes automated. The article just quoted gives a wealth of statistics showing the ever-increasing share of operational tasks, while at the same time business reliance on IT systems keeps growing, so that the financial consequences of IT outages are becoming disastrous.

Therefore, a breakthrough is necessary to overcome this complexity barrier: in the future, IT systems will need to manage and repair themselves automatically, with as little human intervention as possible. This may sound far too ambitious and closer to a “marketing and public relations initiative” than a directed research project: after the eras of distributed computing, autonomous computing and ubiquitous computing, here comes autonomic computing. Yet this initiative follows naturally from the computer science research themes of the past 30 years, and some of the problems and solution directions are actually old. However, two paradigm shifts have occurred. First, the center of attention has moved in the last few years from software (component, application) to software systems (hence the focus on enterprise architecture and business processes). In the 80s and 90s, the focus on problems similar to those of AC gave birth to the concept of intelligent agents, but their application to the full scale of corporate IT has proven difficult. From a CIO’s perspective, the focus on enterprise architecture and infrastructure is a welcome shift, especially since the research budgets are impressive. On IBM’s side, this is the main theme of an R&D budget of over 5 billion dollars. Most other major players are working on similar initiatives, even though the vocabulary may vary.

The second paradigm shift is the emergence of a biological model of incident processing. This is a departure from the endless search for methods that would produce bug-free software and for multiple back-ups that would guarantee complete availability of infrastructures. Autonomic Computing applies to the “real world”, where software contains many defects and runs on computers that experience all sorts of outages, and draws an analogy with “living organisms” to deliver fault-tolerant availability. Living organisms cope with a large spectrum of aggressions and incidents (bacteria, viruses, …) with a number of techniques: redundancy, regeneration, self-amputation, reactive aggression, etc. Similarly, autonomous IT systems need to be designed to perform in an adverse environment. The analogy with, and inspiration from, biology is a long-lived trend in computer science. For instance, the use of “evolution theory” as an optimization strategy has produced genetic algorithms and swarm approaches. To quote a NASA expert who is applying the swarm concept to micro-robots, part of the design is replaced by “evolution as an optimization strategy”.
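As a side note, the “evolution as an optimization strategy” idea can be illustrated with a toy genetic algorithm. The sketch below is purely illustrative (it is not tied to any NASA work): a population of bit strings evolves towards an all-ones target through selection and mutation.

    import random

    # Toy genetic algorithm: evolve a bit string towards an all-ones target.
    # Purely illustrative of "evolution as an optimization strategy".

    TARGET_LEN = 20

    def fitness(individual):
        return sum(individual)                      # number of correct (1) bits

    def mutate(individual, rate=0.05):
        return [bit ^ 1 if random.random() < rate else bit for bit in individual]

    population = [[random.randint(0, 1) for _ in range(TARGET_LEN)] for _ in range(30)]
    for generation in range(100):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == TARGET_LEN:
            break
        survivors = population[:10]                 # selection of the fittest
        population = [mutate(random.choice(survivors)) for _ in range(30)]

    best = max(population, key=fitness)
    print(f"best fitness: {fitness(best)}/{TARGET_LEN} after {generation + 1} generations")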

Even though there is a form of utopia in this field and many goals are definitely long-term, this is not science fiction. There is an evolutionary course towards autonomic computing, and the first steps correspond to technologies that have already been demonstrated in research labs. Actually, there is a symmetry in the founding argument: if we are not able to give our new systems more autonomy, their progress in scale and complexity will soon reach a manageability limit. It is, therefore, logical to bet on the forthcoming availability of autonomic capabilities. The issue is not “if” systems become autonomous, it is “when” and “how”.

Autonomic Computing is not a philosophy, it is an attitude, according to the experts. Systems should be designed to be autonomous from the earliest design stages, even though some of the choices and options may still be very rudimentary. An illustration drawn from the practice of real-time systems is the heartbeat principle. A heartbeat is a simple periodic signal that each component broadcasts as a testimony to its status (alive, weakened, distressed, etc.). The « heartbeat » principle comes from the real-time systems community. For instance, it is used for NASA’s satellites under the « beacon » designation. On this topic, one may read the papers from the NASA research center in the proceedings of the autonomic computing workshop that was part of EASE’04.
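Here is a minimal sketch of the heartbeat principle (hypothetical names, not an actual NASA or real-time protocol): each component periodically broadcasts a small status message, and a monitor raises an alert when a component reports distress or falls silent.

    import queue
    import threading
    import time

    # Minimal sketch of the "heartbeat" principle. All names are hypothetical.
    bus = queue.Queue()   # stands in for whatever broadcast medium is used

    def component(name, status, period, beats):
        # Each component periodically broadcasts a small heartbeat message.
        for _ in range(beats):
            bus.put((time.time(), name, status))
            time.sleep(period)

    def monitor(expected, timeout, duration):
        # The monitor flags components that report distress or fall silent.
        last_seen = {name: time.time() for name in expected}
        deadline = time.time() + duration
        while time.time() < deadline:
            try:
                ts, name, status = bus.get(timeout=0.2)
                last_seen[name] = ts
                if status != "alive":
                    print(f"ALERT: {name} reports status '{status}'")
            except queue.Empty:
                pass
            for name, ts in last_seen.items():
                if time.time() - ts > timeout:
                    print(f"ALERT: no heartbeat from {name} for more than {timeout}s")
                    last_seen[name] = time.time()   # avoid repeating the same alert

    threading.Thread(target=component, args=("billing", "alive", 0.5, 6)).start()
    threading.Thread(target=component, args=("crm", "distressed", 0.5, 2)).start()
    monitor({"billing", "crm"}, timeout=1.5, duration=4.0)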

This proactive attitude seems relevant to the design of information systems. The paper from Ganek and Corbi makes a strong argument in favor of evolution, as opposed to revolution, based on a roadmap that goes from “basic” to “autonomous”. The progression is labeled through the steps: managed, predictive, adaptive and autonomous. It is completely relevant to the transformation of business process management, as described in Chapter 8.

AC and Enterprise Architecture

The foundations of autonomic computing are well suited to the field of Enterprise Architecture, since IT components form a large and complex system. Hence, many aspects of the AC approach are relevant to the re-engineering of IT. We shall now consider three: autonomic computing infrastructure, autonomic business process monitoring and “biological” incident management. There is a clear overlap, but we shall move from the most complex to the most realistic.

The concept of an autonomic IT infrastructure has received the most media attention since it is at the heart of IBM’s strategy, being a cornerstone of on-demand computing (cf. 11.3.2). An autonomic infrastructure is based on computing resource virtualization; it manages a pool of servers that is allocated to application needs on a dynamic basis. Global monitoring allows load balancing, reaction to incidents through reactive redistribution, and dynamic reconfiguration of resources to adapt to new needs. The ultimate model of such an infrastructure is the GRID model, which makes grid computing a de facto chapter of autonomic computing. A grid is a set of identical, anonymous servers that are managed as one large parallel computing resource. The management of grid computing has already produced valuable contributions to the field of AC. For instance, the most accomplished effort to standardize “quality of service” in the world of Web Services, WSLA, comes from the world of grid computing. One may read the « Web Service Level Agreement (WSLA) Language Specification » and see, for instance, how performance-related SLAs are represented (under the measurement heading, 2.4.8). One may see the grid as a metaphor for tomorrow’s data center. Therefore, even though the grid is often reduced to the idea of using the idle power of a farm of PCs during the night, most CIOs should look into this field as a guideline for their own infrastructures.
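To give an idea of what a machine-readable, performance-related SLA looks like, here is a deliberately simplified sketch; it only mirrors the general idea behind WSLA and does not reproduce the actual XML specification.

    from dataclasses import dataclass

    # Deliberately simplified, hypothetical representation of a performance SLA.
    # The real WSLA specification is XML-based and far richer.

    @dataclass
    class SLAObjective:
        metric: str          # e.g. "average_response_time"
        operator: str        # "<", "<=", ">", ">="
        threshold: float     # e.g. 0.5 (seconds)
        window_seconds: int  # measurement window over which the metric is computed

        def is_met(self, measured_value):
            checks = {"<": measured_value < self.threshold,
                      "<=": measured_value <= self.threshold,
                      ">": measured_value > self.threshold,
                      ">=": measured_value >= self.threshold}
            return checks[self.operator]

    objective = SLAObjective("average_response_time", "<", 0.5, window_seconds=300)
    print(objective.is_met(0.42))   # True: the measured average stays under 0.5s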

A grid is by construction a resource that may be shared. It has interesting properties of robustness, fault-tolerance and flexibility. It is, therefore, a convenient infrastructure to bridge the gap between the autonomic and on-demand computing concepts. An “on-demand” infrastructure is a flexible computing infrastructure whose capacity may be increased or decreased dynamically, according to the needs of the company’s business processes. Such an infrastructure may be outsourced, which yields the concept of “utility computing” as a service, or it may be managed as an internal utility. There is more to “on-demand computing” than the infrastructure aspects, which will be covered at the end of this chapter. The synergy with autonomic computing is obvious: the business flexibility requirements of the on-demand approach demand a breakthrough in terms of technical flexibility, which translates into autonomic computing and is well illustrated by the grid concept.

This vision is shared by most major players in the IT industry. However, we consider it a long-term target, since the majority of today’s application portfolios in most companies cannot be migrated smoothly onto this type of infrastructure. This does not mean, as was previously stated, that some of the features are not available today. For instance, “blade server” infrastructures already deliver some of the autonomic benefits as far as self-configuration, self-management and self-healing are concerned.

In the mid term, one may expect the field of autonomic computing to have an impact on business process management. The field of OAI (Optimisation of Application Integration, cf. Chapter 8) is equally a long-term, complex research topic, but one which will benefit from small-step improvements. Two families of software tools are currently evolving towards autonomous behavior: integration middleware and monitoring software. BAM (business activity monitoring) software is integrating capabilities to model and to simulate the occurrence of business incidents. Adaptive or “fault-tolerant” middleware is emerging; for instance, look at the Chameleon project, which belongs to the ARMOR approach (Adaptive, Reconfigurable and Mobile Objects for Reliability). Chameleon is an integration infrastructure which combines adaptability and fault-tolerance.
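To give a feel for what “adaptive” or “fault-tolerant” middleware means in practice, here is a minimal, hypothetical sketch (it is not based on Chameleon’s actual design): a dispatcher retries a failed delivery and then falls back to an alternate endpoint.

    import random

    # Minimal sketch of a fault-tolerant dispatcher: retry, then fall back to an
    # alternate endpoint. Hypothetical; not based on Chameleon's actual design.

    class EndpointDown(Exception):
        pass

    def flaky_endpoint(message):
        if random.random() < 0.6:                  # simulated outage
            raise EndpointDown("primary endpoint unavailable")
        return f"primary handled: {message}"

    def backup_endpoint(message):
        return f"backup handled: {message}"

    def dispatch(message, retries=2):
        for attempt in range(retries + 1):
            try:
                return flaky_endpoint(message)
            except EndpointDown:
                print(f"attempt {attempt + 1} failed, retrying or rerouting...")
        return backup_endpoint(message)            # adaptive re-routing

    print(dispatch("new customer order #42"))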

In the short term, one may apply the “biology metaphor” of autonomic computing to formalize the management of incidents as it occurs in the “real world”. To summarize, one may say that system operation relies on two visions, a mechanical one and an organic one, to deliver the continuity of service that is required by its clients. The mechanical vision is based upon redundancy, with back-ups and spare copies of everything. When something fails, the faulty component is replaced with a spare. Depending on the recovery time constraints, this approach translates into a disaster recovery plan, clusters, fault-tolerant hardware, etc. For instance, one may look at the speeches of Carly Fiorina, HP’s former CEO, which focus on mutualization, consolidation and virtualization under the « Adaptive Enterprise » label. The same keywords appear in Sun’s or Veritas’s strategy for servers and data centers.

The organic vision is based upon alternate scenarios and re-routing. It is derived from an intimate knowledge of the business and its processes, and consists of a network of “alternate sub-processes”. It requires strong cooperation between operations managers, application developers and business owners. Only a precise knowledge of the business and its priorities enables a transverse team to find valid “alternate approaches”. We have named this approach « organic operations »; it may be described with the following goals:

  1. Create an operations model which supports the definition of scenarios and takes all operations tools and methods into account (some of which are represented with hatched boxes in the figure). Stating existing recovery procedures in a formal and shared manner is both the first stage of maturity and a possible step towards automation.
  2. Create multiple levels of “reflexes and reactions”, which are automated incident management rules. An interesting contribution from the biological metaphor is the separation between reflexes (simple, distributed) and reactions (centralized, requiring a « conscious » intelligence); a small sketch of this distinction follows the list.
  3. Create the tools which allow us to make “intelligent” decisions. These may be representation/mapping tools (linked to the topic of business activity monitoring) or planning/simulation tools: what would happen if the message flow from component A were re-directed towards B?
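The distinction between reflexes and reactions in item 2 can be sketched as simple event-handling code (all names are hypothetical): reflexes are local rules that fire immediately when an incident matches them, while anything else is escalated to a central decision component that knows the business priorities.

    # Hypothetical sketch of "reflexes" (simple, local, automated rules) versus
    # "reactions" (centralized decisions that require business knowledge).

    REFLEX_RULES = {
        "queue_full":   "discard lowest-priority messages",
        "node_timeout": "restart the node",
    }

    def handle_incident(incident, central_decider):
        # Reflex: a simple, distributed rule handles the incident on the spot.
        if incident in REFLEX_RULES:
            return f"reflex: {REFLEX_RULES[incident]}"
        # Reaction: escalate to a central component aware of business priorities.
        return f"reaction: {central_decider(incident)}"

    def central_decider(incident):
        # Stand-in for the "conscious" decision layer (planning, simulation, humans).
        return f"re-route the message flow around the failed component ({incident})"

    print(handle_incident("queue_full", central_decider))
    print(handle_incident("billing_app_down", central_decider))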

2 comments:

Yves Caseau said...

If you feel that the layout stinks ... you may be using IE 7. Strangely, editing and reading on Blogger work much better with Mozilla than with IE ... surprise, surprise.
I apologize to all IE users, there is not much that I can do.
I had to stop using IE for editing, as all the font menus had disappeared!

Anonymous said...

The interest of these properties and their relevance to the management of IT systems are self-evident
--> u mean "non management" ?!
How do you define "management" : setting goals, and regularly checking how your people get closer to these goals, or setting tasks, and regularly checking if these tasks are properly done ?


 