Biology of Distributed Information Systems: 2014

Friday, October 24, 2014

Big Data hides more than one paradigm shift

This post originates from a report that I wrote this summer for the NATF, which summarizes what the ICT commission learned from its two-year cycle of interviews about Big Data. In 2012, the commission decided to investigate the impact of Big Data on the French economy. Big Data is such a popular and hyped topic that it was not clear, at first, whether a report would be necessary. So many books and reports have been published in the past two years (see the extract from the bibliography in the next section) that it made little sense to add a new one. However, throughout our interviews, and thanks to the FoE conference that NATF co-organized last year with NAE in Chantilly – which included Big Data as one of its four topics – we came to think that there was more to say about the topic than the usual piece about new customer insights, data scientists, the internet of things and the opportunities for new services.

Today I will focus on two ideas that may be characterized as paradigm shifts. I will keep this post short so it may be seen as a “teaser” for the full report, which should be available soon. The first paradigm shift is a new way to analyze data based on systemic cycles and the real-time analysis of correlation. The old adage “correlation is not causation” is made “obsolete” because data mining is embedded into an operational loop that is judged, not by the amount of knowledge that is extracted, but by the dollar amount of new business that is generated. The second paradigm shift is about programming: Big Data entails a new way to produce code in a massively distributed environment. This disruption comes from two fronts: on the one hand, the massive volume of data requires distributing both data and procedures; on the other hand, algorithmic tuning needs to be automated. Algorithms are grown as much as they are designed: they are derived from data through machine learning.

The full report contains an overview of what Big Data is because a report from NATF needs to be self-contained, but this is not necessary for a blog post. I assume that the reader has some familiarity with the Big Data topic. Otherwise, the Wikipedia page is a good place to start, followed by the upcoming bibliography entries, the first of which is Viktor Mayer-Schönberger and Kenneth Cukier’s book.

1. Big Data – A Revolution That Will Transform How We Live

This is the title of the book from Viktor Mayer-Schönberger and Kenneth Cukier, which covers the most famous paradigm shift of Big Data: its ability to transform our lives, from hard science, such as medicine, to marketing. The paradigm shift comes from the combination of what technology makes possible today – the ability to analyze very large amounts of heterogeneous data in a very short amount of time – and the availability of the relevant data, which are the traces of our lives that have become digital. Thanks to the web, to smartphones, to technology which is everywhere in our lives and objects, there is a continuous stream of information that describes the world and our actions. Big Data may be described as the Information Technology which is able to mine these “digital logs” and produce new insights, opportunities and services. The constant improvement of technology (from Moore’s Law about processing to Kryder’s Law about storage) is matched by the increase in digital details about our lives. New connected objects, sensors and the growth of IoT (Internet of Things) mean that we are only seeing the beginning of what Big Data will be able to do in the future.

One of the reasons for not discussing these themes further is that the book from Viktor Mayer-Schönberger and Kenneth Cukier covers them very well, so I encourage you to read it. The other reason is that there are many other sources that develop these theses. Here is a short extract from our report’s bibliography:

[1] Commission Anne Lauvergeon. Un principe et sept ambitions pour l’innovation. 2013.
[2] John Podesta et al. Big Data: Seizing Opportunities, Preserving Values. Executive Office of the President, May 2014.
[3] François Bourdoncle. “Peut-on créer un écosystème français du Big Data ?”, Le Journal de l’Ecole de Paris n°108, Juillet/Aout 2014.
[5] Viktor Mayer-Schönberger, Kenneth Cukier. Big Data – A Revolution That Will Transform How We Live, Work and Think. John Murray, 2013.
[8] Gilles Babinet. L’ère numérique, un nouvel âge de l’humanité : Cinq mutations qui vont bouleverser votre vie. Le Passeur, 2014.
[10] Phil Simon. The Age of The Platform – How Amazon, Apple, Facebook and Google have redefined business. Motion Publishing, 2011.
[12] IBM Global Business Services. “Analytics: Real-world use of big data in telecommunications – How innovative communication service providers are extracting value from uncertain data”. IBM Institute for Business Value, April 2013.
[13] Thomas Dapp. “Big Data – The untamed force”, Deutsche Bank Research, May 5, 2014.
[15] David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani. “The Parable of Google Flu: Traps in Big Data Analysis”
[16] Tim Harford. “Big data: are we making a big mistake?”, Financial Times, March 28, 2014.
[19] Octo Technology. Les géants du Web : Culture – Pratiques - Architecture. Octo 2012.
[21] Tony Hey, Stewart Tansley, Kristin Tolle (eds). The Fourth Paradigm – Data-Intensive Scientific Discovery. Microsoft Research, 2009.
[22] Max Lin. “Machine Learning on Big Data – Lessons Learned from Google Projects”.
[24] Michael Kopp. “Top Performance Problems discussed at the Hadoop and Cassandra Summits”, July 17, 2013.
[25] Eddy Satterly. « Big Data Architecture Patterns ».
[26] Paul Ohm. “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization”. UCLA Law Review, Vol. 57, p. 1701, 2010
[27] CIGREF, « Big Data : La vision des grandes entreprises », 2013

2. Big Data as a new way to extract value from data

As introduced earlier, the key idea is to forget about causation, or about trying to extract knowledge from data mining. A large part of Viktor Mayer-Schönberger and Kenneth Cukier’s book (chapter 5) is dedicated to the difference between causation and correlation. But what we found when we heard about “real big data systems”, such as those from Google or Criteo, is that these systems are anything but static data mining systems that aim at producing knowledge. They are dynamic systems that constantly evolve and “learn” from the data, in a controlled loop. Most big data statistical tools, such as logistic regressions or machine learning algorithms, are still looking for correlations, but these correlations are not meant to hold intrinsic value: they are input for action, whose effect is measured in real time. Hence knowing the “why” of the correlation, knowing if there is a causation, a reverse causation, or a complex dependency circle which is the signature of complex systems, does not really matter. Nor does it matter (as much as in the past) to be assured that the correlation is stable and will last in time. The correlation detection is embedded into a control loop that is evaluated through the financial result of the overall process.

This Big Data approach towards data mining is about practice and experiments. It is also broader in scope than statistics or data mining, since the performance comes from the whole “dynamic system” implementation. It is also easy to deduce from this, which we do in our report in more detail, that building such a Big Data system is a team effort, with a strong emphasis on technology and distributed systems and algorithms. From a marketing perspective, the goal is no longer to produce “customer knowledge” – which meant understanding what the customer wants – but to build an adaptive process which leads to better customer satisfaction – which is actually less ambitious. If we consider the Walmart example that is detailed in the previously mentioned book [5], analyzing checkout receipts produces correlations that need not be thought of as “customer insights”. There is no need to find out why there is a correlation between purchases of diapers and beer packs. It is enough to put these two items close by and see if sales improve (which they did). In the virtual world of the Web, testing these hypotheses becomes child’s play (no physical displacement necessary).
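To make the idea of a correlation that is “embedded into a control loop” more concrete, here is a minimal sketch in Python. It is not taken from any of the systems we heard about: it is a plain epsilon-greedy loop, and the two “shelf layouts” with their conversion rates (0.05 and 0.12) are hypothetical numbers chosen for illustration. The loop never asks why a layout sells better; it only acts, measures and updates.

```python
import random

def run_control_loop(true_rates, rounds=20000, epsilon=0.1, seed=42):
    """Epsilon-greedy loop: act on the current best guess, measure, update.

    true_rates holds hypothetical conversion rates, one per candidate action
    (for instance, two shelf layouts). The loop does not know them.
    """
    rng = random.Random(seed)
    shows = [0] * len(true_rates)   # how often each action was tried
    wins = [0] * len(true_rates)    # how often it led to a sale
    for _ in range(rounds):
        if rng.random() < epsilon:
            # explore: try a random action to keep measuring all of them
            action = rng.randrange(len(true_rates))
        else:
            # exploit: pick the best observed rate; untried actions go first
            action = max(
                range(len(true_rates)),
                key=lambda a: wins[a] / shows[a] if shows[a] else float("inf"),
            )
        shows[action] += 1
        if rng.random() < true_rates[action]:   # simulated customer response
            wins[action] += 1
    return shows, wins

if __name__ == "__main__":
    shows, wins = run_control_loop([0.05, 0.12])
    for a, (s, w) in enumerate(zip(shows, wins)):
        print(f"layout {a}: shown {s} times, observed rate {w / max(s, 1):.3f}")
```

The only “knowledge” that this loop holds is a pair of counters per action, and its performance is judged by the total number of sales, which is the spirit of the approach described above.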

There is a natural risk of error if one takes these correlations out of their dynamic-loop context and tries to see them as predictions. Our different experts, including the speakers invited for the 2013 Frontiers of Engineering conference – in particular Jeff Hammerbacher from Cloudera and Thomas Hofmann from Google – were adamant about the fact that “Big Data produces information that one does not really understand”, which creates the risk of poor utilization. This is similar in a sense to the phenomenon of “spurious correlations”: when one analyzes a large cloud of data points with a very large number of variables, one finds a large number of statistically significant correlations that carry no real meaning. A great illustration of this pitfall is given by the “Google Flu Trends” (GFT) story. When they analyzed search requests that used words linked to flu, Google researchers found that they could forecast the propagation of flu epidemics with a good level of accuracy. This claim was instantly absorbed, amplified and orchestrated as a proof of Big Data greatness. Then more detailed analyses [15] [16] showed the limits and the shortcomings of GFT. The article that is published on Harvard’s blog [15] is actually quite balanced. Although it is now clear that GFT is not a panacea and shows more errors than other simpler and more robust forecasting methods, the article also states that: “The initial vision regarding GFT – that producing a more accurate picture of the current prevalence of contagious diseases might allow for life-saving interventions – is fundamentally correct, and all analyses suggest that there is indeed valuable signal to be extracted”.
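The spurious correlation phenomenon is easy to reproduce. The following sketch is an illustration of my own, with arbitrary sizes (100 variables, 20 observations each) and an arbitrary threshold (|r| ≥ 0.6): it generates variables that are independent by construction and counts the pairs that nevertheless look strongly correlated.

```python
import itertools
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def count_strong_pairs(n_vars=100, n_points=20, threshold=0.6, seed=1):
    """Count pairs of independent random variables with |r| >= threshold."""
    rng = random.Random(seed)
    # Every variable is pure Gaussian noise: no pair has any real link.
    data = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_vars)]
    strong = sum(
        1
        for x, y in itertools.combinations(data, 2)
        if abs(pearson(x, y)) >= threshold
    )
    total = n_vars * (n_vars - 1) // 2
    return strong, total

if __name__ == "__main__":
    strong, total = count_strong_pairs()
    print(f"{strong} 'strong' correlations found among {total} unrelated pairs")
```

With few observations and many variables, some of these unrelated pairs pass the threshold, which is why such correlations should be tested in a loop rather than read as insights.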

3. Data is the new code

This great catch phrase was delivered to us by Henri Verdier, one of the many experts who were interviewed by the ICT commission. When Google’s teams look at a new startup, they compute the valuation mostly from the volume and the quality of the data that has been collected, with much less regard for the code that has been developed. Data valuation comes both from the difficulty of collecting the data and from its estimated future usage potential. Code is seen as an “artefact” that is linked to data, which is both destined to change and easy to replace. In this new world of Big Data, code is conceptually less important than the data it is applied to. It is less important because it changes constantly, because it is made of simple sub-linear algorithms (the only ones that can be run on petabytes of data) and because it is the result of a learning loop (algorithms that are simple in their principles, but with zillions of parameters that need to be fine-tuned through experiments). To caricature the reasoning, Google could tell these young startups: “I will buy your data and I will re-grow the code base using our own methods”.

This new way of programming does not apply only to new problems and new opportunities! This approach may be used to re-engineer more classical “information systems” through the combined application of commodity computing, massively parallel programming and open source data distribution software tools. This combination helps gain one or two orders of magnitude in cost, as was shown to us with numerous examples. In the previously mentioned book [5], one may learn about the VISA example, where Big Data technology was used to re-build an IT process with spectacular gains in cost and throughput. This “new way of programming”, centered on data, may be characterized in three ways:

Massively parallel programming because of the distribution of very large amounts of data. The data distribution architecture becomes the software architecture because, as the volume grows, it becomes important to avoid “moving data”.
Sub-linear algorithms (whose compute time grows slower than the amount of data that they process) play a key role. We heard many great examples about the importance of such algorithms, such as the use of HyperLogLog counters in the computation of the diameter of Facebook’s social graph.
Algorithms need to be adaptive and tuned incrementally from their data. Hence machine learning becomes a key skill when one works on very large amounts of data.
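For readers who have not met HyperLogLog counters, here is a minimal sketch of the idea in Python. It is a simplified version written for this post (SHA-1 as the hash function, 2^p registers, the standard bias constant for m ≥ 128 and the small-range correction), not the implementation used at Facebook. What matters is that the memory stays fixed at m small registers however many distinct items go through the counter.

```python
import hashlib
import math

class HyperLogLog:
    """Simplified HyperLogLog distinct-value counter (illustrative sketch)."""

    def __init__(self, p=10):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (1024 for p=10)
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item (SHA-1 truncated; an arbitrary choice here)
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits: register index
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # small-range correction (linear counting)
            return m * math.log(m / zeros)
        return raw

if __name__ == "__main__":
    hll = HyperLogLog(p=10)
    for i in range(50000):
        hll.add(f"user-{i}")
    print(f"estimated distinct users: {hll.estimate():.0f} (true value: 50000)")
```

The price of this compactness is an approximate answer: with 1024 registers the typical relative error is a few percent, which is usually acceptable when the alternative is to keep every identifier in memory.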

During the 2013 FoE conference, Thomas Hofmann told us that “Big data is getting at the core of computer science”. This means that all current problems that receive the attention of today’s computer scientists, such as artificial intelligence, robotics, natural language processing, behavioral learning, and so on, require the combination of these three characteristics: need for massive, hence distributed, computing power, huge amounts of data, and machine learning to grow better algorithms.

This does not mean that “data as the new code” is a universal approach towards information system design. Massive distribution of data has its own constraints and faces fundamental (theory-proven) data architecture difficulties. They are known, for instance, as the CAP Theorem or the problem of snapshot algorithms in distributed computing. Simply put, when part of the network becomes unavailable (a network partition), a distributed system cannot guarantee both data consistency and high availability. Big Data solutions usually pick a weakened form of consistency or availability. The logical consequence is that there remain domains – mostly related to transactions and ACID requirements as well as very low latency requirements – where “more classical” architectures are still better suited.

4. Conclusion

These two paradigm shifts are accompanied by changes in culture, methods and tools. To finish this post and to summarize, I would cite: agile methods, open-source software culture and DevOps (continuous build, integration and delivery). It is stunningly obvious that one cannot succeed in developing the kind of closed-loop data mining systems described in Section 2, nor the machine-learning, data-driven algorithms described in Section 3, without the help of agile methods. Agile methods advocate incremental and short batch development cycles, organized around multi-skilled teams, where everyone works in a synchronous manner on the same objective. The same argument applies to the proper use of open-source software, though it is less obvious and comes from experience. It is not about using great software available for free, it is more about using continuously evolving software that represents the bleeding edge of big data technology. It is even more about the open source culture, which fits the concept of software factories like a glove (the topic of another post!). To succeed in this new software world, you need to love code and respect developers (and yes, I am aware of the paradox that this may cause together with the “data is the new code” motto). Last, there is no other way to produce continuously evolving code (which is implied by both these paradigm shifts, but is also true in the digital world) than switching to continuous build, integration and delivery, as exemplified by DevOps. I am citing DevOps but I could also make a reference to the software factory idea (the two are closely related).

Not surprisingly, the reader who is familiar with “Les Géants du Web” [19] from Octo will recognize the culture which is common to the “Web Giants”, that is, the companies that are the most successful in the digital world, such as Amazon, Google or Facebook. There is no surprise because these companies are also amongst the world leaders in leveraging the promises of Big Data. Agile (hence collaborative) development is critical to Big Data, which requires mixing computer science, information technology, statistical and domain (business) skills. Because Big Data requires working on the “real” (large) sets of data, it calls for a strong collaboration between IT operations and development. This is made even more critical by the paradigm shift described in Section 2, since algorithmic development and tuning is embedded into an operation cycle, which is an obvious call for DevOps.

I will conclude with a few of the recommendations from the NATF report:

Big Data is much more than new opportunities to do new things. Fueled by a technology shift that is caused by drastic price drops (storage and computing), the Big Data paradigm disrupts the way information systems are built.
Massive parallelism and huge volumes of data are bringing a new way of programming that is urgent to learn, and to teach. This goes for companies as well as universities or engineering schools.
The old world of cautious “analyze/model/design/run” waterfall linear projects is in competition with a new world of systemic loops “experiment/learn/try/check”. This is true for science [21] as well as for business. Hence, Big Data’s new paradigms need to be taught in business schools as well as in engineering schools.

Readers who are familiar with François Bourdoncle’s theses on Big Data will recognize them in these recommendations, which is quite natural since he was one of the experts interviewed by the ICT commission of the NATF.

Monday, July 14, 2014

Viral Propagation Models for Apps and Social Software

Today’s post is a follow-up to my previous text on software ecosystems: I will focus on the virality of social applications, that is, the ability for applications to grow their customer bases through social networks. This post is more technical than most, because it is unfortunately necessary, but I will try to keep everything “as simple as possible, but no simpler” :).

Social propagation of applications is desirable because the fight to survive on the smartphone is quite tough. Not only do most people download only a few tens of apps (statistics vary according to sources; however, the story is the same), but most of these apps are hardly used: 80 to 90% of downloaded apps are used only once, then discarded. Becoming one of the few apps that stay in the “smartphone top of mind” (i.e., an active app) is very hard, and being a collected app (installed for future use) seems to be very precarious. This is why the route of the web application (smart responsive HTML5 page with embedded bells & whistles) that is accessed through all the classical Web paths (search, links, etc.) is looking more and more interesting for many companies.

We may categorize the social behavior of apps into three categories:

Solo apps: applications whose main goal is to be used on your own, even if the score (of a game) may be shared eventually.
Communication apps: applications which are used to synchronously communicate with other people. The value of the service grows with the number of correspondents that may be reached.
Social apps: applications which use asynchronous communication to become content publishing platforms. The distinction between “communication & social” will become clearer later on, but we may state right now that the value depends on the amount of available content, which depends on the total amount of time spent by social partners on the social app.

Not surprisingly, we know that solo apps appeal more to men than women. What I want to look at is the ability for social software (a larger category than apps) to propagate itself through social use and recommendation.

1. Metcalfe’s Law for Communication Software

If we consider a simple communication tool (such as instant messaging), its customer base defines a communication network whose value grows as the square of the number of users (O(N^2)), according to Metcalfe’s Law. Metcalfe's Law states that the value of a communication network grows as the number of possible pairs of connected users.

The value for one individual is linear in the number of users (O(N)), but both the total value and the virality are quadratic. The virality, which is linked to the growth rate, may be seen as the product of the “infected” population (number of users) and the probability for one customer to “infect” another person (that is, recommend the service), which is linked to her or his satisfaction (hence, to the value).

One may notice that this is already quite different from an epidemiology model, since the probability of transmitting the disease depends not only on being infected, but also on the number of your infected friends.

There are two points which are usually debated in this reasoning. The first remark states that we do not benefit from a very large network of possible contacts, since the number of meaningful correspondents is usually bounded (whether by Dunbar’s number or any other).

The second idea is that all correspondents are not equal and that the distribution of communication time usually follows a law similar to Zipf’s Law. This leads to the result that the total value grows in an O(N log N) fashion. The whole issue boils down to the question of whether the distribution of the communication tool among your possible correspondents is homogeneous (randomly distributed) or not. This is actually a debate about strong ties versus weak ties, one of my favorite topics. If the communication tool is used to communicate with your close friends, then the propagation model follows the strong-ties social graph and we may assume that the value for each customer grows in O(log N) because of Zipf’s law. On the other hand, if the communication tool is used to reach a larger set of people, then the probability that one of these contacts is equipped with the same communication tool is roughly linear with respect to the usage rate, hence the individual value grows in O(N).
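The difference between the two valuations is easier to appreciate with numbers. The sketch below compares them under simple assumptions of mine: the Metcalfe value counts the possible pairs of users, while the Zipf-weighted value assumes that each user values her or his k-th closest correspondent as 1/k, which gives a per-user value close to log N.

```python
import math

def metcalfe_value(n):
    """Total value under Metcalfe's Law: number of possible pairs of users."""
    return n * (n - 1) // 2

def zipf_value(n):
    """Total value when each user weights the k-th closest contact by 1/k.

    The per-user value is the harmonic number H(n-1), which grows like log n,
    so the total grows like n log n.
    """
    per_user = sum(1.0 / k for k in range(1, n))
    return n * per_user

if __name__ == "__main__":
    for n in (10, 1000, 100000):
        ratio = metcalfe_value(n) / zipf_value(n)
        print(f"N={n}: pairs={metcalfe_value(n)}, "
              f"Zipf-weighted={zipf_value(n):.0f}, ratio={ratio:.1f}")
```

The ratio between the two grows quickly with N, which is why the choice between the strong-ties and weak-ties assumptions matters so much when one tries to estimate virality.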

2. Social Software and Cumulative Valuation of Time

We now consider an application, like Facebook, that acts as an asynchronous content publishing platform. The key observation is that the value of a Facebook session does not depend on how many friends you have, but on how frequently they visit and contribute.

People have different profiles when it comes to reading and contributing on social platforms. However, it is plausible to assume that (a) the read/write ratio is different for each individual but remains rather stable over time (b) the amount of messages read and written is proportional to time spent on the social platform. Similarly, the attractiveness (i.e., interest to others) of content varies significantly from one user to another, but we may assume that the interest varies linearly with the amount of messages that are exchanged (this is clearly wrong for “newsworthy events” but seems to be true for the vast majority of exchanges that happen on Facebook).

This leads to a recursive system of complexity equations (written in a rather informal style) :

Total Value = N x Average Value
Average Value = Average Degree x O(Average Time Spent) x Filtering Factor
Average Time Spent = O(Average Value)

The only way to make this system of equations balanced is to assume that the asymptotic behavior of the “Filtering Factor” is O(1/D), where D is the average degree (which makes sense: there is only so much that you can read). So if the average degree grows, some filtering is necessary. For instance, Facebook reports that, every day, it has to choose what to display to each user among 1500 messages. This is the role of the “EdgeRank” filtering algorithm, a topic which I have discussed in a previous post.

Once the role of “filtering” is understood, we are left with a “self-fulfilling” set of circular equations that tells us that the value is proportional to the average time spent, which is proportional to the perceived value. It may be thought of as a disappointing tautology, but it says that similar social platforms may indeed know very different fates.

At this time we can state two things:

The formula that describes the value obtained by a social app user is complex, hence the virality percolation model is complex. It does not compare at all with an epidemiology model since the probability of “infecting” someone depends both on (a) the number of your infected friends (b) how deeply infected they are.
There is no simple model for understanding the spread of social network platforms: there may exist multiple solutions with similar customer bases (N). The example of Google Plus and Facebook springs to mind: they both have large customer bases (1,230 million monthly active users for Facebook and 300 million monthly active users for Google Plus) and average time-spent statistics which are totally different (8 hours per month for Facebook versus 7 minutes for Google Plus). Nothing in the percolation models tells if Google Plus should grow closer to FB in the future; it all depends on much finer details (value provided to the user per unit of time and per unit of meaningful social content). The non-linear nature of the equation (re-entering loop) means that a tiny difference in this value-creation function may lead to a radical difference in customer usage (i.e., the presentation difference produces different time allocation patterns that, in turn, amplify the perceived value difference).
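The sensitivity of this re-entering loop may be illustrated with a toy simulation. This is a deliberately crude sketch, not the computational model of the platforms: I collapse the circular equations into a single multiplicative loop, where the hypothetical `gain` parameter stands for the product “value per unit of time” × “time per unit of value”, and I add a cap (an assumption of mine) that represents the finite time budget of a user.

```python
def simulate_time_spent(gain, initial_minutes=60.0, budget_minutes=600.0, steps=100):
    """Iterate time -> value -> time with a constant loop gain.

    gain > 1: each round of perceived value brings slightly more time spent.
    gain < 1: each round brings slightly less.
    The monthly time budget caps the growth (hypothetical figure).
    """
    t = initial_minutes
    history = [t]
    for _ in range(steps):
        t = min(budget_minutes, gain * t)
        history.append(t)
    return history

if __name__ == "__main__":
    for gain in (0.97, 1.00, 1.03):
        final = simulate_time_spent(gain)[-1]
        print(f"gain={gain:.2f}: time spent after 100 rounds = {final:.1f} minutes")
```

Two platforms that start from the same usage, with loop gains that differ by a few percent, end up at opposite ends of the range, which is the kind of divergence that the Facebook and Google Plus figures suggest.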

Notice that usage and subscription are two very different things, with different percolation models. Subscription is much closer to an epidemiological model (modulo the observations that we made earlier), which makes viral adoption both easier to predict and easier to foster.

3. Why Facebook’s Doom Cannot Be Predicted with Epidemiological Models

Early this year there was a lot of excitement about a paper that predicted that Facebook would almost disappear before 2017. This information was printed and commented on in many famous news sites and newspapers. The source of this information is an arXiv paper (i.e., a preprint submitted for publication) from two Princeton PhD students, John Cannarella and Joshua Spechler.

Facebook replied with a humorous answer where they use different buggy-but-convincing statistics charts to show the future decline of Princeton and breathing air. They conclude that “We don’t really think Princeton or the world’s air supply is going anywhere soon. We love Princeton (and air). As data scientists, we wanted to give a fun reminder that not all research is created equal – and some methods of analysis lead to pretty crazy conclusions.”

I actually downloaded and read the article, which is very simple and straightforward. It looks at how the percolation of social networks may be modelled with an epidemiology model (which is clearly wrong, as we showed in the previous section). On the one hand, the paper is “technically correct”: it simply asks what would happen if Facebook’s usage behaved like the spread of a disease. What is incorrect is all the newspapers that drew the wrong conclusion. On the other hand, it is of no value since it is very clear that the model does not fit the problem. The fact that the authors were able to tweak the virology parameters so that the first phase of Facebook growth matched historical data is irrelevant. There are many percolation models that would give a similar “S-curve” phase of growth. I laughed at Facebook’s debunk of the article (the fact that it is quoted as a viral / epidemiology research article from two PhD students from the Mechanical and Aerospace Engineering department should have raised some suspicions), but the debunk misses the point: it is not poor data science, it is poor science to begin with. If you look at the illustration, you will see that the “input data” used for the epidemiology model is the number of “Facebook searches”, which means that the decline may also be interpreted as the complete domination of Facebook!
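To see why fitting the growth phase proves nothing, it helps to look at what an epidemiology model does by construction. Here is a minimal sketch of the textbook SIR model (susceptible, infected, recovered), integrated with simple Euler steps. This is not the exact variant used in the paper, and the parameter values are arbitrary: whatever the parameters, as long as the epidemic starts, the “infected” population rises along an S-curve and then declines.

```python
def simulate_sir(beta=0.5, gamma=0.1, i0=0.001, steps=2000, dt=0.1):
    """Textbook SIR model on population fractions, Euler integration.

    beta:  infection rate (here: adoption through contact with users)
    gamma: recovery rate (here: users who leave)
    Returns the trajectory of the infected fraction and the final (s, i, r).
    """
    s, i, r = 1.0 - i0, i0, 0.0
    infected = [i]
    for _ in range(steps):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        infected.append(i)
    return infected, (s, i, r)

if __name__ == "__main__":
    infected, final = simulate_sir()
    peak = max(infected)
    print(f"peak 'infected' fraction: {peak:.2f} at step {infected.index(peak)}")
    print(f"final 'infected' fraction: {infected[-1]:.6f}")
```

The decline is not a finding about Facebook, it is a property of the equations: once you accept the model, the doom is already written.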

4. Percolation Models for Social Software are Unstable

The previous “model” of section 2 is crude because it does not introduce the connection frequency. To understand and to model the behavior of a social app user, one needs both the average frequency and the average time spent per user (20 mins for an average Facebook session and slightly more than once a day). I tried to build a computational model two years ago, and failed because I did not have enough connection frequency data. This means that I could have used my model to predict almost any possible outcome … somehow like the Princeton computational experiment.

From a system science perspective, the “re-entrant” characteristic of the “time spent” parameter in the value equation means that any model is bound to be quite unstable and very sensitive to other dimensions (see the conclusion). One could point out that, as a consequence, the outcome proposed by John Cannarella and Joshua Spechler is not impossible :). Let us look at a possible “Facebook displacement” scenario (since users seem to enjoy the time they spend on Facebook, it is logical to assume that such a scenario is the outcome of the introduction of a newer, better platform). It makes perfect sense to illustrate this with the rise of WhatsApp (considering the money spent by Facebook to acquire them, someone else must have thought that there was a real threat). The scenario breaks into four steps:

A new app appears that is more efficient for a new group of users (most likely an age-based group, but not necessarily: it may be a matter of geography or culture). WhatsApp is a great example since it has reached 500 M users in record time.
Because the app is significantly better (from the point of view of new users), it eats away the “free time budget” : the time spent on the new app is taken away from the time spent on Facebook. This is clearly true for WhatsApp with more than 10 hours of monthly use (here also, statistics vary, but the tally is still impressive).
This decreases the perceived value of Facebook for other users, who open an account and then spend some of their SNS time on the new app. This has yet to appear for the WhatsApp case; for instance, in Spain where WhatsApp is very strong, Facebook is still growing, even if the adoption rate is slower than in other European countries. Also, the fastest growing segment of Facebook users is people over 55; it will be hard to get them away as a community.
Eventually the new app becomes the place where the majority of users go (there is a winner take all system dynamic, which has been very profitable for Facebook since it started).

Steps (1) and (2) may happen rapidly, but (3) and (4) will take much longer (this is a guess, as said earlier, the speed has nothing to do with an epidemiological model and is much harder to model). But time spent becomes a habit, and habit takes longer to change (it takes longer to forget a habit than to pick a new one).

A lot of work is available in the scientific community related to percolation over social networks, including the work from Callaway, Newman, Strogatz and Watts, which has inspired my own research about social networks. However, the time aspect of social network usage completely changes the percolation model.
The previous curve shows that social apps have a stronger percolation capability than simpler communication apps.

5. Conclusion

Rather than drawing a conclusion from this difficulty to efficiently model percolation of social software, I will simply point out a few directions for developing social and viral adoption of applications:

One must “pick the right fight”: it does not make sense to fight for usage time if the usage frequency is not high enough. If the frequency is too low, it’s a different game: how to use other SNS for “signaling” (letting people know that their friends have used your app).
“Surf the wave instead of racing it”: profit from existing SNS which are created as platforms, to leverage existing social networks to grow your own app's social usage.
Make it easy to share your content on competing platforms (a good example being LinkedIn, which allows easy sharing with Twitter, while the reciprocal exchange, that is, sharing from Twitter on any other SNS, is not possible).
Empower your users to do whatever they please with your app, making it a true "platform". This follows from the observation that increasing time spent will increase value, hence adoption. This is something that Facebook has been quite good at (although this is a subject of debate), and that Snapchat or Instagram are also good examples of.
Think about “value / effort” all the time and focus on simplicity, usability and speed. Especially, to the previous point, sharing/publishing must be as effortless as possible. We are back to the “maximize the value per unit of time and unit of content” principle stated in Section 2. The dynamics of content/time percolation means that a small efficiency competitive advantage can accumulate rapidly into a larger content & customer base sustainable advantage.

Sunday, June 1, 2014

12 Principles of Lean Software Factories

This month’s post is a simple one, which presents the concept of LSF (Lean Software Factory) through 12 principles. It will not bring forward new ideas compared to my previous posts, but it is a fresh way to look at the combination of agile/scrum/lean/devops without over-thinking the influences or the relationships between different schools of thinking. This list of twelve principles is taken from the talk that I gave at the Lean Summit in Lyon. As I stated in the introduction, this is a "Toyota Way" "how to manual" for a software development team.

1. Organize work around cross-functional united teams

Teamwork leverages the strength of strong ties, that is, the links that form within a group of people who work together all day long. It creates a shared context which is the most efficient form of implicit communication.
A team should leverage talent diversity through cross-functionality. Cross-functionality means not only that we have multiple skills within the team (which is necessary to tackle complex tasks) but that a fair amount of substitution is possible (many team members can lend a hand to any other member), a key for effective cooperation and flexibility.
Unity and versatility strengthen one another.
There is no longer a contractual vision of a client-supplier relationship with external hired help. Each member of the team has the same rights, which means that outside suppliers become partners.

2. Teams operate on a common synchronous time

Face-to-face communication replaces email for internal team one-to-one communication. This leverages the strength of both tone and non-verbal communication.
Every day starts with a stand-up meeting, which replaces a fair amount of one-to-one communications. The stand-up meeting builds the team spirit and common focus on the shared goal. Everyone tells where she or he stands (achievements of the previous day), explains what the objectives for the coming day are and shares possible concerns.
The team operates on a common shared time, which is the customer’s time (following the lean concept of takt time). This is a clear departure from asynchronous work, which has become the default mode for engineering in the past decades. The importance of synchronous work is well explained in “The Lean Startup”.

3. Customer-centric organization, for real.

The customer needs to be present on the software development premises. This is symbolic, through the availability of a customer wall or a customer room, which dynamically collects and displays end-user problems, insights and aspirations. This is also physical, through the presence of a “customer-proxy” role within the team.
Software development and communication are organized around « user stories ».
Continuous improvement is a cornerstone team activity, which is not de-prioritized to add new features. Lean management principles of “zero defects” and “right the first time” are applied thoroughly, because they have proven to produce customer satisfaction.
Last but not least, a customer-centric organization is bound to change its culture from the traditional project culture of software development to a product culture.

4. « Fail early to succeed sooner »: test as early as possible

« Test-driven development »: code developers need to start their programming with unit tests.
Testing must occur end-to-end, that is, from the early unit testing to the instrumented « test during production » (i.e., be able to run tests on deployed software). The (classical) lesson from software engineering is that everything should be tested “as early as possible” (unit testing, when building, when integrating, etc.).
The only way to run tests continuously is to automate them. Continuous building/integration and continuous testing are synonymous.
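As a minimal sketch of what "starting with unit tests" and automating them can look like, here is an example using Python's standard unittest module. The function apply_discount and its rules are hypothetical and only serve the illustration; in a test-driven workflow the two tests would be written first, and the function body filled in until they pass.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical function under test: reduce price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    # In test-driven development these tests exist before apply_discount does.
    def test_regular_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_invalid_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(200.0, 120)


def run_tests():
    """Run the suite programmatically, as an automated build step would."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A continuous integration job would call something like run_tests() on every build and refuse to go further when it returns False.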

5. Iterative progress through constrained « small batches »

Small batches yield better performance and more motivated teams. It is also the best way to keep teams small, which is known to be more efficient.
Time Boxing: you fit the content to the box and not the opposite! To keep a synchronous planning (delays break cooperation and are known to be very expensive), you keep to your sprint schedule and adapt the workload dynamically.
Incremental development is better at adapting to a continuously changing environment: each « small batch » gives the opportunity to listen, reflect and adapt the product strategy and priorities.
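The time boxing rule ("fit the content to the box") can be sketched in a few lines. This is only an illustration under simple assumptions: the backlog is already sorted by priority, each story has an estimate in points, and the names plan_sprint and capacity_points are made up for the example.

```python
def plan_sprint(backlog, capacity_points):
    """Fit the content to the box: the sprint length never changes,
    only the list of stories that go into it."""
    selected, deferred = [], []
    remaining = capacity_points
    for name, points in backlog:      # backlog is assumed sorted by priority
        if points <= remaining:
            selected.append(name)
            remaining -= points
        else:
            deferred.append(name)     # pushed to a later sprint, not squeezed in
    return selected, deferred


backlog = [("login", 5), ("search", 8), ("export", 3), ("themes", 5)]
selected, deferred = plan_sprint(backlog, capacity_points=16)
```

With these made-up numbers, the first three stories fill the box and "themes" is deferred; the sprint end date is never the variable that moves.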

6. « Show & Tell »: Love your code !

A software factory operates on the principle of fast changing code, which is why code must be easy to read and easy to understand, by all members of the team and not only the person who wrote it. Coding standards and pair programming are known techniques to produce easier to maintain code.
Team code reviews are a vital part of the LSF culture. On the one hand they create the right level of appropriation and common understanding that is necessary for the team to evolve its software asset. On the other hand, they create the “software pride” attitude, which is an engine for quality and innovation. This is very close to the “love of cars” that you find in a Toyota factory.
Code must not only be well structured and elegant, it must also be taken care of. The 5S practice of Lean Management applies to code : Sort (reduce the code base, apply quality metrics), systematize (organize into modules, packages & projects, apply coding guidelines), shine (clean up, improve test coverage, code reviews), standardize (make it into a set of practices), sustain (run the practices as part of the culture).

7. Use walls as tools for collective learning

Visual management is a great way for the team to communicate as a whole and to grab the dynamic “music sheet” of the product that is being built.
Walls and white boards are amazing collaborative tools. This is a proven scientific fact : a white surface that you can write on or pin things onto leverages many important features. Many people may work at the same time; multi-scale editing is easy (working at different levels of abstraction at the same time); information density is quite high; body language and dynamic processes are part of the experience.
Walls should be used to display all that is necessary to know about the software product, including its architecture and how it should operate. Architectural diagrams do not belong in folders or inside laptops, they should be displayed to contribute to the continuous training and education of the team.

8. Each team member produces what the other needs just in time

Use Kanban visuals to represent the team’s work in process (WIP). The first benefit of the Kanban display is to share the amount of ongoing work / use cases, make sure that nothing is forgotten, and avoid over-committing (accepting a workload that is too much).
The Kanban display is a grid where the different steps of the software development are represented, which makes transitions from one team member to another easier because everyone knows the other’s current workload (second step of maturity). This is also where the cross-functional nature of the team may be put to good use.
The last maturation step occurs when each team member adjusts her or his work according to the capacity of the next team member in the process chain. This is the “pull” control flow of lean management, which requires time to build but yields more efficiency through shorter development cycles.
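The three maturity steps above (visible WIP, known workload of the next step, pull flow) can be illustrated with a toy model. This is a sketch under assumptions of my own: the class name KanbanBoard, the step names and the WIP limits are hypothetical, and a real board carries much more information.

```python
from collections import deque


class KanbanBoard:
    def __init__(self, wip_limits):
        # wip_limits maps each step, in process order, to its maximum work in process
        self.steps = list(wip_limits)
        self.limits = dict(wip_limits)
        self.columns = {step: deque() for step in self.steps}

    def add(self, item):
        """Accept new work in the first step only if its WIP limit allows it."""
        first = self.steps[0]
        if len(self.columns[first]) >= self.limits[first]:
            return False              # avoid over-committing
        self.columns[first].append(item)
        return True

    def pull(self, step):
        """The given step pulls the oldest upstream item when it has capacity."""
        index = self.steps.index(step)
        if index == 0:
            raise ValueError("the first step has no upstream step")
        upstream = self.columns[self.steps[index - 1]]
        if not upstream or len(self.columns[step]) >= self.limits[step]:
            return None               # nothing to pull, or downstream is full
        item = upstream.popleft()
        self.columns[step].append(item)
        return item


board = KanbanBoard({"dev": 2, "test": 1})
accepted = [board.add("A"), board.add("B"), board.add("C")]
first_pull = board.pull("test")
second_pull = board.pull("test")
```

Here the third story is refused because "dev" is full, and the second pull returns nothing because "test" is already at capacity: work moves only when the next step can take it.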

9. Industrial tools for end-to-end software management

One cannot run iterative and fast development cycles without an industrial method and the use of many tools. Code management benefits from a large number of tools, many of which may be found in the open source community: version management, profiling, dependency tracking, software quality tracking, etc.
Configuration management is the cornerstone of continuous integration and continuous deployment. Software builds need to be fully automated, including the management of network, hardware, and other configuration options.
The endgame of the software factory is to build the DevOps target of programmable hardware.

10. Continuous software integration: streamlining without waiting or accumulated surprises

Continuous integration means building a fully functioning, complete system every day. The rhythm may vary, but the practice of building a system every night from the code that was committed during the day has shown its merits.
This means that the integration process, which used to be tedious, will be run hundreds of times during a development cycle. Therefore, it needs to be fully automated. This goes hand in hand with automated testing. The software developers find every morning the results of running the newly built system on the test library.
Continuous integration has the great cultural advantage of reminding everyone that the whole (system) is more important than the part (the daily pages of new code).

11. Team problem solving as collaboration & learning exercises

Team problem solving is used to solve problems and continuously increase the quality of the product. However, there are many other side benefits: team problem solving fosters collective learning of the functioning of the system that is being built.
Collaboration and collective learning are anything but easy. Therefore, they must follow a time-proven ritual, such as Kaizen. The lean practice of Kaizen does more than solving the quality problems that are being addressed: it creates a collective understanding of the system and the various roles within the team that prevents the occurrence of many future problems.
The practice of Kaizen revolves around the lean concept of standardized work. Standardization does not mean to freeze a way of doing things, it is an evolving body of knowledge that captures the collective know-how and is used to continuously set new challenges.

12. Deploy continuously to support iterative innovation

Following the DevOps principles, software products are deployed following a fast and regular rhythm (which is different for each company). The fast pace is critical to build the customer feedback learning loop.
Continuous delivery requires risk management through the principles of concentric community circles. You start with a small test population and you progressively extend to your complete customer base through steps that may be undone easily.
Each incremental development process (when you add small pieces after small pieces) is bound to produce junk over time. Thus refactoring and “tending the garden” are critical practices of agile development cycles. The new world of software is not about building a system but growing a platform.
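The "concentric circles" idea for continuous delivery can be sketched as a loop over growing populations. This is a simplified illustration, not a description of any specific deployment tool: the circle names, the is_healthy check and the choice to undo only the failing step are all assumptions made for the example.

```python
def staged_rollout(circles, is_healthy):
    """Deploy to each circle in turn, from the smallest to the largest.
    If a step looks unhealthy, undo that step and stop the rollout."""
    active = []
    for name in circles:
        active.append(name)           # deploy to this circle
        if not is_healthy(name):
            active.pop()              # each step must be easy to undo
            return {"completed": False, "active": active, "stopped_at": name}
    return {"completed": True, "active": active, "stopped_at": None}


circles = ["internal team", "beta testers", "all customers"]
# Hypothetical health check: pretend that a problem shows up at the last step.
report = staged_rollout(circles, is_healthy=lambda name: name != "all customers")
```

In this run the release stays active for the two inner circles and never reaches the complete customer base.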

This list is actually a simple collection of well-understood principles because it only represents what needs to be shared with the software development team. This is a “bottom-up” recipe which is an easy sell once the will to build an agile software factory is established. The hard part about lean software is stakeholder management (this is worth another post):

The role of management is deeply different from a traditional software development viewpoint.
Agility (incremental, test and learn) is a business, not a technical, mindset.
The benefits of a software factory (building a capability to continuously deliver an evolving platform as opposed to assembling a system) need to be shared and understood by the CEO.
Customer-centricity has to be deeply built into the company’s culture.

Sunday, April 27, 2014

Software Ecosystems and Application Sustainability

Today’s post is a set of simple, yet somehow deep, thoughts about the systemic nature of different ecosystems related to software. I was trained in the 80s to think about software costs in a “traditional software engineering” manner, using KPI, metrics and a spreadsheet (using cost models that were popularized by Peter Keen in the 90s). Somehow, this shows in my second book where one may find references to the classics (Barry Boehm, Capers Jones, etc.) in the bibliography section. What characterizes this way of thinking is that it is a static approach (even if the system changes, one thinks about a “snapshot” taken at a given time), controlled (one assumes that all stakeholders cooperate under a common control) and global (the cost model operates on what is thought to be a complete picture).

Life in this century is different when it comes to software. Software is a “live thing” (in the sense that it constantly evolves to adapt to its environment), mostly distributed, with a large number of stakeholders whose strategies escape the control of the software developer. Hence static should become dynamic, controlled should become collaborative and global should become distributed. The dynamic and complex relationships between the stakeholders and between the various distributed players who contribute to building a software piece yield the “ecosystem” label. This word is borrowed from biology and is a signature of complexity.

This observation is actually one of the reasons for the title of this blog “Biology of Distributed Information Systems”. It helps to think about software and information systems by borrowing concepts from biology and ecology, and it is definitely necessary to switch from a static to a dynamic analysis.

This is a first post on this topic, so I will keep things simple (hence somehow incomplete and arguable), and focus on three ecosystems:

The OS, platform and application ecosystem
The open-source ecosystem
The application developer ecosystem

I must apologize in advance to real software experts :). First, this is a post intended for readers with no specific skills or knowledge about software. Second, I will reason in an “abstract category” way that will not dive into interesting but difficult distinctions. For instance, in this post, an “app” is a piece of interactive content, whether it is a “true application” written for a smartphone, a simple HTML page, an HTML5 page decorated with JavaScript or a hybrid mobile app. For this first post, I am aiming at a “big picture from 10000 feet up”.

1. Software Global Ecosystem

The starting point of the argument is the need for software that evolves constantly. You may accept this at face value because it is a commonly heard argument. If you need convincing, the need for constant evolution comes both from the technology (the “what is possible today?” perimeter changes constantly) and the users. The complexity (i.e., richness) of software usage today means that the “user is in the driver seat”. That is, software needs to be co-designed with users, which is one of the principles of Lean Startup (to name one reference, hundreds would apply here). This leads to an incremental model, which in turn requires a (much) faster code production rate. I will assume that you buy this argument, since this is not the topic for this post, and it is a fairly common assumption.

From this we derive two key consequences for software in the 21st century (as opposed to the century when I was trained as a software developer):

Much more innovation is needed, which requires the help of an open innovation model. This leads to the concept of platforms, API and apps.
Much cheaper software production is needed, which itself requires a new level of sharing/reuse, based on a common/universal software architecture.

Software productivity, as defined by the cost to produce a function point, is improving slowly. This is a topic which I have addressed in depth in the previously quoted book. I have a vested interest in this question since I started my career as a computer scientist trying to build tools (languages) that would increase this productivity significantly. It turns out that the world needed a much more efficient way to reduce software development costs, as we just saw, and has found it through massive reuse, thanks to a common software architecture:

Open OS: open operating systems (Linux, Android, etc.) have become, thanks to open innovation in the form of open source, massive repositories of reusable value. For reasons that will become clear in Section 3, the world needs as few of those as possible.
Platforms: on top of OS, platforms have emerged. One may think of the most common open source tools such as Apache or MySQL, the GAFA platforms or the web browsers. The rise of platforms over the last 20 years is coupled with the rise of API (Application Programming Interfaces) and the associated technologies (XML, Web Services, REST, JSON and the like).
Apps (interactive content): this is what the end user sees and interacts with. The combination of SDKs and platform APIs, together with open-source libraries, has made the production of apps orders of magnitude more efficient than when I started writing code 30 years ago.

Please note that the word “platform” is usually ambiguous: it may mean a cloud/service platform (a back-end platform that serves a front-end app) or an open back-end that collaborates with many apps (or other service platforms) through APIs. Here I use platform in the second sense; the first is always included in the “app” perimeter because of “device agnosticism”. That is, to let the end user pick whichever device is more suited to her current context, each mobile app must have a dual cloud service platform. So, in this post, each “app” comes with an associated set of back-office/cloud services and I reserve “platform” for the implicit open innovation approach (cf. “L’âge de la multitude”).

2. Critical Mass and Software Usage

Even if software development is incremental, launching a successful product requires building a “critical mass” of value before one may start the “lean startup” positive feedback loop of co-creation. This is the “V” in MVP (Minimum Viable Product): there is a threshold of value brought to the user that one must pass before the percolation model of viral adoption (helped with proper marketing) may kick in. The analogy with living organisms (“SW as a living thing”) is relevant here: software requires growth, constant change and a sustainable equation of user growth. The MVP aims to reach the tipping point when viral adoption becomes sustainable (what Eric Ries calls “getting traction”).

The “critical mass of value” that is required for this tipping point varies considerably according to the pain point that the piece of software is trying to alleviate and according to the current state of the art. It may be measured in terms of function points (how rich an experience is necessary) and social weight. The complexity of modern experiences comes from their social nature (if you think about it, most apps on your smartphone nowadays have a social component). Hence installing a new habit requires fighting against Metcalfe’s Law, and displacing a previous social usage requires even more effort. The value critical mass may be large, which explains why some legacy Microsoft products such as Word, which I am still using to write technical papers, have not been easily replaced by open source alternatives.

To reach this “value critical mass”, someone must invest an initial significant software development effort. In a complex world, where the risk to fail is high, one must reduce this initial software development cost, as we explained earlier. This is also a signature of the complex environment we live in: we must switch from ROI (return on investment) to the affordable loss principle.

The percolation model shows why the ROI principle is no longer relevant: it is very hard to predict how well a successful MVP will percolate. The world is full of software startups with amazing valuations because their app found its way to massive deployment and usage. But it would be very difficult to predict such success a few years earlier, when the MVP was still a prototype.

The affordable loss approach means reducing the development cost to something that “you may afford to lose”. This means leveraging the previously shown layered architecture, and mostly leveraging the strength of the open source community. Open source software is a machine that is constantly churning out software platforms that start their own journey towards fame and critical mass. The quality of open source software is directly related to its adoption (because good software is built incrementally through feedback, a key axiom). Adoption is inversely proportional to genericity, hence using open source software is a “connected art”. One must understand the communities’ sizes and dynamics to select “the pieces of the puzzle”. Open source software yields by construction a nested / layered structure of software libraries with a combination of very high quality stable platforms for the common needs and more experimental gems with specific capabilities. This is why the word “ecosystem” is so relevant to open source software. One should not think of a catalog of free software, but of a nested hypergraph of communities, where a collaborative price must be paid. This price is measured in (participation) time and (code) sharing. This represents a culture shift for most software organizations, but the efficiency of those who “play the game right” is such that it becomes the only game worth playing.

This is just a hint of what a proper open source strategy should be, since there are so many aspects that I am not touching here. Open source is not only about software libraries, it is also about development methods, tools and processes, cloud computing and hardware, to name a few. As a follow-up, I would suggest reading Octo’s great book “Les Géants du Web”, taking a closer look at DevOps or leveraging the value that is found in the Open Compute project.

3. The App developer's equation

This last part looks at app sustainability from the perspective of the developer. I have used the following (abstract) equation in my presentations for the past 10 years:

Attractiveness = Market x Generosity x Value / Effort

In this equation,

Market is the size of the potential market that a given platform offers. This equation was developed to understand the fate of mobile OS, but it applies to all kinds of platforms, from cloud service platforms that propose APIs (another dream of phone operators for the past 10 years) to connected objects.
Generosity is the share of the revenue (app price or advertising revenue) that is sent back to the developer. For instance, Apple keeps a hefty 30%, whereas Android is more generous. Set-up costs should be factored in, for those platforms where some form of license or tools investment is still necessary.
Value is what the platform brings to the developer, as far as the end user experience is concerned. When I said “abstract” earlier, I meant that I don’t have a formula to measure “Value”. Most often, it is a judgment call from the developer, who evaluates what innovative and relevant services may be developed with the platform. There is subjectivity involved, such as the infamous “cool factor”, which favors sets of APIs & features from which “cool stuff” may be built.
Effort is the amount of time it takes to build “one unit of value”. This is where the difference stands between great players, who provide the right SDK, community support, testing services, and an efficient delivery (store) platform, and other less qualified players. Many API exposure programs have failed during the past 10 years because the effort expected from the developer was much too high. This is also why one should enroll help from qualified actors such as Apigee to develop an API strategy.
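To make the equation concrete, here is a small sketch that computes the score for two imaginary platforms. All numbers are invented for the illustration (remember that “Value” is a judgment call, not a measured quantity), and the names attractiveness, platform_a and platform_b are hypothetical.

```python
def attractiveness(market, generosity, value, effort):
    """Attractiveness = Market x Generosity x Value / Effort."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return market * generosity * value / effort


# Made-up figures: market in millions of users, generosity as the revenue share
# sent back to the developer, value and effort on arbitrary relative scales.
platforms = {
    "platform_a": {"market": 100.0, "generosity": 0.7, "value": 3.0, "effort": 2.0},
    "platform_b": {"market": 20.0, "generosity": 0.9, "value": 4.0, "effort": 1.0},
}

scores = {name: attractiveness(**factors) for name, factors in platforms.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these figures platform_a comes out ahead even though platform_b is more generous, brings more value and requires half the effort, because its market is five times larger.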

Roughly speaking, the equation is an abstract form of “Expected income / Expected Effort”. I have used this equation in the past years to explain why there would be two or at most three mobile open OS in the future, but it actually tells a lot of things. You may understand why Microsoft announced (at last) that Windows Phone would be free in the future. The “Market factor” yields a “winner take all” dynamic that we have observed for many platforms (the Matthew effect: the platform with the most users attracts more developers, benefiting from more open innovation, hence attracting more new customers). It also gives a few insights for a successful platform strategy:

Growth: get a critical mass of customers, as quickly as possible. To jumpstart the virtuous cycle (that is, to enroll app developers while the market is still not here), one must use rewards and gamification, such as hackathons.
Expose as much value from your APIs as possible, with a focus on differentiation, that is, exposing stuff which is both useful and not readily available elsewhere. This is probably the most strategic factor to predict the success of connected objects in the future, or larger domains such as the connected (smart) home.
Reduce the effort for the developer by embracing the open source and Web standards (languages, development tools, API styles, libraries, etc.). The adjacent illustration of the “Fun vs. Effort” is taken from a humorous site, but thinking in terms of value/effort is critical to system analysis.

This equation also gives a way to evaluate the intrinsic value of a platform. It follows from what was said earlier that the value is the capacity to generate revenue streams from apps. This, in turn, is mostly related to accumulated user data. This leads to the idea, reported by Henri Verdier, that data is the new code. The algorithms change constantly, and the best ones are produced by external developers (hence the open innovation paradigm). What changes more slowly is the API structure and what accumulates over time is the amount of user data. This is important enough to be worth a future post when I report about the work of our group at the NATF on Big Data.
