This post originates from a report I
wrote this summer for the NATF; it summarizes what the ICT
commission learned from its two-year cycle of interviews about Big Data. The
commission decided in 2012 to investigate the impact of Big Data on the French
economy. Big Data is such a popular and hyped topic that it was not
clear, at first, whether a report was necessary. So many books and reports have
been published in the past two years (see the extract from the bibliography in
the next section) that it made little sense to add a new one. However,
throughout our interviews, and thanks to the FoE conference that the NATF co-organized last year
with the NAE in Chantilly – which included Big Data as one of its four topics – we
came to think that there was more to say about the topic than the usual piece
about new customer insights, data scientists, the Internet of Things and the
opportunities for new services.
Today I will focus on two ideas that may be
characterized as paradigm shifts. I will keep this post short, so it may be seen
as a "teaser" for the full report, which should be available soon. The first
paradigm shift is a new way to analyze data based on systemic cycles and the
real-time analysis of correlations. The old adage "correlation is not causation"
is made "obsolete" because data mining is embedded into an operational loop
that is judged, not by the amount of knowledge that is extracted, but by the
dollar amount of new business that is generated. The second paradigm shift is
about programming: Big Data entails a new way to produce code in a massively
distributed environment. This disruption comes from two fronts: on the one hand,
the massive volume of data requires distributing both data and computation;
on the other hand, algorithmic tuning needs to be automated. Algorithms are
grown as much as they are designed; they are derived from data through machine
learning.
The full report contains an overview of what
Big Data is, because a report from the NATF needs to be self-contained, but this is
not necessary for a blog post. I assume that the reader has some familiarity
with the Big Data topic. Otherwise, the Wikipedia page
is a good place to start, followed by the bibliography entries below, starting
with Viktor Mayer-Schönberger and Kenneth Cukier's book.
1. Big Data – A Revolution That Will
Transform How We Live
This is the title of the book by Viktor
Mayer-Schönberger and Kenneth Cukier, which covers the most famous paradigm
shift of Big Data: its ability to transform our lives, from hard
science, such as medicine, to marketing. The paradigm shift comes from the
combination of what technology makes possible today – the ability to analyze
very large amounts of heterogeneous data in a very short amount of time – and
the availability of the relevant data, namely the traces of our lives that
have become digital. Thanks to the web, to smartphones and to the technology that is
everywhere in our lives and objects, there is a continuous stream of
information that describes the world and our actions. Big Data may be described
as the information technology that is able to mine these "digital logs" and
produce new insights, opportunities and services. The constant improvement of
technology (from Moore's Law for processing to Kryder's Law for storage) is matched by the increase in
digital details about our lives. New connected objects, sensors and the growth
of the IoT (Internet of
Things) mean that we are only seeing the beginning of what Big Data will be
able to do in the future.
One of the reasons for not discussing these
themes further is that Viktor Mayer-Schönberger and Kenneth Cukier's book
covers them very well, so I encourage you to read it. The other
reason is that there are many other sources that develop these theses. Here is
a short extract from our report's bibliography:
[1] Commission Anne Lauvergeon. Un principe et sept ambitions pour l'innovation. 2013.
[2] John Podesta et al. Big Data: Seizing Opportunities, Preserving Values. Executive Office of the President, May 2014.
[3] François Bourdoncle. "Peut-on créer un écosystème français du Big Data ?", Le Journal de l'Ecole de Paris n°108, July/August 2014.
[5] Viktor Mayer-Schönberger, Kenneth Cukier. Big Data – A Revolution That Will Transform How We Live, Work and Think. John Murray, 2013.
[8] Gilles Babinet. L'ère numérique, un nouvel âge de l'humanité : Cinq mutations qui vont bouleverser votre vie. Le Passeur, 2014.
[10] Phil Simon. The Age of the Platform – How Amazon, Apple, Facebook and Google Have Redefined Business. Motion Publishing, 2011.
[12] IBM Global Business Services. "Analytics: Real-world use of big data in telecommunications – How innovative communication service providers are extracting value from uncertain data". IBM Institute for Business Value, April 2013.
[13] Thomas Dapp. "Big Data – The untamed force", Deutsche Bank Research, May 5, 2014.
[15] David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani. "The Parable of Google Flu: Traps in Big Data Analysis", Science, March 2014.
[16] Tim Harford. "Big data: are we making a big mistake?", Financial Times, March 28, 2014.
[19] Octo Technology. Les géants du Web : Culture – Pratiques – Architecture. Octo, 2012.
[21] Tony Hey, Stewart Tansley, Kristin Tolle (eds). The Fourth Paradigm – Data-Intensive Scientific Discovery. Microsoft Research, 2009.
[22] Max Lin. "Machine Learning on Big Data – Lessons Learned from Google Projects".
[24] Michael Kopp. "Top Performance Problems discussed at the Hadoop and Cassandra Summits", July 17, 2013.
[25] Eddy Satterly. "Big Data Architecture Patterns".
[26] Paul Ohm. "Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization". UCLA Law Review, Vol. 57, p. 1701, 2010.
[27] CIGREF. "Big Data : La vision des grandes entreprises", 2013.
2. Big Data as a new way to extract
value from data
As introduced earlier, the key idea is to
forget about causation, and even about extracting knowledge from data mining. A
large part of Viktor Mayer-Schönberger and Kenneth Cukier's book (chapter 5) is
dedicated to the difference between causation and correlation. But what we
found when we heard about "real Big Data systems", such as those from Google or
Criteo, is that these systems are anything but static data mining systems
aimed at producing knowledge. They are dynamic systems that constantly evolve
and "learn" from the data, in a controlled loop. Most Big Data statistical
tools, such as logistic regression, and machine learning algorithms still
look for correlations, but these correlations are not meant to hold
intrinsic value: they are inputs for action, whose effect is measured in
real time. Hence knowing the "why" of the correlation – whether there is
causation, reverse causation, or a complex dependency circle that is the
signature of complex systems – does not really matter. Nor does it matter (as
much as in the past) to be assured that the correlation is stable and will last
over time. The correlation detection is embedded into a control loop that is
evaluated through the overall financial result of the process.
This Big Data approach to data mining is
about practice and experiments. It is also broader in scope than statistics or
data mining, since the performance comes from the implementation of the whole
"dynamic system". It is also easy to deduce from this – and we do so in our report
in more detail – that building such a Big Data system is a team effort, with
a strong emphasis on technology, distributed systems and algorithms. From a marketing perspective, the goal is no
longer to produce "customer knowledge" – that is, to understand what the
customer wants – but to build an adaptive process that leads to better
customer satisfaction, which is actually less ambitious. If we consider the
Walmart example that is detailed in the previously mentioned book [5],
analyzing checkout receipts produces correlations that need not be thought
of as "customer insights". There is no need to find out why there is a correlation
between purchases of diapers and packs of beer. It is enough to put these two
items close to each other and see whether sales improve (which they did). In the virtual world
of the web, testing such hypotheses becomes child's play (no physical
rearranging necessary).
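To make this idea of a closed loop judged by money more concrete, here is a minimal sketch in Python (with hypothetical conversion rates and basket values, not Walmart's actual numbers) of an epsilon-greedy experiment that keeps choosing between two shelf layouts based only on the revenue they generate:

```python
import random

# Toy closed-loop experiment: two shelf layouts ("separate" vs "adjacent",
# i.e. beer next to diapers) are tried continuously. The loop is judged by
# revenue per visit, never by an explanation of *why* one layout works.
# The conversion rates and basket value below are made up for illustration.
TRUE_RATE = {"separate": 0.050, "adjacent": 0.065}
BASKET_VALUE = 30.0

counts = {"separate": 0, "adjacent": 0}
revenue = {"separate": 0.0, "adjacent": 0.0}

def choose(epsilon=0.1):
    """Epsilon-greedy: mostly exploit the layout with the best revenue per
    visit so far, but keep exploring the alternative."""
    if random.random() < epsilon or not all(counts.values()):
        return random.choice(list(counts))
    return max(counts, key=lambda k: revenue[k] / counts[k])

for _ in range(100_000):                  # each iteration = one customer visit
    layout = choose()
    counts[layout] += 1
    if random.random() < TRUE_RATE[layout]:
        revenue[layout] += BASKET_VALUE   # the only feedback signal is money

for layout in counts:
    print(layout, counts[layout], round(revenue[layout] / counts[layout], 3))
```

The loop never produces "knowledge" about the diapers-and-beer correlation; it simply converges toward the layout that pays, which is exactly the point made above.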
There is a
natural risk of error if one takes these correlations out of their dynamic-loop
context and tries to see them as predictions. Our experts, including
the speakers invited to the 2013 Frontiers of Engineering conference – in particular Jeff
Hammerbacher from Cloudera and Thomas Hofmann from Google – were adamant
that "Big Data produces information that one does not really
understand", which creates the risk of poor utilization. This is similar in a
sense to the phenomenon of "spurious correlations": when one analyzes a large cloud
of data points with a very large number of variables, one statistically finds a
large number of correlations without any meaningful significance. A great
illustration of this pitfall is given by the "Google Flu Trends" (GFT)
story. When they analyzed search queries that used words linked to the flu, Google
researchers found that they could forecast the propagation of flu epidemics with a
good level of accuracy. This claim was instantly absorbed, amplified and
orchestrated as a proof of Big Data's greatness. Then more detailed analyses [15]
[16] showed the limits and shortcomings of GFT. The article published
on Harvard's blog [15] is actually quite balanced. Although it is now clear
that GFT is not a panacea and shows more errors than other, simpler and more
robust forecasting methods, the article also states that: "The initial vision regarding GFT – that
producing a more accurate picture of the current prevalence of contagious
diseases might allow for life-saving interventions – is fundamentally correct,
and all analyses suggest that there is indeed valuable signals to be extracted".
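To illustrate the "spurious correlations" phenomenon mentioned above, here is a small self-contained simulation (pure Python, purely synthetic data): with a couple of hundred unrelated random variables measured on only twenty points, chance alone produces dozens of pairs that look strongly correlated.

```python
import random
import statistics

# Spurious correlations: 200 independent random variables, 20 observations
# each. None of them is related to any other, yet many pairs will show a
# "strong" sample correlation purely by chance.
random.seed(1)
n_points, n_vars = 20, 200
data = [[random.gauss(0, 1) for _ in range(n_points)] for _ in range(n_vars)]

def corr(a, b):
    """Pearson correlation coefficient of two equal-length samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

n_pairs = n_vars * (n_vars - 1) // 2
strong = sum(1 for i in range(n_vars) for j in range(i + 1, n_vars)
             if abs(corr(data[i], data[j])) > 0.6)
print(f"{strong} 'strong' (|r| > 0.6) correlations out of {n_pairs} pairs, "
      "all of them meaningless")
```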
3. Data is the new code
This great
catch phrase was delivered to us by Henri
Verdier, one of the many experts interviewed by the ICT
commission. When Google's teams look at a new startup, they compute its
valuation mostly from the volume and quality of the data that has been
collected, with much less regard for the code that has been developed. Data
valuation comes both from the difficulty of collecting the data and from its estimated
future usage potential. Code is seen as an "artefact" that is linked to the data,
both destined to change and easy to replace. In this new world of Big
Data, code is conceptually less important than the data it is applied to. It is
less important because it changes constantly, because it is made of simple
sub-linear algorithms (the only ones that can be run on petabytes of data)
and because it is the result of a learning loop (algorithms that are simple in their
principles, but with zillions of parameters that need to be fine-tuned through
experiments). To caricature the reasoning, Google could tell these young
startups: "We will buy your data and re-grow the code base using our own
methods".
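To give a feel for what "growing" code from data means, here is a toy sketch (synthetic click data, made-up coefficients, nothing from Google's actual pipelines) where the behavior of the predictor is not written by hand: it is a parameter vector learned one example at a time by stochastic gradient descent on a logistic model.

```python
import math
import random

random.seed(0)

def make_example():
    # Hypothetical features [bias, recency, past_purchases]; the "true" click
    # behaviour is hidden from the learner and only visible through the data.
    x = [1.0, random.random(), random.random()]
    p_click = 1 / (1 + math.exp(-(-2.0 + 1.5 * x[1] + 2.5 * x[2])))
    return x, 1 if random.random() < p_click else 0

w = [0.0, 0.0, 0.0]           # the "code" we grow from data
lr = 0.05                     # learning rate
for _ in range(300_000):      # online learning: one example at a time
    x, y = make_example()
    p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    for i in range(len(w)):   # gradient step on the logistic loss
        w[i] += lr * (y - p) * x[i]

print([round(wi, 2) for wi in w])   # roughly recovers [-2.0, 1.5, 2.5]
```

The logic that decides whether to show the ad no longer lives in a source file; it sits in the parameter vector, which can be re-grown at will from the data – which is the caricatured argument above.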
This new
way of programming does not apply only to new problems and new opportunities!
This approach may be used to re-engineer more classical "information systems"
through the combined application of commodity computing, massively parallel
programming and open-source data distribution software tools. This combination helps
win one or two orders of magnitude in cost, as was shown to us through
numerous examples. In the previously mentioned book [5], one may learn about
the VISA example, where Big Data technology was used to rebuild an IT process
with spectacular gains in cost and throughput. This "new way of programming",
centered on data, may be characterized in three ways:
- Massively parallel programming, because of the distribution of very large amounts of data. The data distribution architecture becomes the software architecture because, as the volume grows, it becomes important to avoid "moving data".
- Sub-linear algorithms (whose compute time grows more slowly than the amount of data they process) play a key role. We heard many great examples of the importance of such algorithms, such as the use of HyperLogLog counters in the computation of the diameter of Facebook's social graph (see the sketch after this list).
- Algorithms need to be adaptive and tuned incrementally from their data. Hence machine learning becomes a key skill when one works on very large amounts of data.
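To give a flavor of what sub-linear means in practice, here is a minimal, illustrative HyperLogLog-style counter (a simplified sketch that omits the small- and large-range corrections of the real algorithm): it estimates the number of distinct items in a stream using a few kilobytes of registers instead of storing the items themselves.

```python
import hashlib
import random

def hll_estimate(items, b=12):
    """Simplified HyperLogLog: estimate the number of distinct items with
    m = 2**b small registers, i.e. memory independent of the stream size."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (m - 1)                      # low b bits pick a register
        w = h >> b                             # remaining 64 - b bits
        rank = (64 - b) - w.bit_length() + 1   # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias correction for large m
    return int(alpha * m * m / sum(2.0 ** -r for r in registers))

if __name__ == "__main__":
    stream = [random.randrange(10_000_000) for _ in range(1_000_000)]
    print("exact   :", len(set(stream)))       # needs to hold every item
    print("estimate:", hll_estimate(stream))   # needs only 4096 registers
```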
During the 2013 FoE conference, Thomas Hofmann
told us that "Big Data is getting at the core of computer science".
This means that the
problems that receive the attention of today's computer scientists, such as
artificial intelligence, robotics, natural language processing, behavioral
learning, and so on, all require the combination of these three
characteristics: massive, hence distributed, computing power; huge
amounts of data; and machine learning to grow better algorithms.
This does not mean that "data
as the new code" is a universal approach to information system
design. Massive distribution of data has its own constraints and faces
fundamental (theory-proven) data architecture difficulties, known, for
instance, through the CAP theorem or the problem of snapshot algorithms in distributed computing. Simply put, when part of the
network becomes unavailable, it is not possible to get data consistency and
high availability at the same time. Big Data solutions usually pick a weakened form of
consistency or availability. The logical consequence is that there remain
domains – mostly those with transactional and ACID requirements, as well as very low latency requirements
– where more "classical" architectures are still better suited.
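As a toy illustration of this trade-off (not the API of any particular product), the sketch below shows two replicas of a key/value store that keep accepting writes during a network partition, thereby choosing availability, and then reconcile with a last-write-wins rule, i.e. a weakened form of consistency:

```python
import time

class Replica:
    """Toy replica that stays available during a partition and reconciles
    afterwards with a last-write-wins rule (a weakened form of consistency)."""
    def __init__(self, name):
        self.name = name
        self.data = {}                     # key -> (timestamp, value)

    def write(self, key, value):
        self.data[key] = (time.time(), value)

    def read(self, key):
        return self.data.get(key, (None, None))[1]

    def merge(self, other):
        # Last-write-wins: for each key, the most recent timestamp prevails.
        for key, (ts, value) in other.data.items():
            if key not in self.data or ts > self.data[key][0]:
                self.data[key] = (ts, value)

# Two replicas separated by a (simulated) network partition: both keep
# accepting writes, so readers may temporarily see divergent values.
a, b = Replica("A"), Replica("B")
a.write("cart", ["diapers"])
time.sleep(0.01)
b.write("cart", ["diapers", "beer"])       # concurrent write on the other side
print(a.read("cart"), "!=", b.read("cart"))

# Once the partition heals, the replicas exchange state and converge.
a.merge(b); b.merge(a)
print(a.read("cart"), "==", b.read("cart"))
```

Transactional, ACID-style systems make the opposite choice: they refuse the second write rather than let replicas diverge, which is why more "classical" architectures keep their place in those domains.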
4. Conclusion
These two
paradigm shifts are accompanied by changes in culture, methods and tools. To
finish this post and to summarize, I will mention three: agile methods, open-source
software culture and DevOps
(continuous build, integration and delivery). It is stunningly obvious that one
cannot succeed in developing the kind of closed-loop data mining systems described
in Section 2, nor the machine-learning, data-driven algorithms described in
Section 3, without the help of agile methods. Agile methods advocate incremental development in short
cycles, organized around multi-skilled teams where everyone
works in a synchronous manner on the same objective. The same argument applies to the proper use of
open-source software, though it is less obvious and comes from experience. It
is not about using great software available for free; it is more about using
continuously evolving software that represents the bleeding edge of Big Data
technology. It is even more about the open-source culture, which fits the
concept of software factories like a glove (the topic of another post!). To succeed in this new software
world, you need to love code and respect developers (and yes, I am aware of the paradox that this may
cause together with the "data is the new code" motto). Last, there is no
other way to produce continuously evolving code (which is implied by both these
paradigm shifts, but is also true in the digital
world at large) than switching to continuous build, integration and delivery, as
exemplified by DevOps. I am
quoting DevOps, but I could also refer
to the software factory idea (the two are closely related).
Not
surprisingly, the reader who is familiar with "Les Géants du Web" [19] from
Octo will recognize the culture that is common to the "Web Giants", that is,
the companies that are the most successful in the digital world, such as Amazon,
Google or Facebook. There is no surprise, because these companies are also
among the world leaders in leveraging the promises of Big Data. Agile (hence
collaborative) development is critical to Big Data, which requires mixing
computer science, information technology, statistics and domain
(business) skills. Because Big Data requires working on the "real" (large) data
sets, it calls for strong collaboration between IT operations and development.
This is made even more critical by the paradigm shift described in Section 2,
since algorithmic development and tuning is embedded into an operations cycle,
which is an obvious call for DevOps.
I will
conclude with a few of the recommendations from the NATF report:
- Big Data is much more than new opportunities to do new things. Fueled by a technology shift caused by drastic price drops (in storage and computing), the Big Data paradigm disrupts how information systems are built.
- Massive parallelism and huge volumes of data bring a new way of programming that is urgent to learn, and to teach. This goes for companies as well as for universities and engineering schools.
- The old world of cautious "analyze/model/design/run" waterfall projects is in competition with a new world of systemic "experiment/learn/try/check" loops. This is true for science [21] as well as for business. Hence, Big Data's new paradigms need to be taught in business schools as well as in engineering schools.
Readers who
are familiar with François Bourdoncle's theses on Big Data will recognize
them in these recommendations, which is quite natural since he was one of the
experts interviewed by the NATF's ICT commission.