This post originates from a report that I wrote this summer for the NATF; it summarizes what the ICT commission learned from its two-year cycle of interviews about Big Data. The commission decided in 2012 to investigate the impact of Big Data on the French economy. Big Data is such a popular and hyped topic that it was not clear, at first, whether a report was necessary. So many books and reports have been published in the past two years (see the extract from the bibliography in the next section) that it made little sense to add a new one. However, throughout our interviews, and thanks to the FoE conference that the NATF co-organized last year with the NAE in Chantilly (which included Big Data as one of its four topics), we came to think that there was more to say about the topic than the usual piece about new customer insights, data scientists, the Internet of Things and the opportunities for new services.
Today I will focus on two ideas that may be characterized as paradigm shifts. I will keep this post short, so it may be seen as a "teaser" for the full report, which should be available soon. The first paradigm shift is a new way to analyze data, based on systemic cycles and the real-time analysis of correlation. The old adage "correlation is not causation" is made "obsolete" because data mining is embedded into an operational loop that is judged not by the amount of knowledge that is extracted, but by the dollar amount of new business that is generated. The second paradigm shift is about programming: Big Data entails a new way to produce code in a massively distributed environment. This disruption comes from two fronts: on the one hand, the massive volume of data requires distributing both data and processing; on the other hand, algorithmic tuning needs to be automated. Algorithms are grown as much as they are designed; they are derived from data through machine learning.
The full report contains an overview of what Big Data is, because a report from the NATF needs to be self-contained, but this is not necessary for a blog post. I assume that the reader has some familiarity with the Big Data topic. Otherwise, the Wikipedia page is a good place to start, followed by the bibliography entries below, starting with Viktor Mayer-Schönberger and Kenneth Cukier's book.
1. Big Data – A Revolution That Will
Transform How We Live
This is the title of the book by Viktor Mayer-Schönberger and Kenneth Cukier, which covers the most famous paradigm shift of Big Data: its ability to transform our lives, from hard science, such as medicine, to marketing. The paradigm shift comes from the combination of what technology makes possible today (the ability to analyze very large amounts of heterogeneous data in a very short time) and the availability of the relevant data, which are the traces of our lives that have become digital. Thanks to the web, to smartphones, and to technology that is everywhere in our lives and objects, there is a continuous stream of information that describes the world and our actions. Big Data may be described as the information technology that is able to mine these "digital logs" and produce new insights, opportunities and services. The constant improvement of technology (from Moore's Law about processing to Kryder's Law about storage) is matched by the increase in digital details about our lives. New connected objects, sensors and the growth of the IoT (Internet of Things) mean that we are only seeing the beginning of what Big Data will be able to do in the future.
One of the reasons for not discussing these themes further is that Viktor Mayer-Schönberger and Kenneth Cukier's book covers them very well, and I encourage you to read it. The other reason is that there are many other sources that develop these theses. Here is a short extract from our report's bibliography:
[1] Commission Anne Lauvergeon. Un principe et sept ambitions pour l'innovation. 2013.
[2] John Podesta et al. Big Data: Seizing Opportunities, Preserving Values. Executive Office of the President, May 2014.
[3] François Bourdoncle. "Peut-on créer un écosystème français du Big Data ?", Le Journal de l'Ecole de Paris n°108, July/August 2014.
[5] Viktor Mayer-Schönberger, Kenneth Cukier. Big Data – A Revolution That Will Transform How We Live, Work and Think. John Murray, 2013.
[8] Gilles Babinet. L'ère numérique, un nouvel âge de l'humanité : Cinq mutations qui vont bouleverser votre vie. Le Passeur, 2014.
[10] Phil Simon. The Age of the Platform – How Amazon, Apple, Facebook and Google Have Redefined Business. Motion Publishing, 2011.
[12] IBM Global Business Services. "Analytics: Real-world use of big data in telecommunications – How innovative communication service providers are extracting value from uncertain data". IBM Institute for Business Value, April 2013.
[13] Thomas Dapp. "Big Data – The untamed force", Deutsche Bank Research, May 5, 2014.
[15] David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani. "The Parable of Google Flu: Traps in Big Data Analysis".
[16] Tim Harford. "Big data: are we making a big mistake?", Financial Times, March 28, 2014.
[19] Octo Technology. Les géants du Web : Culture – Pratiques – Architecture. Octo, 2012.
[21] Tony Hey, Stewart Tansley, Kristin Tolle (eds). The Fourth Paradigm – Data-Intensive Scientific Discovery. Microsoft Research, 2009.
[22] Max Lin. "Machine Learning on Big Data – Lessons Learned from Google Projects".
[24] Michael Kopp. "Top Performance Problems discussed at the Hadoop and Cassandra Summits", July 17, 2013.
[25] Eddy Satterly. "Big Data Architecture Patterns".
[26] Paul Ohm. "Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization". UCLA Law Review, Vol. 57, p. 1701, 2010.
[27] CIGREF. "Big Data : La vision des grandes entreprises", 2013.
2. Big Data as a new way to extract
value from data
As introduced earlier, the key idea is to forget about causation, or about trying to extract knowledge from data mining. A large part of Viktor Mayer-Schönberger and Kenneth Cukier's book (chapter 5) is dedicated to the difference between causation and correlation. But what we found when we heard about "real Big Data systems", such as those from Google or Criteo, is that these systems are anything but static data mining systems aiming at producing knowledge. They are dynamic systems that constantly evolve and "learn" from the data, in a controlled loop. Most Big Data statistical tools, such as logistic regression, or machine learning algorithms, are still looking for correlations; but these correlations are not meant to hold intrinsic value, they are input for action, whose effect is measured in real time. Hence knowing the "why" of a correlation (whether there is a causation, a reverse causation, or a complex dependency circle, which is the signature of complex systems) does not really matter. Nor does it matter (as much as in the past) to be assured that the correlation is stable and will last over time. The correlation detection is embedded into a control loop that is evaluated through the overall financial result of the process.
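To make this control loop concrete, here is a minimal sketch (my own illustration, not taken from the report): an epsilon-greedy loop in which actions suggested by correlation mining are applied, the revenue generated by each one is measured as it comes in, and the loop keeps whatever earns the most, without ever asking why it works. The action names and the revenue model are hypothetical.

```python
import random

# Hypothetical actions suggested by correlation mining (e.g. which offer to show a visitor).
actions = ["offer_A", "offer_B", "offer_C"]
revenue_log = {a: [] for a in actions}

def observed_revenue(action):
    # Stand-in for the real world: in production this would be the measured dollar
    # amount generated after the action is applied; the means below are invented.
    true_means = {"offer_A": 1.0, "offer_B": 1.3, "offer_C": 0.8}
    return random.gauss(true_means[action], 0.5)

def choose_action(epsilon=0.1):
    # Explore a little, otherwise exploit whatever currently earns the most per use.
    tried = [a for a in actions if revenue_log[a]]
    if random.random() < epsilon or len(tried) < len(actions):
        return random.choice(actions)
    return max(actions, key=lambda a: sum(revenue_log[a]) / len(revenue_log[a]))

for _ in range(10_000):                       # the operational loop
    a = choose_action()
    revenue_log[a].append(observed_revenue(a))

for a in actions:                             # judged in dollars, not in explanations
    print(a, len(revenue_log[a]), round(sum(revenue_log[a]) / len(revenue_log[a]), 3))
```

The loop never learns why "offer_B" works better; it only learns that it does, which is exactly the point made above.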
This Big Data approach to data mining is about practice and experiments. It is also broader in scope than statistics or data mining, since the performance comes from the implementation of the whole "dynamic system". It is also easy to deduce from this (and we do so in more detail in our report) that building such a Big Data system is a team effort, with a strong emphasis on technology, distributed systems and algorithms. From a marketing perspective, the goal is no longer to produce "customer knowledge" (which meant understanding what the customer wants) but to build an adaptive process that leads to better customer satisfaction, which is actually less ambitious. If we consider the Walmart example that is detailed in the previously mentioned book [5], analyzing checkout receipts produces correlations that need not be thought of as "customer insights". There is no need to find out why there is a correlation between purchases of diapers and beer packs. It is enough to put these two items close by and see if sales improve (which they did). In the virtual world of the Web, testing these hypotheses becomes child's play (no physical displacement is necessary).
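As an illustration (mine, with made-up numbers, not from the report or from Walmart), testing such a placement hypothesis on the Web boils down to a simple A/B test: split visitors randomly, measure the sales metric in each group, and check whether the observed lift is larger than what chance alone would explain.

```python
import math

def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-proportion z-test: is variant B (e.g. beer displayed next to diapers)
    really better than control A, or is the lift just noise?"""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    return p_a, p_b, (p_b - p_a) / se

# Hypothetical traffic split on a retail site.
p_a, p_b, z = ab_test(conversions_a=1800, visitors_a=100_000,
                      conversions_b=1950, visitors_b=100_000)
print(f"control {p_a:.2%}, variant {p_b:.2%}, lift {(p_b - p_a) / p_a:+.1%}, z = {z:.2f}")
# |z| > 1.96 corresponds to the usual 5% significance threshold.
```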
There is a natural risk of error if one takes these correlations out of their dynamic-loop context and tries to see them as predictions. Our different experts, including the speakers invited to the 2013 Frontiers of Engineering conference (in particular Jeff Hammerbacher from Cloudera and Thomas Hofmann from Google), were adamant that "Big Data produces information that one does not really understand", which creates the risk of poor utilization. This is similar in a sense to the phenomenon of "spurious correlations": when one analyzes a large cloud of data points with a very large number of variables, one finds, statistically, a large number of correlations without any meaningful significance. A great example of avoiding such a pitfall is given by the "Google Flu Trends" (GFT) story. When they analyzed search requests that used words linked to the flu, Google researchers found that they could forecast the propagation of flu epidemics with a good level of accuracy. This claim was instantly absorbed, amplified and orchestrated as a proof of Big Data's greatness. Then more detailed analyses [15] [16] showed the limits and shortcomings of GFT. The article published on Harvard's blog [15] is actually quite balanced. Although it is now clear that GFT is not a panacea and shows more errors than other simpler and more robust forecasting methods, the article also states that: "The initial vision regarding GFT – that producing a more accurate picture of the current prevalence of contagious diseases might allow for life-saving interventions – is fundamentally correct, and all analyses suggest that there is indeed valuable signals to be extracted".
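To illustrate the spurious-correlation phenomenon mentioned above, here is a small simulation (my own, not from the report): with enough purely random, unrelated variables, many variable pairs cross an apparently "interesting" correlation threshold by chance alone.

```python
import math
import random
from itertools import combinations

random.seed(42)
n_points, n_vars = 50, 100          # 50 observations of 100 unrelated random variables
data = [[random.gauss(0, 1) for _ in range(n_points)] for _ in range(n_vars)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

pairs = list(combinations(range(n_vars), 2))     # 4,950 variable pairs
strong = sum(1 for i, j in pairs if abs(pearson(data[i], data[j])) > 0.3)
print(f"{strong} of {len(pairs)} pairs look 'correlated' (|r| > 0.3) by pure chance")
```

None of these correlations means anything; only a loop that acts on them and measures the outcome, as described in the previous paragraphs, can separate the useful ones from the noise.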
3. Data is the new code
This great catchphrase was delivered to us by Henri Verdier, one of the many experts interviewed by the ICT commission. When Google's teams look at a new startup, they compute its valuation mostly from the volume and quality of the data that has been collected, with much less regard for the code that has been developed. Data valuation comes both from the difficulty of collecting the data and from its estimated future usage potential. Code is seen as an "artefact" that is linked to the data, both destined to change and easy to replace. In this new world of Big Data, code is conceptually less important than the data it is applied to. It is less important because it changes constantly, because it is made of simple sub-linear algorithms (the only ones that can be run on petabytes of data) and because it is the result of a learning loop (algorithms that are simple in their principles, but with zillions of parameters that need to be fine-tuned through experiments). To caricature the reasoning, Google could tell these young startups: "We will buy your data and re-grow the code base using our own methods".
This new way of programming does not apply only to new problems and new opportunities! The approach may be used to re-engineer more classical "information systems" through the combined application of commodity computing, massively parallel programming and open-source data distribution software tools. This combination helps win one or two orders of magnitude with respect to cost, as was shown to us through numerous examples. In the previously mentioned book [5], one may learn about the VISA example, where Big Data technology was used to re-build an IT process with spectacular gains in cost and throughput. This "new way of programming", centered on data, may be characterized in three ways:
- Massively parallel programming, because of the distribution of very large amounts of data. The data distribution architecture becomes the software architecture because, as the volume grows, it becomes important to avoid "moving data".
- Sub-linear algorithms (whose compute time grows more slowly than the amount of data they process) play a key role. We heard many great examples of the importance of such algorithms, such as the use of HyperLogLog counters in the computation of the Facebook social graph's diameter (see the sketch after this list).
- Algorithms need to be adaptive and tuned incrementally from their data. Hence machine learning becomes a key skill when one works on very large amounts of data.
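To make the second point concrete, here is a minimal HyperLogLog-style counter (my own simplified sketch, not Facebook's implementation; strictly speaking its gain is in memory rather than time). It estimates the number of distinct items in a stream of any size using only a small, fixed array of registers; the register count and the test values below are arbitrary.

```python
import hashlib
import math

class HyperLogLog:
    """Approximate distinct counting in a fixed, tiny amount of memory."""
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                          # 2^b registers (1024 here)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction, valid for m >= 128

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = h & (self.m - 1)                     # low b bits pick a register
        w = h >> self.b                          # remaining 64-b bits feed the rank
        rho = (64 - self.b) - w.bit_length() + 1 # position of the leftmost 1-bit
        self.registers[j] = max(self.registers[j], rho)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:        # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(500_000):
    hll.add(f"user-{i % 100_000}")               # 100,000 distinct users, each seen 5 times
print(hll.count())                               # roughly 100,000 (typical error: a few percent)
```

Two such counters can also be merged by taking the register-wise maximum, which is why they distribute well across a cluster.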
During the 2013 FoE conference, Thomas Hofmann told us that "Big data is getting at the core of computer science". This means that all the current problems that receive the attention of today's computer scientists, such as artificial intelligence, robotics, natural language processing, behavioral learning, and so on, require the combination of these three characteristics: the need for massive, hence distributed, computing power, huge amounts of data, and machine learning to grow better algorithms.
This does not mean that "data as the new code" is a universal approach to information system design. Massive distribution of data has its own constraints and faces fundamental (theory-proven) data architecture difficulties. They are known, for instance, through the CAP theorem or the problem of snapshot algorithms in distributed computing. Simply put, when part of the network becomes unavailable (a partition), it is not possible to preserve both full data consistency and full availability at the same time. Big Data solutions usually pick a weakened form of consistency or availability. The logical consequence is that there remain domains (mostly related to transactions and ACID requirements, as well as very low latency requirements) where "more classical" architectures are still better suited.
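As a toy illustration of what "a weakened form of consistency" means in practice (my own sketch, not tied to any particular data store), here are two replicas that both keep accepting writes during a partition and reconcile afterwards with a last-writer-wins rule: availability is preserved, but one concurrent update is silently lost.

```python
import time

class Replica:
    """A toy key-value replica that stays available during a network partition."""
    def __init__(self, name):
        self.name = name
        self.store = {}                          # key -> (timestamp, value)

    def write(self, key, value):
        self.store[key] = (time.time(), value)

    def read(self, key):
        return self.store.get(key, (None, None))[1]

    def merge(self, other):
        # Last-writer-wins reconciliation once the partition heals.
        for key, (ts, value) in other.store.items():
            if key not in self.store or ts > self.store[key][0]:
                self.store[key] = (ts, value)

r1, r2 = Replica("r1"), Replica("r2")

# During the partition, both sides accept a write for the same key (availability wins).
r1.write("cart:42", ["diapers"])
time.sleep(0.01)
r2.write("cart:42", ["beer"])

# Partition heals: both replicas converge, but r1's update is lost (weakened consistency).
r1.merge(r2); r2.merge(r1)
print(r1.read("cart:42"), r2.read("cart:42"))    # ['beer'] ['beer']
```

A transactional, ACID system would instead refuse one of the two writes, trading availability for consistency, which is exactly the trade-off mentioned above.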
4. Conclusion
These two paradigm shifts are accompanied by changes in culture, methods and tools. To finish this post and to summarize, I would mention three: agile methods, open-source software culture and DevOps (continuous build, integration and delivery). It is stunningly obvious that one cannot succeed in developing the kind of closed-loop data mining systems described in Section 2, nor the machine-learning, data-driven algorithms described in Section 3, without the help of agile methods. Agile methods advocate incremental and short-batch development cycles, organized around multi-skill teams, where everyone works in a synchronous manner on the same objective. The same argument applies to the proper use of open-source software, though it is less obvious and comes from experience. It is not about using great software available for free; it is more about using continuously evolving software that represents the bleeding edge of Big Data technology. It is even more about the open-source culture, which fits the concept of software factories like a glove (the topic of another post!). To succeed in this new software world, you need to love code and respect developers (and yes, I am aware of the paradox that this may create together with the "data is the new code" motto). Last, there is no other way to produce continuously evolving code (which is implied by both these paradigm shifts, but is also true in the digital world) than switching to continuous build, integration and delivery, as exemplified by DevOps. I mention DevOps, but I could also refer to the software factory idea (the two are closely related).
Not surprisingly, the reader who is familiar with "Les Géants du Web" [19] from Octo will recognize the culture that is common to the "Web Giants", that is, the companies that are the most successful in the digital world, such as Amazon, Google or Facebook. There is no surprise here, because these companies are also amongst the world leaders in leveraging the promises of Big Data. Agile (hence collaborative) development is critical to Big Data, which requires mixing computer science, information technology, statistics and domain (business) skills. Because Big Data requires working on the "real" (large) sets of data, it implies a strong collaboration between IT operations and development. This is made even more critical by the paradigm shift described in Section 2, since algorithmic development and tuning are embedded into an operational cycle, which is an obvious call for DevOps.
I will conclude with a few of the recommendations from the NATF report:
- Big Data is much more than new opportunities to do new things. Fueled by a technology shift caused by drastic price drops (storage and computing), the Big Data paradigm disrupts how information systems are built.
- Massive parallelism and huge volumes of data are bringing a new way of programming that it is urgent to learn, and to teach. This goes for companies as well as for universities and engineering schools.
- The old world of cautious "analyze/model/design/run" waterfall projects is in competition with a new world of systemic "experiment/learn/try/check" loops. This is true for science [21] as well as for business. Hence, Big Data's new paradigms need to be taught in business schools as well as in engineering schools.
Readers who are familiar with François Bourdoncle's theses on Big Data will recognize them in these recommendations, which is quite natural since he was one of the experts interviewed by the ICT commission of the NATF.