1. Introduction
This post is about the search for sense in a small data set, such as the few measures that one accumulates through self-tracking. Most commonly, finding sense in a small set of data means either seeing regular patterns or detecting causality. Many writers have argued that our brains are hardwired to detect patterns and causality. Causality is our basic ingredient for modelling “how the world works”. Inferring causality from our experience of the world is also a way of “compressing” our knowledge: once you understand that an open flame hurts, you do not need to recall the individual experiences (and you do not need many of them to detect this causality). The reason for selecting this topic for today’s blog post is my recent participation in the ROADEF 2019 conference. I had the pleasure of chairing the machine learning session and the opportunity to present my own work on machine learning for self-tracking data.
We are so good at detecting causality that we are often fooled by random situations and tend to see patterns where there are none. This is a common theme of Nassim Taleb’s many books, especially his masterful first book “Fooled by Randomness”. The concept of “narrative fallacy” is critical when trying to extract sense from observation: we need to remember that we love to see “stories” with a sense because this is how our brain remembers best. There are two types of issues when trying to mine short data sets for sense: the absence of statistical significance, because the data set is too small, and the existence of our own narrative fallacy and other cognitive biases. Today I will talk about data sets collected from self-tracking (i.e. the continuous measurement of some of your characteristics, either explicitly by logging observations or implicitly with connected sensors such as a connected watch). The challenge for scientific methods when searching for sense in such short time series is to know when to say “I don’t know” when presented with a data set that shows no more pattern or correlation than what could be expected from any random distribution, without falling into the “pitfall of narrative fallacy”. In short, the “Turing test” of causality hunting is to reject random or quasi-random data input.
On the other hand, it is tempting to look for algorithms that could learn and extract sense from short time series precisely because humans are good at it. Humans are actually very good at short-term forecasting and quick learning, which is without a doubt a consequence of evolution. Learning quickly to forecast the path of a predator or a prey was resolved through the reinforcement learning of “survival of the fittest” evolution. The topic of this blog post – which I discussed at ROADEF – is how to make sense of a set of short time series using machine learning algorithms. “Making sense” here is a combination of forecasting and causality analysis, which I will discuss later.
The second reason for this blog post is the wonderful book by Judea Pearl, “The Book of Why”, which is a masterpiece about causality. The central idea of the book is that causality does not “jump from the data” but requires an active role from the observer. Judea Pearl introduces concepts which are deeply relevant to this search for sense with small data sets. Hunting for causality is a “dangerous sport” for many reasons: most often you come back empty-handed, sometimes you catch your own tail … and even when successful, you most often have little to show for your efforts. The two central ideas of causality diagrams and the role of the active observer are keys for unlocking some of the difficulties of causality hunting with self-tracking data.
This post is organised as follows. Section 2 is a very short and partial review of “The Book of Why”. I will try to explain why Judea Pearl’s concepts are critical to causality hunting with small data sets. These principles were applied to the creation of a mobile application that generated the data sets to which the machine learning algorithms of Section 4 have been applied. This application uses the concept of a causal diagram (renamed as a quest) to embody the user’s prior knowledge and assumptions. The self-measurement follows the principle of the “active observer” of Judea Pearl’s P(X | do(Y)) definition. Section 3 dives into causality hunting through two other books and introduces the concept of Granger causality, which binds forecasting and causality detection. It also links the concepts of pleasure and surprise with self-learning, a topic that I borrow from Michio Kaku and which also creates a strong relationship between forecasting and causality hunting. As noted by many scholars, “the ability to forecast is the most common form of intelligence”. Section 4 talks briefly about machine learning algorithms for short time-series forecasting. Without diving too deep into the technical aspects, I show why prediction from small data sets is difficult and what success could look like, considering all the pitfalls presented before. Machine learning from small data is not a topic for deep learning, so I present an approach based on code generation and reinforcement learning.
2. Causality Diagrams - Learn by Playing
Judea Pearl is an amazing scientist with a long career in logic, models and causality that earned him the Turing Award in 2011. His book reminds me of “Thinking, Fast and Slow” by Daniel Kahneman: a fantastic effort at summarising decades of research into a book that is accessible and very deep at the same time. “The Book of Why – The New Science of Cause and Effect”, by Judea Pearl and Dana Mackenzie, is a masterpiece about causality. It requires careful reading if one wants to extract the full value of the content, but it can also be enjoyed as a simple, exciting read. A great part of the book deals with paradoxes of causality and confounders, the variables that hide or explain causality relationships. In this section I will only talk about four key ideas that are relevant to hunting causality from small data.
The first key idea of this book is that causality is not a cold, objective fact that one can extract from data without prior knowledge. Pearl refutes the “Big Data hypothesis” that would assume that once you have enough data, you can extract all necessary knowledge. He proposes a model for understanding causality with three levels: the first level is association, what we learn from observation; the second level is intervention, what we learn by doing things; and the third level is counterfactuals, what we learn by imagining what-if scenarios. Trying to assess causality from observation only (for instance through conditional probabilities) is not only very limited (it ignores the two top levels) but also quite tricky since, as Persi Diaconis recalls: “Our brains are just not wired to do probability problems, so I am not surprised there were mistakes”. Judea Pearl talks in depth about the Monty Hall problem, a great puzzle/paradox popularised by Marilyn vos Savant, that has tricked many of the most educated minds. I urge you to read the book to learn for yourself from this great example. The author’s conclusion is: “Decades’ worth of experience with this kind of questions has convinced me that, in both a cognitive and philosophical sense, the idea of causes and effects is much more fundamental than the idea of probability”.
Judea Pearl introduces the key concept of the causal diagram to represent our prior preconception of causality, which may be reinforced or invalidated by observation, following a truly Bayesian model. A causal diagram is a directed graph that represents your prior assumptions, as a network of factors/variables that have causal influence on each other. A causal diagram is a hypothesis that actual data from observation will validate or invalidate. The central idea here is that you cannot extract a causal diagram from the data; you need to formulate a hypothesis that you will keep or reject later, because the causal diagram gives you a scaffolding to analyse your data. This is why any data collection with the Knomee mobile app that I mentioned earlier starts with a causal diagram (a "quest").
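To make the idea concrete, here is a minimal sketch of how such a quest could be represented as a data structure. The names and fields are purely illustrative (this is not the actual Knomee data model); the point is simply that the causal hypothesis is written down before any data is collected.

```python
# Minimal sketch of a "quest": a tiny causal diagram with hypothetical names,
# stated *before* data collection as a hypothesis to confirm or reject later.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Quest:
    outcome: str                                                 # the tracked variable we try to explain
    factors: List[str] = field(default_factory=list)             # assumed causal influences
    edges: List[Tuple[str, str]] = field(default_factory=list)   # (cause, effect) pairs

sleep_quest = Quest(
    outcome="sleep_quality",
    factors=["coffee", "exercise", "screen_time"],
    edges=[("coffee", "sleep_quality"),
           ("exercise", "sleep_quality"),
           ("screen_time", "sleep_quality")],
)
```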
Another key insight from the author is the emphasis on the participating role of the user asking the causality question, which is represented through the notation P(X | do(Y)): where the conditional probability P(X | Y) is the probability of X being true when Y is observed, P(X | do(Y)) is the probability of X when the user chooses to “do Y”. The simple example of learning that a flame burns your hand is actually meaningful for understanding the power of “learning by doing”. One or two experiences would not be enough to infer the knowledge from the conditional probability P(hurts | hand in flame), while the experience do(hand in flame) means that you become very sure, very quickly, about P(hurts | do(hand in flame)). This observation is at the heart of personal self-tracking. The user is active and is not simply collecting data. She decides to do or not to do things that may influence the desired outcome. A user who is trying to decide whether drinking coffee affects her sleep is actually computing P(sleep | do(coffee)). Data collection is an experience, and it has a profound impact on the knowledge that may be extracted from the observations. This is very similar to the key idea that data is a circular flow in most smart AI systems. Smart systems are cybernetic systems with “a human inside”, not deductive linear systems that derive knowledge from static data. One should recognise here a key finding from the NATF reports on Artificial Intelligence and Machine Learning (see “Artificial Intelligence Applications Ecosystem: How to Grow Reinforcing Loops”).
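A small simulation makes the difference between observing and doing tangible. The numbers and variable names below are invented for illustration: the toy model deliberately includes a confounder (stress drives both coffee drinking and poor sleep) and no real effect of coffee, so conditioning and intervening give different answers.

```python
# Sketch (illustrative numbers only): why P(X | Y) and P(X | do(Y)) differ
# when a confounder is present. Here "stress" drives both coffee drinking
# and poor sleep; coffee itself has no effect in this toy model.
import random

def night(force_coffee=None):
    stress = random.random() < 0.3
    coffee = force_coffee if force_coffee is not None else (random.random() < (0.8 if stress else 0.2))
    poor_sleep = random.random() < (0.7 if stress else 0.2)   # depends on stress only
    return coffee, poor_sleep

N = 100_000
obs = [night() for _ in range(N)]
p_obs = sum(s for c, s in obs if c) / max(1, sum(c for c, _ in obs))

do = [night(force_coffee=True) for _ in range(N)]
p_do = sum(s for _, s in do) / N

print(f"P(poor sleep | coffee)     ~ {p_obs:.2f}")   # inflated by the confounder (~0.52)
print(f"P(poor sleep | do(coffee)) ~ {p_do:.2f}")    # close to the baseline rate (~0.35)
```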
The role of the participant is especially important because there is a fair amount of subjectivity when hunting for causality. Judea Pearl gives many examples where the controlling factors should be influenced by the “prior belief” of the experimenters, at the risk of misreading the data. He writes: “When causation is concerned, a grain of wise subjectivity tells us more about the real world than any amount of objectivity”. He also insists on the importance of the data collection process. For him, one of the reasons statisticians are often the most puzzled by the Monty Hall paradox is the habit of looking at data as a flat, static table: “No wonder statisticians found this puzzle hard to comprehend. They are accustomed to, as R.A. Fisher (1922) puts it, ‘the reduction of the data’ and ignoring the data-generation process”. As said earlier, I strongly encourage you to read the book to learn about “confounders” – which are easy to explain with causal diagrams – and how they play a critical role in those causality paradoxes where intuition is easily fooled. This is the heart of the book: “I consider the complete solution of the confounding problem one of the main highlights of the Causal Revolution because it has ended an era of confusion that has probably resulted in many wrong decisions in the past”.
3. Finding a Diamond in the Rough
Another interesting book about hunting for causality is “Why: A Guide to Finding and Using Causes” by Samantha Kleinberg. This book starts with the idea that causality is hard to understand and hard to establish. Saying that “correlation is not causation” is not enough; understanding causation is more complex. Statistics do help to establish correlation, but people are prone to see correlation where none exists: “many cognitive biases lead to us seeing correlations where none exist because we often seek information that confirms our beliefs”. Once we validate a correlation with statistical tools, one still needs to be careful, because even seasoned statisticians “cannot resist treating correlations as if they were causal”.
Samantha Kleinberg talks about Granger causality: “one commonly used method for inference with continuous-valued time series data is Granger”, the idea that a time delay observed within a correlation may be a hint of causality. Judea Pearl warns us that this may simply be the case of a confounder with asymmetric delays, so in practice the Granger test is not a proof but a good indicator of causality. The proper wording is that this test is a good indicator of “predictive causality”. More generally, if predicting a value Y from the past of X, up to a non-null delay, does a good job, it may be said that there is a good chance of “predictive causality” from X to Y. This links the tool of forecasting to our goal of causality hunting. It is an interesting generalisation since it may be used with non-linear models (contrary to Granger causality) and multi-variate analysis. If we start from a causal diagram in Pearl’s sense, we may check whether the root nodes (the hypothetical causes) can be used successfully to predict the future of the target nodes (the hypothetical “effects”). This is, in a nutshell, how the Knomee mobile app operates: it collects data associated with a causal diagram and uses forecasting as a possible indicator of “predictive causality”.
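As an illustration, here is how one might run a Granger-style check with the statsmodels library on two synthetic self-tracking series. The variable names and the one-day delayed effect are fabricated for the example, and, as discussed above, a small p-value only hints at “predictive causality”; it does not prove causation.

```python
# A minimal Granger-style check on two synthetic self-tracking series.
# statsmodels tests whether the 2nd column helps predict the 1st one.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
n = 120                                    # ~4 months of daily measures
coffee = rng.normal(size=n)
sleep = np.roll(coffee, 1) * 0.6 + rng.normal(scale=0.8, size=n)   # 1-day delayed effect

data = np.column_stack([sleep, coffee])    # [candidate effect, candidate cause]
results = grangercausalitytests(data, maxlag=3)
# Small p-values at lag 1 hint at "predictive causality" from coffee to sleep,
# which is an indicator rather than a proof.
```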
The search for “why” with self-tracking data is quite interesting because most values (heart rate, mood, weight, number of steps, etc.) are non-stationary on a short time scale but bounded over a long time horizon, while exhibiting a lot of daily variation. This makes detecting patterns more difficult, since it is quite different from extrapolating the movement of a predator from its previous positions (another short time series). We are much better at “understanding” patterns that derive from linear relations than those that emerge from complex causality loops with delays. The analysis of delays between two observations (at the heart of Granger causality) is also a key tool in complex system analysis. We must, therefore, bring it with us when hunting for causality. This is why the Knomee app includes multiple correlation/delay analyses to confirm or invalidate the causal hypothesis.
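The sketch below shows the kind of correlation/delay analysis this refers to, in my own simplified form (it is not the Knomee implementation): scan a range of daily lags and keep the one with the strongest correlation between a factor and the target.

```python
# Illustrative lag scan: which delay (in days) maximises the correlation
# between a factor series and a target series?
import numpy as np

def best_lag_correlation(factor, target, max_lag=7):
    """Return (lag, correlation) maximising |corr(factor[t - lag], target[t])|."""
    best_lag, best_r = 0, 0.0
    for lag in range(max_lag + 1):
        x = factor[:len(factor) - lag] if lag else factor
        y = target[lag:]
        r = np.corrcoef(x, y)[0, 1]
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r

# Toy usage: in this synthetic example, today's steps influence tomorrow's mood.
rng = np.random.default_rng(1)
steps = rng.normal(size=60)
mood = 0.5 * np.concatenate([[0.0], steps[:-1]]) + rng.normal(scale=1.0, size=60)
print(best_lag_correlation(steps, mood))   # expect a lag of 1 with a positive correlation
```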
A few other pearls of wisdom about causality hunting with self-tracking may be found in the book by Gina Neff and Dawn Nafus. This reference book on the quantified self and self-tracking crosses a number of ideas that we have already exposed, such as the critical importance of the user in the tracking and learning process. Self-tracking – a practice which is both very ancient and has shown its value repeatedly – is usually boring if no sense is derived from the experiment. Making sense is either positive, such as finding causality, or negative, such as disproving a causality hypothesis. Because we can collect data more efficiently in the digital world, the quest for sense is even more important: “Sometimes our capacity to gather data outpaces our ability to make sense of it”. In the first part of this book we find this statement, which echoes nicely the principles of Judea Pearl: “A further goal of this book is to show how self-experimentation with data forces us to wrestle with the uncertain line between evidence and belief, and how we come to decisions about what is and is not legitimate knowledge”. We have talked about small data and short time series from the beginning because experience shows that most users only collect data over short periods of time: “Self-tracking projects should start out as brief experiments that are done, say, over a few days or a few weeks. While there are different benefits to tracking over months or years, a first project should not commit you for the long haul”. This is why we shall focus in the next section on algorithms that can work robustly with a small amount of data.
Self-tracking is foremost a learning experiment: “The norm within QS is that ‘good’ self-tracking happens when some learning took place, regardless of what kind of learning it was”. A further motive for self-tracking is often behavioural change, which is also a form of self-learning. As biologists tell us, learning is most often associated with pleasure and reward. As pointed out in a previous post, there is a continuous cycle, from pleasure to desire to plan to action and back to pleasure, that is a common foundation for most learning in living creatures. Therefore, there is a dual dependency between pleasure and learning when self-tracking: one must learn (make sense out of the collected data) to stay motivated and to pursue the self-tracking experience (which is never very long), and this experience should reward the user with some form of pleasure, from surprise and fun to the satisfaction of learning something about yourself.
Forecasting is a natural part of the human learning process. We constantly forecast what will happen and learn by reacting to the difference. As explained by Michio Kaku, our sense of humour and the pleasure that we associate with surprises is a Darwinian mechanism that pushes us to constantly improve our forecasting (and modelling) abilities. We forecast continuously, we experience reality, and we enjoy the surprise (the difference between what happens and what we expected) as an opportunity to learn in a Bayesian way, that is, to revise our prior assumptions (our model of the world). The importance of curiosity as a key factor in learning is now widely accepted in the machine learning community, as illustrated by the ICML 2017 paper “Curiosity-driven Exploration by Self-supervised Prediction”. The role of surprise and fun in learning is another reason to be interested in forecasting algorithms. Forecasting the future, even if unreliable, creates positive emotions around self-tracking. This is quite general: we enjoy forecasts, which we see as games (in addition to their intrinsic value) – one can think of sports or politics as examples. A self-tracking forecasting algorithm that does a decent job (i.e., not wrong too often nor by too much) works in a way similar to our brain: it is invisible and acts as a time saver most of the time, and when wrong it signals a moment of interest. We shall now come back to the topic of forecasting algorithms for short time series, since we have established that they could play an interesting role in causality hunting.
4. Machine Generation of Robust Algorithms
Our goal in this last section is to look at the design of robust algorithms for short time-series forecasting. Let us first define what I mean by robust, which will explain the metaphor proposed in the introduction. The following figure is extracted from my ROADEF presentation; it represents two possible types of “quests” (causal diagrams). Think of a quest as a variable that we try to analyse, together with other variables (the “factors”) which we think might explain the main variable. The vertical axis represents a classification of the observed variation into three categories: the random noise in red, the variation that is due to factors that were not collected in the sample in orange, and, in green, the part that we may associate with the factors. A robust algorithm is a forecasting algorithm that accepts an important part of randomness, to the point that many quests are “pointless” (remember the “Turing test” of causality hunting). A robust algorithm should be able to exploit the positive influence of the factors (in green) when and if it exists. The picture makes it clear that we should not expect miracles: a good forecasting algorithm can only improve by a few percent over the simple prediction of the average value. What is actually difficult is to design an algorithm that is not worse – because of overfitting – than average prediction when given a quasi-random input (the right column of the picture).
As the title of the section suggests, I have experimented with machine generation of forecasting algorithms. This technique is also called meta-programming: a first algorithm produces code that represents a forecasting algorithm. I have used this approach many times in the past decades, from complex optimisation problems to evolutionary game theory. I found it valuable many years ago when working on TV audience forecasting, because it is a good way to avoid over-fitting, a common plague when doing machine learning on a small data set, and to control the robustness properties through evolutionary meta-techniques. The principle is to create a term algebra that represents instantiations and combinations of simpler algorithms. Think of it as a toolbox. One lever of control (for robustness and over-fitting) is to make sure that you only select “robust tools” to put in the box. This means that you may not obtain the best or most complex machine learning algorithm, such as deep learning, but you ensure both “explainability” and control. The meta-algorithm is an evolutionary randomised search algorithm (similar to the Monte-Carlo Tree Search of AlphaZero) that may be sophisticated (using genetic combinations of terms) or simple (which is what we use for short time series).
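To give a flavour of the approach, here is a deliberately simplified sketch of the term-algebra idea, written for this post (it is not the actual Knomee algorithm): a small toolbox of robust building blocks, a generator of random terms, and a randomised search driven by a fitness function computed on the last third of the series.

```python
# Simplified sketch of code generation for forecasting: terms are built from a
# toolbox of robust building blocks, and a randomised search keeps the best one.
import random
import statistics

# Toolbox of simple, "robust" building blocks: each maps a history to a forecast.
def moving_average(k):
    return lambda hist: statistics.mean(hist[-k:]) if len(hist) >= k else statistics.mean(hist)

def last_value():
    return lambda hist: hist[-1]

def blend(f, g, w):
    return lambda hist: w * f(hist) + (1 - w) * g(hist)

def random_term():
    """Generate a random term of the algebra (a candidate forecaster)."""
    base = random.choice([moving_average(random.choice([2, 3, 5, 7])), last_value()])
    if random.random() < 0.5:
        return blend(base, moving_average(random.choice([3, 7])), random.random())
    return base

def fitness(forecaster, series):
    """Mean absolute error of one-step forecasts over the last third of the series."""
    start = (2 * len(series)) // 3
    errors = [abs(forecaster(series[:t]) - series[t]) for t in range(start, len(series))]
    return statistics.mean(errors)

def search(series, budget=200):
    """Randomised search: generate candidate terms, score them, keep the best."""
    return min((random_term() for _ in range(budget)), key=lambda f: fitness(f, series))
```

In the real setting the search is evolutionary (terms are mutated and recombined) and the fitness includes regularisation, but the skeleton is the same: generate candidate terms, score them, keep the best.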
The forecasting algorithm used by the Knomee app is produced locally on the user’s phone from the collected data. To test robustness, we have collected self-tracking data over the past two years – for those of you who are curious to apply other techniques, the data is available on GitHub. The forecasting algorithm is the fixed point of an evolutionary search. This is very similar to reinforcement learning in the sense that each iteration is directed by a fitness function that describes the accuracy of the forecast (modulo regularisation, as explained in the presentation). The training protocol consists of running the resulting forecasting algorithm on each sample of the data set (a quest) and for each time position from 2/3 to 3/3 of the ordered time series. In other words, the score that we use is the average precision of the forecasts that a user would experience during the last third of the data collection process. The term algebra that is used to represent and to generate forecasting algorithms is made of simple heuristics such as regression and movingAverage, of weekly and hourly time patterns, and of correlation analysis with threshold, cumulative and delay options. With the proper choice of meta-parameters to tune the evolutionary search (such as the fitness function or the depth and scope of local optimisation), this approach is able to generate a robust algorithm, that is, one that (1) generates better forecasts than the average (although not by much) and (2) is not thrown off by pseudo-random time series. Let me state clearly that this approach is not a “silver bullet”. I have compared the algorithm produced by this evolutionary search with the classical and simple machine learning approaches that one would use for time series: regression, k-means clustering and ARMA. I refer you to the great book “Machine Learning for the Quantified Self” by M. Hoogendoorn and B. Funk for a complete survey of how to use machine learning with self-tracking data. On regular data (such as sales time series), the classical algorithms perform slightly better than evolutionary code generation. However, when real self-tracking data is used with all its randomness, evolutionary search manages to synthesise robust algorithms, which none of the three classical algorithms are.
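The robustness criterion can also be illustrated with a short sketch that reuses the hypothetical search and fitness helpers from the previous snippet: compare the generated forecaster against the plain average on a quest with some structure and on pure noise. The second comparison is the one that matters for robustness.

```python
# Robustness check (illustrative, reusing the `search` and `fitness` sketches above):
# the generated forecaster should clearly help on a drifting series, and should
# stay close to the plain average on quasi-random data.
import random
import statistics

def mean_baseline():
    return lambda hist: statistics.mean(hist)

def compare(series):
    generated = search(series)
    return fitness(generated, series), fitness(mean_baseline(), series)

random.seed(42)
drifting = [0.0]
for _ in range(89):
    drifting.append(drifting[-1] + random.gauss(0.1, 0.3))   # slow drift: recent values matter
pure_noise = [random.gauss(0, 1) for _ in range(90)]

print("drifting quest:", compare(drifting))    # generated error should be well below the average's
print("pure noise    :", compare(pure_noise))  # generated error should not be much worse
```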
5. Conclusion
This topic
is more complex than many of the subjects that I address here. I have tried to
stay away from the truly technical aspects, at the expense of scientific precision.
I will conclude this post with a very short summary:
- Causality hunting is a fascinating topic. As we accumulate more and more data, and as Artificial Intelligence tools become more powerful, it is quite logical to hunt for causality and to build models that represent a fragment of our world knowledge through machine learning. This is, for instance, the heart of the Causality Link startup led by my friend Pierre Haren, which automatically builds knowledge graphs from textual data while extracting causal links, which are then used for deep situation analysis with scenarios.
- Causality hunting is hard, especially with small data and even more so with “Quantified Self” data, because of the random nature of many of the time series collected with connected devices. It is also hard because we cannot track everything, and quite often what we are looking for depends on other variables (the orange part of the previous picture).
- Forecasting is an interesting tool for causality hunting. This is counter-intuitive since forecasting is close to impossible with self-tracking data. A better formulation would be: “a moderate amount of robust forecasting may help with causality hunting”. Forecasting gives a hint of “predictive causality”, in the sense of Granger causality, and it also serves to enrich the pleasure-surprise-discovery learning loop of self-tracking.
- Machine code generation through reinforcement learning is a powerful technique for short time-series forecasting. Code-generating algorithms try to assemble building blocks from a given set to match a given output. When applied to self-tracking forecasting, this technique makes it possible to craft algorithms that are robust to random noise (i.e., that recognise such data for what it is) and able to extract a weak correlative signal from a complex (although short) data set.