## 1. Introduction

This post is about the search
for sense in a small data set, such as the few measures that one accumulates
through self-tracking. Most commonly, finding sense in a small set of data means either
to see regular patterns or to detect causality. Many writers have
argued that our brains are hardwired for detecting patterns and causality.
Causality is our basic ingredient for modelling “how the world works”. Inferring
causality from our world experience is also a way of “compressing” our knowledge:
once you understand that an open flame hurts, you don’t need to recall the
experiences (and you don’t need so many of them to detect this causality). The
reason for selecting this topic for today’s blog post is my recent participation
in the ROADEF 2019
conference. I had the pleasure of chairing the machine learning session and
the opportunity to present my own work about machine
learning for self-tracking data.

We are so good at detecting causality that we are often fooled by random situations
and tend to see patterns when there are none. This is a common theme of Nassim Taleb’s
many books, and especially of his masterful first book, “Fooled by Randomness”.
The concept of “narrative fallacy” is critical when trying to extract sense
from observation: we need to remember that we love to see “stories” with a sense
because this is how our brain best remembers. There are two types of issues when
trying to mine short data sets for sense: the absence of statistical significance
because the data set is too small and the existence of our own narrative fallacy
and other cognitive biases. Today I will talk about data sets collected from self-tracking (i.e.
the continuous measurement of some of your characteristics, either explicitly while
logging observations or implicitly with connected sensors such as a connected
watch). The challenge of scientific methods when
searching for sense with such short time series is to know when to say “I don’t
know” when presented with a data set that shows no more patterns or
correlation than could be expected from any random distribution, without
falling into the “pitfall of narrative fallacy”. In short, the “Turing test” of
causality hunting is to reject random or quasi-random data input.

On the other hand, it is tempting to look for algorithms that could learn
and extract sense from short time series precisely because humans are good at
it. Humans are actually very good at short-term forecasting and quick learning,
which is without a doubt a consequence of evolution. Learning quickly to
forecast the path of a predator or a prey has been resolved with reinforcement
learning through “survival of the fittest” evolution. The topic of this blog
post – which I discussed at ROADEF – is how to make sense of a set of short
time series using machine learning algorithms. "Making sense" here is a combination
of forecasting and causality analysis which I will discuss later.

The second reason for this
blogpost is the wonderful book of Judea Pearl, “The
Book of Why”, which is a masterpiece about causality. The central idea of
the book is that causality does not “jump from the data” but requires an active
role from the observer. Judea Pearl introduces concepts which are deeply relevant
to this quest of search for sense with small data sets. Hunting for
causality is a “dangerous sport” for many reasons: most often you come back
empty-handed, sometimes you catch your own tail … and when successful, you most
often have little to show for your efforts. The two central ideas of causality
diagrams and the role of active observers are keys for unlocking some of the difficulties
of causality hunting with self-tracking data.

This post is organised as
follows. Section 2 is a very short and partial review of “The Book of Why”. I
will try to explain why Judea Pearl’s concepts are critical to causality hunting
with small data sets. These principles have been applied to the creation of a mobile application that generated the data
sets onto which the machine learning algorithms of Section 4 have been applied. This
application uses the concept of a causal diagram (renamed as quests) to embody
the user’s prior knowledge and assumptions. The self-measurement follows the principle of the “active observer” of Judea Pearl’s *P(X | do(Y))* definition. Section 3 dives into causality hunting through two other books and introduces the concept of Granger causality, which binds forecasting and causality detection. It also links the concepts of pleasure and surprise with self-learning, a topic that I borrow from Michio Kaku and which also creates a strong relationship between forecasting and causality hunting. As noted by many scholars, “the ability to forecast is the most common form of intelligence”. Section 4 talks briefly about machine learning algorithms for short time-series forecasting. Without diving too deep into the technical aspects, I show why prediction from small data sets is difficult and what success could look like, considering all the pitfalls that we have presented before. Machine learning from small data is not a topic for deep learning, thus I present an approach based on code generation and reinforcement learning.

## 2. Causality Diagrams - Learn by Playing

Judea Pearl is an amazing scientist with a long career in logic, models and causality that has earned him the Turing Award in 2011. His book reminds me of “Thinking, Fast and Slow” by Daniel Kahneman, a fantastic effort of summarising decades of research into a book that is accessible and very deep at the same time. “The Book of Why – The New Science of Cause and Effect”, by Judea Pearl and Dana Mackenzie, is a masterpiece about causality. It requires careful reading if one wants to extract the full value of the content, but can also be enjoyed as a simple, exciting
read. A great part of the book deals with paradoxes of causality and confounders, the variables that hide or explain causality relationships. In this section I will only talk about four key ideas that are relevant to hunting causality from small data.

The first key idea of this book is that **causality is not something cold and objective that one can extract from data without prior knowledge**. Pearl refutes a “Big Data hypothesis” that would assume that once you have enough data, you can extract all necessary knowledge. He proposes a model for understanding causality with three levels: the first level is association, what we learn from observation; the second level is intervention, what we learn by doing things; and the third level is counterfactuals, what we learn through imagining what-if scenarios. Trying to assess causality from observation only (for instance through conditional probabilities) is both very limited (it ignores the two upper levels) and quite tricky since, as recalled by Persi Diaconis: “*Our brains are just not wired to do probability problems, so I am not surprised there were mistakes*”. Judea Pearl talks in depth about the Monty Hall problem, a great puzzle/paradox popularised by Marilyn vos Savant, that has tricked many of the most educated minds. I urge you to read the book to learn for yourself from this great example. The author’s conclusion is: “*Decades’ worth of experience with this kind of questions has convinced me that, in both a cognitive and a philosophical sense, the idea of causes and effects is much more fundamental than the idea of probability*”.

**Judea Pearl introduces the key concept of a causal diagram to represent our prior preconception of causality**, which may be reinforced or invalidated by observation, following a true Bayesian model. A causal diagram is a directed graph that represents your prior assumptions, as a network of factors/variables that have causal influence on each other. A causal diagram is a hypothesis that actual data from observation will validate or invalidate. The central idea here is that you cannot extract a causal diagram from the data; you need to formulate a hypothesis that you will keep or reject later, because the causal diagram gives you a scaffold to analyse your data. This is why any data collection with the Knomee mobile app that I mentioned earlier starts with a causal diagram (a "quest").

**Another key insight from the author is the participating role of the user asking the causality question**, which is represented through the notation *P(X | do(Y))*. Where the conditional probability *P(X | Y)* is the probability of X being true when Y is observed, *P(X | do(Y))* is the probability of X when the user chooses to “do Y”. The stupid example of learning that a flame burns your hand is actually meaningful to understand the power of “learning by doing”. One or two experiences would not be enough to infer the knowledge from the conditional probability *P(hurts | hand in flame)*, while the experience *do(hand in flame)* means that you get very sure, very quickly, about *P(hurts | do(hand in flame))*. This observation is at the heart of personal self-tracking. The user is active and is not simply collecting data. She decides to do or not to do things that may influence the desired outcome. A user who is trying to decide whether drinking coffee affects her sleep is actually computing *P(sleep | do(coffee))*. Data collection is an experience, and it has a profound impact on the knowledge that may be extracted from the observations. This is very similar to the key concept that data is a circular flow in most AI smart systems. Smart systems are cybernetic systems with “a human inside”, not deductive linear systems that derive knowledge from static data. One should recognise here a key finding from the NATF reports on Artificial Intelligence and Machine Learning (see “Artificial Intelligence Applications Ecosystem: How to Grow Reinforcing Loops”).
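To make the difference between *P(X | Y)* and *P(X | do(Y))* concrete, here is a small simulation (entirely my own illustration, not from the book): “stress” acts as a confounder that raises both the chance of drinking coffee and the chance of sleeping badly, so the observational conditional probability overstates the causal effect that the intervention *do(coffee)* reveals.

```python
import random

random.seed(0)

def night(do_coffee=None):
    """Simulate one night. 'stress' is a confounder: it raises both
    the chance of drinking coffee and the chance of sleeping badly."""
    stress = random.random() < 0.4
    if do_coffee is None:                      # observational regime
        coffee = random.random() < (0.8 if stress else 0.2)
    else:                                      # do(coffee): we force the value
        coffee = do_coffee
    p_bad = 0.2 + (0.3 if coffee else 0.0) + (0.4 if stress else 0.0)
    return coffee, random.random() < p_bad

N = 100_000
obs = [night() for _ in range(N)]
p_obs = sum(bad for c, bad in obs if c) / sum(c for c, _ in obs)
p_do = sum(night(do_coffee=True)[1] for _ in range(N)) / N
print(f"P(bad sleep | coffee)     ~ {p_obs:.2f}")  # inflated by stress
print(f"P(bad sleep | do(coffee)) ~ {p_do:.2f}")   # the actual causal effect
```

The gap between the two estimates is exactly the confounding that a passive observer cannot remove but an active experimenter can.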

**The role of the participant is especially important because there is a fair amount of subjectivity when hunting for causality**. Judea Pearl gives many examples where the controlling factors should be influenced by the “prior belief” of the experimenters, at the risk of misreading the data. He writes: “*When causation is concerned, a grain of wise subjectivity tells us more about the real world than any amount of objectivity*”. He also insists on the importance of the data collection process. For him, one of the reasons statisticians are often the most puzzled by the Monty Hall paradox is the habit of looking at data as a flat static table: “*No wonder statisticians found this puzzle hard to comprehend. They are accustomed to, as R.A. Fisher (1922) put it, “the reduction of data” and ignoring the data-generation process*”. As told earlier, I strongly encourage you to read the book to learn about “confounders” – which are easy to explain with causal diagrams – and how they play a critical role in these types of causality paradoxes where the intuition is easily fooled. This is the heart of the book: “*I consider the complete solution of the confounders problem one of the main highlights of the Causal Revolution, because it has ended an era of confusion that has probably resulted in many wrong decisions in the past*”.

## 3. Finding a Diamond in the Rough

Another interesting book about hunting for causality is “Why: A Guide to Finding and Using Causes” by Samantha Kleinberg. This book starts with the idea that causality is hard to understand and hard to establish. Saying that “correlation is not causation” is not enough; understanding causation is more complex. Statistics do help to establish correlation, but people are prone to see correlation when none exists: “*many cognitive biases lead to us seeing correlations where none exist because we often seek information that confirms our beliefs*”. Even once we validate a correlation with statistical tools, one needs to be careful, because even seasoned statisticians “*cannot resist treating correlations as if they were causal*”.

**Samantha Kleinberg talks about Granger causality**: “*one commonly used method for inference with continuous-valued time series data is Granger*” – the idea that if a time delay is observed within a correlation, this may be a hint of causality. Judea Pearl warns us that this may simply be the case of a confounder with asymmetric delays, but in practice the test of Granger causality is not a proof, rather a good indicator for causality. The proper wording is that this test is a good indicator of “predictive causality”. More generally, if predicting a value Y from the past of X up to a non-null delay does a good job, we may say that there is a good chance of “predictive causality” from X to Y. This links the tool of forecasting to our goal of causality hunting. It is an interesting tool since it may be used with non-linear models (contrary to Granger causality) and multi-variate analysis. If we start from a causal diagram in Pearl’s sense, we may see if the root nodes (the hypothetical causes) may be used successfully to predict the future of the target nodes (the hypothetical “effects”). This is, in a nutshell, how the Knomee mobile app operates: it collects data associated with a causal diagram and uses forecasting as a possible indicator of “predictive causality”.
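To illustrate the idea (a sketch of my own, not the Knomee implementation), a minimal Granger-style test compares forecasting Y from its own past with forecasting Y from its own past plus the lagged X; the relative error reduction is the “predictive causality” signal:

```python
import random

def ols_fit(X, y):
    """Ordinary least squares via normal equations (naive Gaussian
    elimination, fine for the tiny systems used here)."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    b = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    for col in range(k):
        for row in range(col + 1, k):
            f = A[row][col] / A[col][col]
            for c in range(col, k):
                A[row][c] -= f * A[col][c]
            b[row] -= f * b[col]
    coef = [0.0] * k
    for row in reversed(range(k)):
        coef[row] = (b[row] - sum(A[row][c] * coef[c] for c in range(row + 1, k))) / A[row][row]
    return coef

def sse(X, y, coef):
    return sum((yi - sum(c * xi for c, xi in zip(coef, row))) ** 2
               for row, yi in zip(X, y))

def granger_score(x, y, lag=1):
    """Relative error reduction when adding lagged x to an autoregression
    of y: a value well above 0 hints at 'predictive causality' x -> y."""
    rows = range(lag, len(y))
    X_r = [[1.0, y[t - lag]] for t in rows]              # restricted model
    X_f = [[1.0, y[t - lag], x[t - lag]] for t in rows]  # full model
    target = [y[t] for t in rows]
    e_r = sse(X_r, target, ols_fit(X_r, target))
    e_f = sse(X_f, target, ols_fit(X_f, target))
    return (e_r - e_f) / e_r

random.seed(1)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.0]
for t in range(1, 200):               # y is driven by yesterday's x
    y.append(0.8 * x[t - 1] + random.gauss(0, 0.3))
noise = [random.gauss(0, 1) for _ in range(200)]
print(granger_score(x, y))       # large: x helps predict y
print(granger_score(noise, y))   # near zero: no predictive causality
```

The real Granger test adds a significance test on top of this error comparison; the sketch keeps only the core idea of nested forecasting models.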

The search for “why” with self-tracking data is quite interesting because most values (heart rate, mood, weight, number of steps, etc.) are nonstationary on a short time scale, but bounded on a long-time horizon while exhibiting a lot of daily variation. This makes detecting patterns more difficult, since this is quite different from extrapolating the movement of a predator from its previous positions (another short time series). We are much better at “understanding” patterns that derive from linear relations than those that emerge from complex causality loops with delays. **The analysis of delays between two observations (at the heart of Granger causality) is also a key tool in complex system analysis**. We must, therefore, bring it with us when hunting for causality. This is why the Knomee app includes multiple correlation/delay analysis to confirm or invalidate the causal hypothesis.
A few other pearls of wisdom about causality hunting with self-tracking may be found in “Self-Tracking”, the book by Gina Neff and Dawn Nafus. This reference book on the quantified self and self-tracking crosses a number of ideas that we have already exposed, such as the critical importance of the user in the tracking and learning process.

**Self-tracking** – a practice which is both very ancient and has shown value repeatedly – **is usually boring if no sense is derived from the experiment**. Making sense is either positive, such as finding causality, or negative, such as disproving a causality hypothesis. Because we can collect data more efficiently in the digital world, the quest for sense is even more important: “*Sometimes our capacity to gather data outpaces our ability to make sense of it*”. In the first part of the book we find this statement, which echoes nicely the principles of Judea Pearl: “*A further goal of this book is to show how self-experimentation with data forces us to wrestle with the uncertain line between evidence and belief, and how we come to decisions about what is and is not legitimate knowledge*”. We have talked about small data and short time series from the beginning because experience shows that most users collect data over short periods of time: “*Self-tracking projects should start out as brief experiments that are done, say, over a few days or a few weeks. While there are different benefits to tracking over months or years, a first project should not commit you for the long haul*”. This is why we shall focus in the next section on algorithms that can work robustly with a small amount of data.

**Self-tracking is foremost a learning experiment**: “*The norm within QS is that “good” self-tracking happens when some learning took place, regardless of what kind of learning it was*”. A further motive for self-tracking is often behavioural change, which is also a form of self-learning. As biologists tell us, learning is most often associated with pleasure and reward. As pointed out in a previous post, there is a continuous cycle – pleasure to desire to plan to action to pleasure – that is a common foundation for most learning in living creatures. Therefore, there is a dual dependency between pleasure and learning when self-tracking: one must learn (make sense out of the collected data) to stay motivated and to pursue the self-tracking experience (which is never very long), and this experience should reward the user with some form of pleasure, from surprise and fun to the satisfaction of learning something about yourself.

**Forecasting is a natural part of the human learning process**. We constantly forecast what will happen and learn by reacting to the difference. As explained by Michio Kaku, our sense of humour and the pleasure that we associate with surprises is a Darwinian mechanism that pushes us to constantly improve our forecasting (and modelling) abilities. We forecast continuously, we experience reality, and we enjoy the surprise (the difference between what happens and what we expected) as an opportunity to learn in a Bayesian way, that is, to revise our prior assumptions (our model of the world). The importance of curiosity as a key factor for learning is now widely accepted in the machine learning community, as illustrated by the ICML 2017 paper “Curiosity-driven Exploration by Self-supervised Prediction”. The role of surprise and fun in learning is another reason to be interested in forecasting algorithms. Forecasting the future, even if unreliable, creates positive emotions around self-tracking. This is quite general: we enjoy forecasts, which we see as games (in addition to their intrinsic value) – one can think of sports or politics as examples. A self-tracking forecasting algorithm that does a decent job (i.e., neither too wrong nor wrong too often) works in a way similar to our brain: it is invisible but acts as a time saver most of the time, and when wrong it signals a moment of interest. We shall now come back to the topic of forecasting algorithms for short time series, since we have established that they could play an interesting role in causality hunting.

## 4. Machine Generation of Robust Algorithms

Our goal in this last section
is to look at

**the design of robust algorithms for short time series forecasting**. Let us first define what I mean by*robust*, which will explain the metaphor which was proposed in the introduction. The following figure is extracted from my ROADEF presentation, it represents two possible types of “quests” (causal diagrams). Think of a quest as a variable that we try to analyse, together with other variables (the “factors”) which we think might explain the main variable. The vertical axis represents a classification of the variation that is observed into three categories: the random noise in red, the variation that is due to factors that were not collected in the sample in orange, and the green area is the part that we may associate with the factors. A robust algorithm is a forecasting algorithm that accepts an important part of randomness, to the point that many quests are “pointless” (remember the “Turing test of incomplete forecasting”). A robust algorithm should be able to exploit the positive influence of the factors (in green) when and if it exists. The picture makes it clear that we should not expect miracles: a good forecasting algorithm can only improve by a few percent over the simple prediction of the average values. What is actually difficult is to design an algorithm that is not worse – because of overfitting – than average prediction when given a quasi-random input (right column on the picture).
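As a sketch of this robustness criterion (names and data are mine, not Knomee’s): backtest a candidate forecaster on the last third of a pure-noise series, mimicking the evaluation protocol described later in this section, and require that it does not lose to the plain average.

```python
import random

def mean_baseline(history):
    """The simple prediction of the average value."""
    return sum(history) / len(history)

def overfit_forecaster(history):
    """Deliberately fragile: parrots the last value, chasing the noise."""
    return history[-1]

def backtest(series, forecaster, start_frac=2 / 3):
    """Average absolute forecasting error over the last third of the series,
    mimicking what a user would experience at the end of data collection."""
    start = int(len(series) * start_frac)
    errs = [abs(forecaster(series[:t]) - series[t])
            for t in range(start, len(series))]
    return sum(errs) / len(errs)

random.seed(2)
noise = [random.gauss(0, 1) for _ in range(120)]
e_mean = backtest(noise, mean_baseline)
e_last = backtest(noise, overfit_forecaster)
# On pure noise, a robust candidate must not lose to the mean baseline;
# the noise-chasing forecaster clearly does lose.
print(e_mean, e_last)
```

This is the “reject randomness” half of the criterion; the other half (exploiting the green part when it exists) needs data that actually contains signal.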
As the title of the section suggests, **I have experimented with machine generation of forecasting algorithms**. This technique is also called meta-programming: a first algorithm produces code that represents a forecasting algorithm. I have used this approach many times in the past decades, from complex optimisation problems to evolutionary game theory. I found it interesting many years ago when working on TV audience forecasting, because it is a good way to avoid over-fitting – a common plague when doing machine learning over a small data set – and to control the robustness properties thanks to evolutionary meta-techniques. The principle is to create a *term algebra* that represents instantiations and combinations of simpler algorithms. Think of it as a tool box. One lever of control (robustness and over-fitting) is to make sure that you only select “robust tools” to put in the box. This means that you may not obtain the best or most complex machine learning algorithms such as deep learning, but you ensure both “explainability” and control. The meta-algorithm is an evolutionary randomised search algorithm (similar to the Monte-Carlo Tree Search of AlphaZero) that may be sophisticated (using genetic combinations of terms) or simple (which is what we use for short time series).
The forecasting algorithm used by the Knomee app
is produced locally on the user's phone from the collected data. To test
robustness, we have collected self-tracking data over the past two years - for those of you who are curious to apply other techniques,
the data is available
on GitHub. The forecasting algorithm is the fixed-point of an evolutionary
search. This is very similar to reinforcement learning in the sense that each
iteration is directed by a fitness function that describes the accuracy of the
forecasting (modulo regularization,
as explained in the presentation).
The training protocol is defined as running the resulting forecasting algorithm on each sample of the data set (a quest) and for each time position from 2/3 to 3/3 of the ordered time series. In other words, the score that we use is the average precision of the forecasting that a user would experience in the last third of the data collection process. The term algebra that is used to represent and to generate forecasting algorithms is made of simple heuristics such as *regression* and *movingAverage*, of weekly and hourly time patterns, and of correlation analysis with threshold, cumulative and delay options. With the proper choice of meta-parameters to tune the evolutionary search (such as the fitness function or the depth and scope of local optimisation), this approach is able to generate a robust algorithm, that is, one that (1) generates better forecasts than the average (although not by much) and (2) is not thrown off by pseudo-random time series. Let me state clearly that this approach is not a “silver bullet”. I have compared the algorithm produced by this evolutionary search with the classical and simple machine learning approaches that one would use for time series: *regression*, *k-means* clustering and *ARMA*. I refer you to the great book “Machine Learning for the Quantified Self” by M. Hoogendoorn and B. Funk for a complete survey of how to use machine learning with self-tracking data. On regular data (such as sales time series), the classical algorithms perform slightly better than evolutionary code generation. However, when real self-tracking data is used with all its randomness, evolutionary search manages to synthesise robust algorithms, which none of the three classical algorithms are.

## 5. Conclusion

This topic
is more complex than many of the subjects that I address here. I have tried to
stay away from the truly technical aspects, at the expense of scientific precision.
I will conclude this post with a very short summary:

- **Causality hunting is a fascinating topic**. As we accumulate more and more data, and as Artificial Intelligence tools become more powerful, it is quite logical to hunt for causality and to build models that represent a fragment of our world knowledge through machine learning. This is, for instance, the heart of the Causality Link startup led by my friend Pierre Haren, which automatically builds knowledge graphs from textual data while extracting causal links, which are then used for deep situation analysis with scenarios.
- **Causality hunting is hard**, especially with small data and even more **with “Quantified Self” data**, because of the random nature of many of the time series that are collected with connected devices. It is also hard because we cannot track everything, and quite often what we are looking for depends on other variables (the orange part of the previous picture).
- **Forecasting is an interesting tool for causality hunting**. This is counter-intuitive since forecasting is close to impossible with self-tracking data. A better formulation would be: “a moderate amount of robust forecasting may help with causality hunting”. Forecasting gives a hint of “predictive causality”, in the sense of Granger causality, and it also serves to enrich the pleasure-surprise-discovery learning loop of self-tracking.
- **Machine code generation through reinforcement learning is a powerful technique for short time-series forecasting**. Code-generating algorithms try to assemble building blocks from a given set to match a given output. When applied to self-tracking forecasting, this technique allows us to craft algorithms that are robust to random noise (able to recognise the data as such) and able to extract a weak correlative signal from a complex (although short) data set.