Tuesday, April 9, 2019

Hunting for Causality in Short Time Series


1. Introduction


This post is about the search for sense in a small data set, such as the few measures that one accumulates through self-tracking. Most commonly, finding sense in a small set of data means either seeing regular patterns or detecting causality. Many writers have argued that our brains are hardwired for detecting patterns and causality. Causality is our basic ingredient for modelling “how the world works”. Inferring causality from our experience of the world is also a way of “compressing” our knowledge: once you understand that an open flame hurts, you do not need to recall the individual experiences (and you do not need many of them to detect this causality). The reason for selecting this topic for today’s blog post is my recent participation in the ROADEF 2019 conference, where I had the pleasure of chairing the machine learning session and the opportunity to present my own work on machine learning for self-tracking data.

We are so good at detecting causality that we are often fooled by random situations and tend to see patterns where there are none. This is a common theme of Nassim Taleb’s many books, especially his masterful first book, “Fooled by Randomness”. The concept of “narrative fallacy” is critical when trying to extract sense from observation: we need to remember that we love to see “stories” that make sense, because this is how our brain best remembers. There are two types of issues when trying to mine short data sets for sense: the absence of statistical significance, because the data set is too small, and our own narrative fallacy and other cognitive biases. Today I will talk about data sets collected from self-tracking (i.e., the continuous measurement of some of your characteristics, either explicitly by logging observations or implicitly with connected sensors such as a connected watch). The challenge for scientific methods when searching for sense in such short time series is to know when to say “I don’t know” when presented with a data set that shows no more pattern or correlation than what could be expected from a random distribution, without falling into the pitfall of narrative fallacy. In short, the “Turing test” of causality hunting is to reject random or quasi-random data input.
On the other hand, it is tempting to look for algorithms that could learn and extract sense from short time series precisely because humans are good at it. Humans are actually very good at short-term forecasting and quick learning, which is without a doubt a consequence of evolution: learning to quickly forecast the path of a predator or a prey was solved by reinforcement learning through “survival of the fittest”. The topic of this blog post – which I discussed at ROADEF – is how to make sense of a set of short time series using machine learning algorithms. “Making sense” here is a combination of forecasting and causality analysis, which I will discuss later.
The second reason for this blog post is the wonderful book by Judea Pearl, “The Book of Why”, a masterpiece about causality. The central idea of the book is that causality does not “jump from the data” but requires an active role from the observer. Judea Pearl introduces concepts that are deeply relevant to this search for sense in small data sets. Hunting for causality is a “dangerous sport” for many reasons: most often you come back empty-handed, sometimes you catch your own tail … and when successful, you often have little to show for your efforts. The two central ideas of causality diagrams and the role of the active observer are keys to unlocking some of the difficulties of causality hunting with self-tracking data.

This post is organised as follows. Section 2 is a very short and partial review of “The Book of Why”. I will try to explain why Judea Pearl’s concepts are critical to causality hunting with small data sets. These principles have been applied to the creation of a mobile application, Knomee, which generated the data sets to which the machine learning algorithms of Section 4 have been applied. This application uses the concept of a causal diagram (renamed a quest) to embody the user’s prior knowledge and assumptions. The self-measurement follows the principle of the “active observer” of Judea Pearl’s P(X | do(Y)) definition. Section 3 dives into causality hunting through two other books and introduces the concept of Granger causality, which binds forecasting and causality detection. It also links the concepts of pleasure and surprise with self-learning, a topic that I borrow from Michio Kaku and which also creates a strong relationship between forecasting and causality hunting. As noted by many scholars, “the ability to forecast is the most common form of intelligence”. Section 4 talks briefly about machine learning algorithms for short time-series forecasting. Without diving too deep into the technical aspects, I show why prediction from small data sets is difficult and what success could look like, considering all the pitfalls presented before. Machine learning from small data is not a topic for deep learning, so I present an approach based on code generation and reinforcement learning.

2. Causality Diagrams - Learn by Playing



Judea Pearl is an amazing scientist with a long career in logic, models and causality, which earned him the Turing Award in 2011. His book reminds me of “Thinking, Fast and Slow” by Daniel Kahneman, a fantastic effort of summarising decades of research into a book that is accessible and very deep at the same time. “The Book of Why – The New Science of Cause and Effect”, by Judea Pearl and Dana Mackenzie, is a masterpiece about causality. It requires careful reading if one wants to extract the full value of its content, but it can also be enjoyed as a simple, exciting read. A great part of the book deals with paradoxes of causality and with confounders, the variables that hide or explain causality relationships. In this section I will only talk about four key ideas that are relevant to hunting causality in small data.

The first key idea of this book is that causality is not a cold, objective fact that one can extract from data without prior knowledge. Pearl refutes a “Big Data hypothesis” that would assume that once you have enough data, you can extract all necessary knowledge. He proposes a model for understanding causality with three levels: the first level is association, what we learn from observation; the second level is intervention, what we learn by doing things; and the third level is counterfactuals, what we learn by imagining what-if scenarios. Trying to assess causality from observation only (for instance through conditional probabilities) is both very limited (it ignores the two upper levels) and quite tricky since, as Persi Diaconis recalled: “Our brains are just not wired to do probability problems, so I am not surprised there were mistakes”. Judea Pearl talks in depth about the Monty Hall problem, a great puzzle/paradox popularised by Marilyn vos Savant, that has tricked many of the most educated minds. I urge you to read the book and learn for yourself from this great example. The author’s conclusion is: “Decades’ worth of experience with these kinds of questions has convinced me that, in both a cognitive and a philosophical sense, the idea of causes and effects is much more fundamental than the idea of probability”.
Judea Pearl introduces the key concept of a causal diagram to represent our prior preconception of causality, which may be reinforced or invalidated by observation, following a truly Bayesian model. A causal diagram is a directed graph that represents your prior assumptions, as a network of factors/variables that have causal influence on each other. A causal diagram is a hypothesis that actual data from observation will validate or invalidate. The central idea here is that you cannot extract a causal diagram from the data; you need to formulate a hypothesis that you will keep or reject later, because the causal diagram gives you a scaffold to analyse your data. This is why any data collection with the Knomee mobile app that I mentioned earlier starts with a causal diagram (a “quest”).
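To make the notion of a quest concrete, here is a minimal sketch (my own illustration in Python; the names are hypothetical and not Knomee’s actual data model) of the simplest form such a causal diagram could take, a star of factors pointing at one target variable:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Quest:
    """A 'quest': one target variable plus the factors hypothesised to
    influence it, i.e. a star-shaped causal diagram where every factor
    points to the target. It encodes the user's prior hypothesis, to be
    confirmed or rejected by the data collected later."""
    target: str                                        # variable to explain, e.g. sleep quality
    factors: List[str] = field(default_factory=list)   # hypothesised causes

# Example: does coffee or late screen time affect my sleep?
sleep_quest = Quest(target="sleep", factors=["coffee", "screen_time"])
print(sleep_quest)
```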
Another key insight from the author is to emphasise the participating role of the user asking the causality question, which is represented through the notation P(X | do(Y)). Where the conditional probability P(X | Y) is the probability of X being true when Y is observed, P(X | do(Y)) is the probability of X when the user chooses to “do Y”. The simple example of learning that a flame burns your hand is actually meaningful for understanding the power of “learning by doing”. One or two experiences would not be enough to infer the knowledge from the conditional probability P(hurts | hand in flame), while the experience do(hand in flame) means that you become very sure, very quickly, about P(hurts | do(hand in flame)). This observation is at the heart of personal self-tracking. The user is active and is not simply collecting data. She decides to do or not to do things that may influence the desired outcome. A user who is trying to decide whether drinking coffee affects her sleep is actually estimating P(sleep | do(coffee)). Data collection is an experience, and it has a profound impact on the knowledge that may be extracted from the observations. This is very similar to the key idea that data is a circular flow in most smart AI systems. Smart systems are cybernetic systems with “a human inside”, not deductive linear systems that derive knowledge from static data. One should recognise here a key finding from the NATF reports on Artificial Intelligence and Machine Learning (see “Artificial Intelligence Applications Ecosystem: How to Grow Reinforcing Loops”).
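The gap between conditioning and intervening can be shown with a small simulation. This is a toy structural model of my own (not from the book or the app), assuming stress is a confounder that drives both coffee drinking and bad sleep; the observed P(bad sleep | coffee) is then inflated compared with P(bad sleep | do(coffee)):

```python
import random

random.seed(0)

def bad_sleep_rate(n=100_000, do_coffee=None):
    """Toy structural model: stress -> coffee, stress -> bad sleep, coffee -> bad sleep.
    Returns P(bad sleep) among coffee drinkers, either as observed (do_coffee=None)
    or when coffee drinking is forced by intervention (do_coffee=True)."""
    outcomes = []
    for _ in range(n):
        stress = random.random() < 0.3
        # Observationally, stressed people drink more coffee ...
        coffee = (random.random() < (0.8 if stress else 0.2)) if do_coffee is None else do_coffee
        # ... and stress also degrades sleep, on top of a small direct coffee effect.
        bad_sleep = random.random() < 0.1 + 0.4 * stress + 0.1 * coffee
        if coffee:
            outcomes.append(bad_sleep)
    return sum(outcomes) / len(outcomes)

print("P(bad sleep | coffee)     ~", round(bad_sleep_rate(), 3))                # inflated by the confounder
print("P(bad sleep | do(coffee)) ~", round(bad_sleep_rate(do_coffee=True), 3))  # closer to the causal effect
```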

The role of the participant is especially important because there is a fair amount of subjectivity when hunting for causality. Judea Pearl gives many examples where the choice of controlling factors should be influenced by the “prior belief” of the experimenters, at the risk of misreading the data. He writes: “When causation is concerned, a grain of wise subjectivity tells us more about the real world than any amount of objectivity”. He also insists on the importance of the data collection process. For him, one of the reasons statisticians are often the most puzzled by the Monty Hall paradox is the habit of looking at data as a flat, static table: “No wonder statisticians found this puzzle hard to comprehend. They are accustomed to, as R.A. Fisher (1922) put it, “the reduction of the data” and ignoring the data-generation process”. As said earlier, I strongly encourage you to read the book to learn about confounders – which are easy to explain with causal diagrams – and how they play a critical role in those causality paradoxes where intuition is easily fooled. This is the heart of the book: “I consider the complete solution of the confounding problem one of the main highlights of the Causal Revolution because it has ended an era of confusion that has probably resulted in many wrong decisions in the past”.

3. Finding a Diamond in the Rough



Another interesting book about hunting for causality is “Why: A Guide to Finding and Using Causes” by Samantha Kleinberg. This book starts with the idea that causality is hard to understand and hard to establish. Saying that “correlation is not causation” is not enough; understanding causation is more complex. Statistics do help to establish correlation, but people are prone to see correlation where none exists: “many cognitive biases lead to us seeing correlations where none exist because we often seek information that confirms our beliefs”. Even once we validate a correlation with statistical tools, we need to be careful, because even seasoned statisticians “cannot resist treating correlations as if they were causal”.
Samantha Kleinberg talks about Granger causality: “one commonly used method for inference with continuous-valued time series data is Granger”, the idea that a time delay observed within a correlation may be a hint of causality. Judea Pearl warns us that this may simply be the work of a confounder with asymmetric delays, but in practice the Granger test is not a proof, only a good indicator for causality. The proper wording is that this test is a good indicator of “predictive causality”. More generally, if predicting a value Y from the past of X, up to a non-null delay, does a good job, we may say that there is a good chance of “predictive causality” from X to Y. This links the tool of forecasting to our goal of causality hunting. It is an interesting tool since it may be used with non-linear models (contrary to Granger causality) and multi-variate analysis. If we start from a causal diagram in Pearl’s sense, we may check whether the root nodes (the hypothetical causes) can be used successfully to predict the future of the target nodes (the hypothetical “effects”). This is, in a nutshell, how the Knomee mobile app operates: it collects data associated with a causal diagram and uses forecasting as a possible indicator of “predictive causality”.
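To make “predictive causality” concrete, here is a minimal sketch of the idea (my own illustration using a lagged linear regression, not the statistical Granger test with its F-statistics, and not the Knomee algorithm): does adding the past of X reduce the error of a forecast of Y built from Y’s own past?

```python
import numpy as np

def predictive_gain(x, y, lag=1):
    """Crude 'predictive causality' check in the spirit of Granger: does the
    lagged history of x improve a linear forecast of y compared with using
    y's own past only?  Returns the relative drop in mean squared error
    (0 = no help, closer to 1 = strong help)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    y_now, y_past, x_past = y[lag:], y[:-lag], x[:-lag]
    ones = np.ones_like(y_past)

    # Restricted model: y_t ~ y_{t-lag}
    A1 = np.column_stack([ones, y_past])
    r1 = y_now - A1 @ np.linalg.lstsq(A1, y_now, rcond=None)[0]

    # Full model: y_t ~ y_{t-lag} + x_{t-lag}
    A2 = np.column_stack([ones, y_past, x_past])
    r2 = y_now - A2 @ np.linalg.lstsq(A2, y_now, rcond=None)[0]

    return 1.0 - np.mean(r2**2) / np.mean(r1**2)

# Toy example: y reacts to x with a one-step delay, plus noise.
rng = np.random.default_rng(42)
x = rng.normal(size=60)                      # a short series, e.g. daily coffee intake
y = 0.7 * np.roll(x, 1) + 0.3 * rng.normal(size=60)
y[0] = 0                                     # discard the wrap-around value from np.roll
print(f"relative MSE reduction: {predictive_gain(x, y, lag=1):.2f}")
```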
The search for “why” with self-tracking data is quite interesting because most values (heart rate, mood, weight, number of steps, etc.) are non-stationary on a short time scale but bounded over a long time horizon, while exhibiting a lot of daily variation. This makes detecting patterns more difficult, since it is quite different from extrapolating the movement of a predator from its previous positions (another short time series). We are much better at “understanding” patterns that derive from linear relations than those that emerge from complex causality loops with delays. The analysis of delays between two observations (at the heart of Granger causality) is also a key tool in complex systems analysis. We must, therefore, bring it with us when hunting for causality. This is why the Knomee app includes multiple correlation/delay analyses to confirm or invalidate the causal hypothesis.
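As an illustration of this kind of delay analysis (a sketch of the general idea, not the actual Knomee implementation), one can simply scan candidate delays between a factor and the target and keep the one with the strongest correlation:

```python
import numpy as np

def best_delay(factor, target, max_lag=7):
    """Scan candidate delays (in samples, e.g. days) and return the lag at
    which the factor correlates most strongly with the target, plus the full
    table of correlations.  A dominant non-zero lag is a hint worth
    investigating, not a proof of causation."""
    factor, target = np.asarray(factor, float), np.asarray(target, float)
    n = len(factor)
    scores = {lag: float(np.corrcoef(factor[:n - lag], target[lag:])[0, 1])
              for lag in range(1, max_lag + 1)}
    scores[0] = float(np.corrcoef(factor, target)[0, 1])
    return max(scores, key=lambda k: abs(scores[k])), scores

# Toy example: the target echoes the factor two days later.
rng = np.random.default_rng(1)
factor = rng.normal(size=45)                 # e.g. daily minutes of sport
target = 0.6 * np.concatenate([np.zeros(2), factor[:-2]]) + 0.4 * rng.normal(size=45)
lag, scores = best_delay(factor, target)
print("most correlated delay:", lag, "correlation:", round(scores[lag], 2))
```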

A few other pearls of wisdom about causality hunting with self-tracking may be found in the book by Gina Neff and Dawn Nafus. This reference book on the quantified self and self-tracking crosses a number of ideas that we have already exposed, such as the critical importance of the user in the tracking and learning process. Self-tracking – a practice which is both very ancient and has shown its value repeatedly – is usually boring if no sense is derived from the experiment. Making sense is either positive, such as finding causality, or negative, such as disproving a causality hypothesis. Because we can collect data more efficiently in the digital world, the quest for sense is even more important: “Sometimes our capacity to gather data outpaces our ability to make sense of it”. In the first part of the book, we find this statement, which echoes nicely the principles of Judea Pearl: “A further goal of this book is to show how self-experimentation with data forces us to wrestle with the uncertain line between evidence and belief, and how we come to decisions about what is and is not legitimate knowledge”. We have talked about small data and short time series from the beginning because experience shows that most users do not collect data over long periods of time: “Self-tracking projects should start out as brief experiments that are done, say, over a few days or a few weeks. While there are different benefits to tracking over months or years, a first project should not commit you for the long haul”. This is why we shall focus in the next section on algorithms that can work robustly with a small amount of data.
Self-tracking is foremost a learning experiment: “The norm within QS is that “good” self-tracking happens when some learning took place, regardless of what kind of learning it was”. A further motive for self-tracking is often behavioural change, which is also a form of self-learning. As biologists tell us, learning is most often associated with pleasure and reward. As pointed out in a previous post, there is a continuous cycle: pleasure to desire to plan to action to pleasure, which is a common foundation for most learning in living creatures. Therefore, there is a dual dependency between pleasure and learning when self-tracking: one must learn (make sense out of the collected data) to stay motivated and pursue the self-tracking experience (which is never very long), and this experience should reward the user with some form of pleasure, from surprise and fun to the satisfaction of learning something about oneself.

Forecasting is a natural part of the human learning process. We constantly forecast what will happen and learn by reacting to the difference. As explained by Michio Kaku, our sense of humour and the pleasure that we associate with surprises are a Darwinian mechanism that pushes us to constantly improve our forecasting (and modelling) abilities. We forecast continuously, we experience reality, and we enjoy the surprise (the difference between what happens and what we expected) as an opportunity to learn in a Bayesian way, that is, to revise our prior assumptions (our model of the world). The importance of curiosity as a key factor for learning is now widely accepted in the machine learning community, as illustrated in the ICML 2017 paper “Curiosity-driven Exploration by Self-supervised Prediction”. The role of surprise and fun in learning is another reason to be interested in forecasting algorithms. Forecasting the future, even if unreliable, creates positive emotions around self-tracking. This is quite general: we enjoy forecasts, which we see as games (in addition to their intrinsic value) – one can think of sports or politics as examples. A self-tracking forecasting algorithm that does a decent job (i.e., neither too wrong nor wrong too often) works in a way similar to our brain: it is invisible and acts as a time saver most of the time, and when wrong it signals a moment of interest. We shall now come back to the topic of forecasting algorithms for short time series, since we have established that they could play an interesting role in causality hunting.

4. Machine Generation of Robust Algorithms


Our goal in this last section is to look at the design of robust algorithms for short time-series forecasting. Let us first define what I mean by robust, which will explain the metaphor proposed in the introduction. The following figure, extracted from my ROADEF presentation, represents two possible types of “quests” (causal diagrams). Think of a quest as a variable that we try to analyse, together with other variables (the “factors”) which we think might explain the main variable. The vertical axis classifies the observed variation into three categories: the random noise in red, the variation due to factors that were not collected in the sample in orange, and, in green, the part that we may associate with the collected factors. A robust algorithm is a forecasting algorithm that accepts an important part of randomness, to the point that many quests are “pointless” (remember the “Turing test” of causality hunting from the introduction). A robust algorithm should be able to exploit the positive influence of the factors (in green) when and if it exists. The picture makes it clear that we should not expect miracles: a good forecasting algorithm can only improve by a few percent over the simple prediction of the average value. What is actually difficult is to design an algorithm that is not worse – because of overfitting – than average prediction when given a quasi-random input (the right column of the picture).



As the title of the section suggests, I have experimented with machine generation of forecasting algorithms. This technique is also called meta-programming: a first algorithm produces code that represents a forecasting algorithm. I have used this approach many times in the past decades, from complex optimisation problems to evolutionary game theory. I found it interesting many years ago when working on TV audience forecasting because it is a good way to avoid over-fitting, a common plague when doing machine learning on a small data set, and to control the robustness properties thanks to evolutionary meta-techniques. The principle is to create a term algebra that represents instantiations and combinations of simpler algorithms. Think of it as a toolbox. One lever of control (for robustness and over-fitting) is to make sure that you only put “robust tools” in the box. This means that you may not obtain the best or most complex machine learning algorithms such as deep learning, but you ensure both “explainability” and control. The meta-algorithm is an evolutionary randomised search algorithm (similar in spirit to the Monte-Carlo Tree Search of AlphaZero) that may be sophisticated (using genetic combinations of terms) or simple (which is what we use for short time series).
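To give a feel for the approach, here is a heavily simplified sketch (my own toy illustration in Python, not the actual generator) of a term algebra made of simple forecasters and combination operators, explored by a naive evolutionary loop:

```python
import random
import statistics

# A tiny "term algebra" of robust building blocks: each term is a function
# history -> forecast of the next value.  A real toolbox would also include
# regressions, weekly/hourly patterns, correlation terms with delays, etc.
def mean_term(history):
    return statistics.fmean(history)

def moving_average(window):
    return lambda history: statistics.fmean(history[-window:])

def blend(term_a, term_b, weight):
    """Combination operator: a weighted mix of two simpler terms."""
    return lambda history: weight * term_a(history) + (1 - weight) * term_b(history)

def random_term():
    """Sample a term from the algebra (a leaf or a one-level combination)."""
    leaves = [mean_term, moving_average(3), moving_average(7)]
    if random.random() < 0.5:
        return random.choice(leaves)
    return blend(random.choice(leaves), random.choice(leaves), random.random())

def evolve(series, fitness, generations=200):
    """Very simple evolutionary loop: keep the sampled term with the best
    fitness (lower is better).  A real implementation would add mutation,
    crossover and regularisation to keep the generated terms simple."""
    best, best_score = None, float("inf")
    for _ in range(generations):
        candidate = random_term()
        score = fitness(candidate, series)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score

# One-step-ahead mean absolute error as a placeholder fitness function.
def one_step_mae(term, series, warmup=7):
    errors = [abs(term(series[:t]) - series[t]) for t in range(warmup, len(series))]
    return statistics.fmean(errors)

random.seed(3)
series = [5 + 0.1 * t + random.gauss(0, 1) for t in range(50)]   # toy weight-like quest
best, score = evolve(series, one_step_mae)
print("best fitness (MAE):", round(score, 2))
```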

The forecasting algorithm used by the Knomee app is produced locally on the user’s phone from the collected data. To test robustness, we have collected self-tracking data over the past two years - for those of you who are curious to apply other techniques, the data is available on GitHub. The forecasting algorithm is the fixed point of an evolutionary search. This is very similar to reinforcement learning in the sense that each iteration is directed by a fitness function that describes the accuracy of the forecasting (modulo regularisation, as explained in the presentation). The training protocol consists of running the resulting forecasting algorithm on each sample of the data set (a quest), for each time position from 2/3 to 3/3 of the ordered time series. In other words, the score that we use is the average precision of the forecasts that a user would experience during the last third of the data collection process. The term algebra used to represent and generate forecasting algorithms is made of simple heuristics such as regression and movingAverage, of weekly and hourly time patterns, and of correlation analysis with threshold, cumulative and delay options. With the proper choice of meta-parameters to tune the evolutionary search (such as the fitness function or the depth and scope of local optimisation), this approach is able to generate a robust algorithm, that is, one that (1) produces better forecasts than the average baseline (although not by much) and (2) is not thrown off by pseudo-random time series. Let me state clearly that this approach is not a “silver bullet”. I have compared the algorithm produced by this evolutionary search with the classical and simple machine learning approaches that one would use for time series: regression, k-means clustering and ARMA. I refer you to the great book “Machine Learning for the Quantified Self” by M. Hoogendoorn and B. Funk for a complete survey of how to use machine learning with self-tracking data. On regular data (such as sales time series), the classical algorithms perform slightly better than evolutionary code generation. However, when real self-tracking data is used, with all its randomness, evolutionary search manages to synthesise robust algorithms, which none of the three classical algorithms are.
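As an illustration of this evaluation protocol (a simplified sketch under my own assumptions, not the actual Knomee code), the score can be computed by replaying a forecaster at every time position of the last third of a quest and comparing it with the “predict the historical mean” baseline; in the full approach, a score of this kind is what the evolutionary search sketched above would use as its fitness function:

```python
import random
import statistics

def last_third_score(forecast, series):
    """Replay the forecaster at every time position in the last third of the
    ordered series and return the mean absolute error a user would have
    experienced there, next to the score of the mean-prediction baseline."""
    start = (2 * len(series)) // 3
    model_err, baseline_err = [], []
    for t in range(start, len(series)):
        history = series[:t]
        model_err.append(abs(forecast(history) - series[t]))
        baseline_err.append(abs(statistics.fmean(history) - series[t]))
    return statistics.fmean(model_err), statistics.fmean(baseline_err)

# A 7-point moving average stands in for the generated algorithm.
moving_avg = lambda history: statistics.fmean(history[-7:])

random.seed(11)
quest = [6 + random.gauss(0, 1) for _ in range(45)]     # quasi-random toy quest
model_mae, baseline_mae = last_third_score(moving_avg, quest)
print(f"model MAE={model_mae:.2f}  baseline MAE={baseline_mae:.2f}")
```

On a quasi-random quest like this one, a robust candidate should score no worse than the baseline; beating it clearly is only possible when the factors actually carry signal.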

5. Conclusion



This topic is more complex than many of the subjects that I address here. I have tried to stay away from the truly technical aspects, at the expense of scientific precision. I will conclude this post with a very short summary:
  1. Causality hunting is a fascinating topic. As we accumulate more and more data, and as Artificial Intelligence tools become more powerful, it is quite logical to hunt for causality and to build models that represent a fragment of our world knowledge through machine learning. This is, for instance, the heart of the Causality Link startup led by my friend Pierre Haren, which automatically builds knowledge graphs from textual data while extracting causal links, which are then used for deep situation analysis with scenarios.
  2. Causality hunting is hard, especially with small data and even more so with “Quantified Self” data, because of the random nature of many of the time series that are collected with connected devices. It is also hard because we cannot track everything, and quite often what we are looking for depends on other variables (the orange part of the previous picture).
  3. Forecasting is an interesting tool for causality hunting. This is counter-intuitive since forecasting is close to impossible with self-tracking data. A better formulation would be: “a moderate amount of robust forecasting may help with causality hunting”. Forecasting gives a hint of “predictive causality”, in the sense of Granger causality, and it also serves to enrich the pleasure-surprise-discovery learning loop of self-tracking.
  4. Machine code generation through reinforcement learning is a powerful technique for short time-series forecasting. Code-generating algorithms try to assemble building blocks from a given set to match a given output. When applied to self-tracking forecasting, this technique makes it possible to craft algorithms that are robust to random noise (i.e., that recognise such data for what it is) and able to extract a weak correlative signal from a complex (although short) data set.




 