First, let’s set the definitions straight: correlation describes a relationship between variables; causation indicates how one variable influences another. As a young single person, my grumpiness level on a plane (Variable 1) is highly correlated with the number of children on that plane (Variable 2). I know that the number of children does not cause my grumpiness (and my grumpiness certainly does not cause the number of children). However, if I swap out the “number of children” variable for “the decibel level of a child’s wails and suffering,” then I can say with certainty that a high Variable 2 causes a high Variable 1.
It’s important to remember that correlation does not imply causation, and knowing the difference between the two is pivotal when drawing conclusions from a dataset. For example, on this website, the author shows that there is a 95% correlation (R squared) between per capita cheese consumption and the number of people who died by becoming tangled in their bedsheets. While this is an extremely high correlation, we know intuitively that these variables have no causal relationship. Without a causal relationship, even a very high correlation is useless for drawing conclusions.
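To make the point concrete, here is a minimal sketch using made-up numbers (not the cheese or bedsheet data). It assumes one familiar way that unrelated variables end up highly correlated: both simply drift upward over time. The series names, slopes and noise levels are all illustrative assumptions.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

random.seed(1)
n = 200

# Two made-up series that never influence each other; each just trends upward.
series_a = [1.0 * t + random.gauss(0, 1) for t in range(n)]
series_b = [0.5 * t + random.gauss(0, 1) for t in range(n)]

# Correlation of the raw levels is dominated by the shared trend.
r_levels = pearson(series_a, series_b)

# Period-over-period changes strip out the trend, leaving only the noise.
changes_a = [b - a for a, b in zip(series_a, series_a[1:])]
changes_b = [b - a for a, b in zip(series_b, series_b[1:])]
r_changes = pearson(changes_a, changes_b)

print(f"correlation of levels:  {r_levels:.3f}")
print(f"correlation of changes: {r_changes:.3f}")
```

The raw levels come out almost perfectly correlated, while the period-over-period changes show little relationship, even though neither series has any effect on the other.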
RSEG has been applying machine learning techniques to determine the primary drivers (or causes) of asset over- (or under-) performance for several years, and designing models that focus on causes over correlations is central to that work.
Let’s compare two different models: one built using ALL of the variables in our analytics-ready database (Figure 1), and another that uses the same well set but applies a degree of “domain knowledge” in picking the variables included in the model (Figure 2). The first model looks impressive, with an R squared of 0.95, but that high R squared is a result of including variables that are highly correlated with our target variable but have no causal relationship to it. Now take Figure 2, which summarizes a model built using only causal variables for the production metric we want to predict. Even though the R squared is lower, this model is much more useful, because its inputs actually drive the outcome.
At RSEG, we do not build these models to obtain a high correlation; they are simply one tool in our toolkit. And we know that the most useful tools help solve problems such as, “How should Operator X space its wells?” or, “What are the most important geological variables?” These questions help our interdisciplinary teams of developers, engineers, data scientists, financial analysts and geologists focus on what we are trying to achieve with our machine learning techniques: helping investors and operators make better decisions with more confidence.
FIGURE 1 | Model Fit With Correlated Features Included
FIGURE 2 | Model Fit With Only Causal Variables Included
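The contrast between the two figures can be reproduced with a toy example. The sketch below is not RSEG's model or data: it uses synthetic wells with a hypothetical causal driver (lateral_length) and a hypothetical correlated variable (revenue) that is a consequence of production rather than a cause. A small hand-rolled least-squares fit stands in for whatever machine learning technique is actually used.

```python
import random

def fit_ols(rows, y):
    """Least-squares fit with intercept; returns [intercept, coef_1, ...]."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X^T X) beta = X^T y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(beta, rows, y):
    """In-sample R squared of a fitted linear model."""
    pred = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

random.seed(0)
n = 500

# Hypothetical causal driver: longer laterals produce more (true effect = 2.0).
lateral_length = [random.uniform(1.0, 3.0) for _ in range(n)]
production = [2.0 * x + random.gauss(0, 1.0) for x in lateral_length]
# Hypothetical correlated variable: revenue follows FROM production.
revenue = [5.0 * p + random.gauss(0, 0.5) for p in production]

# Model 1: every available variable, including the non-causal one.
rows_all = list(zip(lateral_length, revenue))
beta_all = fit_ols(rows_all, production)
r2_all = r_squared(beta_all, rows_all, production)

# Model 2: only the causal variable.
rows_causal = [(x,) for x in lateral_length]
beta_causal = fit_ols(rows_causal, production)
r2_causal = r_squared(beta_causal, rows_causal, production)

print(f"all variables: R^2 = {r2_all:.3f}, lateral coef = {beta_all[1]:.3f}")
print(f"causal only:   R^2 = {r2_causal:.3f}, lateral coef = {beta_causal[1]:.3f}")
```

In this toy setup the all-variables model posts the higher R squared, but its coefficient on lateral_length collapses toward zero, so it cannot say what happens to production if the lateral is lengthened. The causal-only model fits less tightly yet recovers an effect close to the true value of 2.0, which is the quantity a decision would actually depend on.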