Causal Inference: have you been doing science wrong all this time?

Benjamin Vincent — Sat, 29 Jul 2023 23:00:00 GMT

Introduction

If you are a researcher in academia or industry, depending where you are in your causal journey, this post may blow your mind! Its primary purpose is to convey that if you have not started thinking about causality yet, then you should perhaps start to feel deep existential dread.

Let’s say you are doing a social science study and you are interested in how a bunch of predictor variables (e.g. impulsivity, age, sex, etc.) influence some outcome variable such as body mass index. An exceedingly common practice is to throw all of these variables into a multiple linear regression.

When you do this you will get regression coefficients that inform you of the statistical associations between your predictors and outcome variables. The problem is that these are only sometimes useful in telling us what we really care about, namely the causal relationships between the variables.

Now, it’s entirely possible that you’ve gotten lucky and by running a multiple regression with a bunch of control variables you end up with a statistical estimate that is not too far away from a true causal effect.

Then again, it is entirely possible that you are not lucky, and by including one or more predictor variables in your regression you have actually introduced massive bias into your estimate of the causal effect. Maybe you think there is a causal effect but actually there is no causal effect at all. Or maybe there is a causal effect but it is in the opposite direction to what you think. Or maybe there is a causal effect but it is much larger or smaller than you think.

At this point you might be thinking either “I don’t believe you” or “I need to be convinced before I start to panic.” Well, that is what this post is about. By using a simulated dataset with a known causal structure, we will see how different approaches to estimating causal effects can lead to very different results.

We will see two approaches that can be used to estimate causal effects: (1) linear regression (with good controls included and none of the bad controls), and (2) structural causal models. The first approach is perhaps easier to implement because regressions are familiar, but the second approach is more powerful and automatic. But there is no magic pill however, both approaches require you to think carefully about the causal structure of your data, and there are no guarantees that you’ve imagined the correct causal structure. But by at least thinking about it we can do our best to avoid making egregious errors out of ignorance.

Generate a dataset

First we will generate some data from this causal DAG. Just to keep things simple, all the nodes are normally distributed and the relationships between the nodes are linear.

Figure 1: A causal DAG with four variables. We are primarily interested in the relationship between and . But there is a confounding variable which influences both and . There is also a collider which is influenced by and . This DAG is loosely based on Molak (2023, p 139).

Let’s generate a simulated dataset of 5,000 observations and look at the first 5. This is perhaps quite a lot of observations, but doing this means that we can be less worried about the estimation error and focus more on how good or bad our estimates are in the best case scenario of abundant data.

Code

Benjamin T. Vincent

Causal Inference: have you been doing science wrong all this time?

Introduction

Generate a dataset