Today we will focus on the classic exogeneity/selection-on-observables/unconfoundedness/no omitted variable bias/… setting where we are interested in the causal effect
of treatment \(D\)
on outcome \(Y\)
while controlling for all confounding variables \(X\)
I assume that you are all familiar with this setting. OLS is usually introduced in such a setting. Historically, new estimation methods are also introduced in this setting before the ideas are extended to other research designs like instrumental variables or difference-in-differences. Therefore it feels natural to introduce causal ML estimators in the “easy” unconfoundedness setting first.
Once we have convinced ourselves (and hopefully others) that unconfoundedness is justified in our setting, the big question is how to estimate the causal effect. These are the estimators that we will consider and hand-code today:
- OLS:
  - using lm_robust()
  - manually replicated using the Frisch-Waugh representation
  - replicated within the DoubleML package
- Partially linear regression (PLR):
  - Parametric version hand-coded with OLS and Logit
  - Double ML version with random forest via the DoubleML package
- Augmented inverse probability weighting (AIPW):
  - Parametric version hand-coded with OLS and Logit
  - Double ML version with random forest via the DoubleML package
- Causal forest:
  - Implemented using the grf package
We use data from the hdm package. The data was used in Chernozhukov and Hansen (2004). Their paper investigates the effect of participation in the employer-sponsored 401(k) retirement savings plan (our \(D\)) on net assets (our \(Y\)) while controlling for ten confounders (our \(X\)).
We use the data because it is publicly available and the programs run fast enough for our purposes. However, we do not really care about the application today (call ?pension to see the variable description).
library(ggplot2)
library(DoubleML)
library(mlr3)
library(grf)
library(hdm)
library(estimatr)
set.seed(1234) # for replicability
# Load data
data(pension)
# Outcome
Y = pension$net_tfa
# Treatment
D = pension$p401
# Create main effects matrix
X = model.matrix(~ 0 + age + db + educ + fsize + hown + inc + male + marr + pira + twoearn, data = pension)
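If you want a quick feel for the objects we just created, a few optional sanity checks (not needed for anything below) are:
# Optional sanity checks on the prepared objects
dim(X)      # number of observations and controls
table(D)    # 401(k) participants vs. non-participants
summary(Y)  # distribution of the outcome (net assets)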
Let’s start with canonical OLS to estimate a linear model:
\[ Y = \tau D + X'\beta + \varepsilon \]
Unfortunately, the base R lm() function provides no robust standard errors, so I opt for the lm_robust() function of the estimatr package:
# OLS standard
ols_lm = lm_robust(Y ~ D + X)
summary(ols_lm)
Call:
lm_robust(formula = Y ~ D + X)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) -32701.099 4765.8379 -6.8616 7.219e-12 -4.204e+04 -23359.086 9903
D 11590.383 1812.4761 6.3948 1.680e-10 8.038e+03 15143.205 9903
Xage 630.113 55.6750 11.3177 1.628e-29 5.210e+02 739.247 9903
Xdb -4879.000 1300.4171 -3.7519 1.765e-04 -7.428e+03 -2329.918 9903
Xeduc -626.509 343.6506 -1.8231 6.832e-02 -1.300e+03 47.116 9903
Xfsize -1056.891 406.3236 -2.6011 9.306e-03 -1.853e+03 -260.414 9903
Xhown 859.466 973.8877 0.8825 3.775e-01 -1.050e+03 2768.485 9903
Xinc 0.916 0.1119 8.1880 2.984e-16 6.967e-01 1.135 9903
Xmale -985.766 1476.6786 -0.6676 5.044e-01 -3.880e+03 1908.825 9903
Xmarr 730.671 1681.1182 0.4346 6.638e-01 -2.565e+03 4026.005 9903
Xpira 28920.231 1799.5039 16.0712 2.153e-57 2.539e+04 32447.625 9903
Xtwoearn -19404.112 2557.3323 -7.5876 3.551e-14 -2.442e+04 -14391.220 9903
Multiple R-squared: 0.2352 , Adjusted R-squared: 0.2344
F-statistic: 101.3 on 11 and 9903 DF, p-value: < 2.2e-16
This provides the familiar output including the target parameter and all the distracting coefficients that we should not interpret causally.
The Frisch-Waugh Theorem tells us that a numerically equivalent procedure is to:
1. Run the regression \(Y = X \beta_Y + U_Y\), get the fitted values \(\hat{Y}\), and compute the outcome residuals \(\hat{U}_Y = Y - \hat{Y}\)
2. Run the regression \(D = X \beta_D + U_D\), get the fitted values \(\hat{D}\), and compute the treatment residuals \(\hat{U}_D = D - \hat{D}\)
3. Run a final residual-on-residual regression \(\hat{U}_Y = \tau \hat{U}_D + \epsilon\) (even without a constant) to obtain the target parameter
Let’s see this in code:
# OLS Frisch Waugh
Dhat_ols = lm(D ~ X)$fitted.values
Yhat_ols = lm(Y ~ X)$fitted.values
Dres_ols = D - Dhat_ols
Yres_ols = Y - Yhat_ols
ols_hand = lm_robust(Yres_ols ~ 0 + Dres_ols)
summary(ols_hand)
Call:
lm_robust(formula = Yres_ols ~ 0 + Dres_ols)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
Dres_ols 11590 1809 6.407 1.548e-10 8045 15136 9914
Multiple R-squared: 0.00744 , Adjusted R-squared: 0.00734
F-statistic: 41.05 on 1 and 9914 DF, p-value: 1.548e-10
We indeed obtain the identical point estimate as above and only marginally different standard errors.
Crucial takeaway: We can think about OLS as running a residual-on-residual regression.
For reasons that will become obvious later, we can also run plain OLS as a special case of Double Machine Learning. To this end, we leverage the DoubleML package, which comes with somewhat higher setup costs. However, I code everything for you in this notebook, and you can take a deep dive with many explanations and examples in the DoubleML documentation.
# Prepare the data (stored under the two names data_dml and dml_data, both are used below)
data_dml = dml_data = double_ml_data_from_matrix(X = X, y = Y, d = D)
# Specify the model using OLS as the learner for treatment and outcome
lrn_ols = lrn("regr.lm")
ols_dml = DoubleMLPLR$new(data_dml,
ml_l = lrn_ols,
ml_m = lrn_ols,
n_folds = 1,
apply_cross_fitting = FALSE)
ols_dml$fit()
INFO [12:35:35.857] [mlr3] Applying learner 'regr.lm' on task 'nuis_l' (iter 1/1)
INFO [12:35:36.080] [mlr3] Applying learner 'regr.lm' on task 'nuis_m' (iter 1/1)
print(ols_dml)
================= DoubleMLPLR Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
Instrument(s):
No. Observations: 9915
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml1
------------------ Machine learner ------------------
ml_l: regr.lm
ml_m: regr.lm
------------------ Resampling ------------------
No. folds: 1
No. repeated sample splits: 1
Apply cross-fitting: FALSE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 11590 1809 6.408 1.47e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We obtain exactly the same results as above. You do not believe me? Let’s formally test whether the point estimates of the implementations are identical:
cat("lm_robust() and DoubleML identical?", all.equal(as.numeric(ols_lm$coefficients[2]),as.numeric(ols_dml$coef)))
lm_robust() and DoubleML identical? TRUE
Not-so-serious takeaway: You have been using Double ML since the first time you ran OLS.
Double ML is a very general framework. We introduce it using the canonical leading examples: partially linear regression and augmented inverse probability weighting.
We introduce the partially linear outcome model \[ Y = \tau D + g(X) + \varepsilon \]
where the controls do not enter in a linear fashion but as a nonparametric function \(g(X)\) because why should the world be linear? The remaining restriction is that the causal effect is assumed to be homogeneous and not allowed to vary with \(X\).
This model underlies the partially linear regression (PLR) of Robinson (1988) and Chernozhukov et al. (2018) that is implemented using the following steps:
1. Predict the outcome \(\hat{Y} = \hat{E}[Y|X]\) with a suitable method and obtain the outcome residuals \(\hat{U}_Y = Y - \hat{Y}\)
2. Predict the treatment \(\hat{D} = \hat{E}[D|X]\) with a suitable method and obtain the treatment residuals \(\hat{U}_D = D - \hat{D}\)
3. Run a final residual-on-residual regression \(\hat{U}_Y = \tau \hat{U}_D + \epsilon\) (even without a constant) to obtain the target parameter
We call the outcome and treatment predictions the “nuisance parameters” because we are not interested in them per se but they are needed to get our hands on the “target parameter” \(\tau\).
We now immediately see that OLS is just a special case of partially linear regression where outcome and treatment are predicted using OLS. This explains why we can run plain OLS in the DoubleML infrastructure.
In our application, we have a binary treatment. We can therefore use, e.g. standard logistic regression to form the predictions and the residuals of the treatment.
# Parametric PLR hand-coded
logit_model = glm(D ~ X, family = binomial())
Dhat_logit = predict(logit_model, type = "response")
Dres_logit = D - Dhat_logit
We then run our first “official” partially linear regression, where we recycle the OLS outcome residuals from above and use the newly obtained logit treatment residuals:
plr_logit_hand = lm_robust(Yres_ols ~ 0 + Dres_logit)
summary(plr_logit_hand)
Call:
lm_robust(formula = Yres_ols ~ 0 + Dres_logit)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
Dres_logit 11289 1780 6.342 2.362e-10 7800 14778 9914
Multiple R-squared: 0.007096 , Adjusted R-squared: 0.006996
F-statistic: 40.23 on 1 and 9914 DF, p-value: 2.362e-10
We will compare the magnitudes of all estimators at the end below and for now focus on the implementation.
Double Machine Learning is doing exactly the same thing if we tell it to use OLS for outcome prediction and logit for treatment prediction:
# Parametric PLR DoubleML
lrn_logit = lrn("classif.log_reg", predict_type = "prob")
plr_logit_dml = DoubleMLPLR$new(data_dml, ml_l = lrn_ols, ml_m = lrn_logit, n_folds = 1, apply_cross_fitting = FALSE)
plr_logit_dml$fit()
INFO [12:35:36.372] [mlr3] Applying learner 'regr.lm' on task 'nuis_l' (iter 1/1)
INFO [12:35:36.475] [mlr3] Applying learner 'classif.log_reg' on task 'nuis_m' (iter 1/1)
print(plr_logit_dml)
================= DoubleMLPLR Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
Instrument(s):
No. Observations: 9915
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml1
------------------ Machine learner ------------------
ml_l: regr.lm
ml_m: classif.log_reg
------------------ Resampling ------------------
No. folds: 1
No. repeated sample splits: 1
Apply cross-fitting: FALSE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 11289 1780 6.343 2.25e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Again let’s check whether the point estimates are actually identical:
all.equal(as.numeric(plr_logit_hand$coefficients),as.numeric(plr_logit_dml$coef))
[1] TRUE
Of course, the cool recent development is that the Double ML framework shows that we can apply machine learning methods to form the predictions that enter the residual-on-residual regression. This helps us to avoid what is (at least in my opinion) the most annoying part of estimating effects: the model specification. Instead, we outsource this task to powerful ML methods.
For example, we can use random forests to form the predictions. I like random forests because they are built to find interactions and nonlinearities in a data-driven manner and do not require us to commit to a dictionary of transformations as, for example, the Lasso does. However, the recipe is generic and you can decide which prediction algorithm is most suitable for your application.
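To illustrate how generic the recipe is, here is a minimal sketch of the same PLR with cross-validated Lasso instead of a random forest. It assumes that the mlr3learners and glmnet packages are installed, and note that with only ten main effects you would typically first expand X with interactions and polynomials before handing it to the Lasso:
# Sketch (assumes mlr3learners + glmnet are installed): PLR with cross-validated Lasso
library(mlr3learners)
lrn_lasso = lrn("regr.cv_glmnet")                                 # outcome prediction
lrn_lasso_prob = lrn("classif.cv_glmnet", predict_type = "prob")  # treatment prediction
plr_lasso_dml = DoubleMLPLR$new(dml_data, ml_l = lrn_lasso, ml_m = lrn_lasso_prob)
plr_lasso_dml$fit()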
So let’s see how we can implement PLR with random forest (should run within one minute):
# PLR using Random Forest
lrn_ranger = lrn("regr.ranger")
lrn_ranger_prob = lrn("classif.ranger")
plr_ranger_dml = DoubleMLPLR$new(dml_data, ml_l=lrn_ranger, ml_m=lrn_ranger)
plr_ranger_dml$fit(store_predictions=TRUE)
INFO [12:35:36.833] [mlr3] Applying learner 'regr.ranger' on task 'nuis_l' (iter 1/5)
INFO [12:35:39.334] [mlr3] Applying learner 'regr.ranger' on task 'nuis_l' (iter 2/5)
INFO [12:35:42.397] [mlr3] Applying learner 'regr.ranger' on task 'nuis_l' (iter 3/5)
INFO [12:35:44.844] [mlr3] Applying learner 'regr.ranger' on task 'nuis_l' (iter 4/5)
INFO [12:35:47.312] [mlr3] Applying learner 'regr.ranger' on task 'nuis_l' (iter 5/5)
INFO [12:35:49.860] [mlr3] Applying learner 'regr.ranger' on task 'nuis_m' (iter 1/5)
INFO [12:35:51.796] [mlr3] Applying learner 'regr.ranger' on task 'nuis_m' (iter 2/5)
INFO [12:35:53.733] [mlr3] Applying learner 'regr.ranger' on task 'nuis_m' (iter 3/5)
INFO [12:35:55.662] [mlr3] Applying learner 'regr.ranger' on task 'nuis_m' (iter 4/5)
INFO [12:35:57.563] [mlr3] Applying learner 'regr.ranger' on task 'nuis_m' (iter 5/5)
print(plr_ranger_dml)
================= DoubleMLPLR Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
Instrument(s):
No. Observations: 9915
------------------ Score & algorithm ------------------
Score function: partialling out
DML algorithm: dml2
------------------ Machine learner ------------------
ml_l: regr.ranger
ml_m: regr.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 13669 1450 9.427 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Let’s replicate the DoubleML output manually to convince you that it is really as easy as I claim. We told DoubleML to store the nuisance predictions by setting store_predictions=TRUE in the fit() call above.
This means we can extract the outcome and treatment predictions
Dhat_ranger = plr_ranger_dml$predictions$ml_m
Yhat_ranger = plr_ranger_dml$predictions$ml_l
and use them in a residual-on-residual regression
Dres_ranger = D - Dhat_ranger
Yres_ranger = Y - Yhat_ranger
plr_ranger_hand = lm_robust(Yres_ranger ~ 0 + Dres_ranger)
summary(plr_ranger_hand)
Call:
lm_robust(formula = Yres_ranger ~ 0 + Dres_ranger)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
Dres_ranger 13669 1450 9.426 5.217e-21 10826 16511 9914
Multiple R-squared: 0.01099 , Adjusted R-squared: 0.01089
F-statistic: 88.85 on 1 and 9914 DF, p-value: < 2.2e-16
Again the point estimates from package and hand-coded are identical:
all.equal(as.numeric(plr_ranger_dml$coef),as.numeric(plr_ranger_hand$coefficients))
[1] TRUE
Double ML with PLR assumed a homogeneous treatment effect. This is not an innocent assumption and can lead to estimates that are representative of counterintuitive populations if effects are actually heterogeneous (see, e.g., Słoczyński (2022)).
Now we relax this assumption and consider estimators that operate under the flexible/interactive outcome model \[ Y = g(D,X) + \varepsilon \]
Under this model, there is no single \(\tau\) capturing the effect for everybody; instead, the effect is allowed to depend on the \(X\)s. We could now articulate different target parameters, but we focus on the canonical Average Treatment Effect (ATE) \(\tau_{ATE} = E[Y(1) - Y(0)]\).
There are three common strategies to estimate the ATE:
- Regression adjustment (RA)
- Inverse probability weighting (IPW)
- Augmented inverse probability weighting (AIPW)
and only the last one allows the integration of ML via the Double ML framework.
Before moving too fast, let us first consider all three options in a standard parametric setting (we will focus on point estimates and ignore standard errors for the sake of simplicity).
Regression adjustment (RA) proceeds in the following way:
1. Predict the outcome under non-treatment, \(\hat{Y}_0 = \hat{E}[Y|D = 0, X]\), using a model fitted on the untreated observations only
2. Predict the outcome under treatment, \(\hat{Y}_1 = \hat{E}[Y|D = 1, X]\), using a model fitted on the treated observations only
3. Average the difference of the predictions over the full sample: \(\hat{\tau}^{ra} = \frac{1}{n} \sum_i (\hat{Y}_{1,i} - \hat{Y}_{0,i})\)
To get started let’s use OLS for the prediction:
# Regression adjustment / outcome imputation
ols0 = lm(Y ~ X, subset = (D == 0))
Y0hat_ols = predict(ols0, newdata = as.data.frame(X))
ols1 = lm(Y ~ X, subset = (D == 1))
Y1hat_ols = predict(ols1, newdata = as.data.frame(X))
cat("RA with OLS:",
mean(Y1hat_ols - Y0hat_ols))
RA with OLS: 9169.948
IPW follows this recipe:
1. Predict the treatment probability (propensity score) \(\hat{D} = \hat{E}[D|X]\); here we recycle the logit predictions from above
2. Form the weights \(w_1^{ipw} = D / \hat{D}\) and \(w_0^{ipw} = (1 - D) / (1 - \hat{D})\)
3. Average the weighted outcomes: \(\hat{\tau}^{ipw} = \frac{1}{n} \sum_i (w_{1,i}^{ipw} - w_{0,i}^{ipw}) Y_i\)
# IPW
ipw_weight1 = D / Dhat_logit
ipw_weight0 = (1 - D) / (1 - Dhat_logit)
Ytilde_ipw = (ipw_weight1 - ipw_weight0) * Y
cat("IPW with Logit:",
mean(Ytilde_ipw))
IPW with Logit: 7067.326
Augmented inverse probability weighting combines RA and IPW and creates the following pseudo-outcome
\[\begin{align*} \tilde{Y}^{aipw} & = \underbrace{\hat{Y}_1 - \hat{Y}_0}_{\text{RA}} + \underbrace{(w_1^{ipw} - w_0^{ipw}) Y}_{\text{IPW}} - \underbrace{(w_1^{ipw} \hat{Y}_1 - w_0^{ipw} \hat{Y}_0)}_{\text{debiasing term}} \\ &= \hat{Y}_1 - \hat{Y}_0 + \underbrace{w_1^{ipw} (Y - \hat{Y}_1) - w_0^{ipw} (Y - \hat{Y}_0)}_{\text{IPW weighted residuals}} \end{align*}\]
and takes its mean \(\hat{\tau}^{aipw} = \frac{1}{n} \sum_i \tilde{Y}^{aipw}\). One beautiful feature of AIPW is that the standard error of this mean is valid without adjusting for the estimation of the nuisance parameters.
This motivates the following recipe for AIPW:
1. Predict the potential outcomes \(\hat{Y}_0\) and \(\hat{Y}_1\) as in RA
2. Predict the propensity score and form the weights \(w_1^{ipw}\) and \(w_0^{ipw}\) as in IPW
3. Form the pseudo-outcome \(\tilde{Y}^{aipw}\) and take its mean, e.g. by regressing it on a constant, which also delivers a valid standard error
We first recycle the OLS outcome predictions from RA and the logit treatment predictions from IPW:
# AIPW
Ytilde_parametric = Y1hat_ols - Y0hat_ols +
ipw_weight1 * (Y - Y1hat_ols) -
ipw_weight0 * (Y - Y0hat_ols)
aipw_parametric_hand = lm(Ytilde_parametric ~ 1)
summary(aipw_parametric_hand)
Call:
lm(formula = Ytilde_parametric ~ 1)
Residuals:
Min 1Q Median 3Q Max
-19861481 -24574 -1241 26577 3653559
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6976 2627 2.656 0.00792 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 261500 on 9914 degrees of freedom
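As described above, the regression on a constant is just a convenient way to compute the mean of the pseudo-outcome and the standard error of that mean; the following two lines reproduce the intercept estimate and its standard error by hand:
# Mean of the pseudo-outcome and its standard error (matches the intercept above)
mean(Ytilde_parametric)
sd(Ytilde_parametric) / sqrt(length(Ytilde_parametric))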
We have hand-coded AIPW with parametric estimators for the nuisance parameters above. Now we establish that we could have done just the same within the DoubleML package by using the DoubleMLIRM() function (IRM stands for interactive regression model and is another label for AIPW, which is sometimes also called the doubly robust estimator; all of these labels mean the same thing, so don't get confused).
# With DoubleML parametric
aipw_parametric_dml = DoubleMLIRM$new(dml_data, ml_g = lrn_ols, ml_m = lrn_logit,
n_folds = 1, apply_cross_fitting = FALSE)
aipw_parametric_dml$fit()
INFO [12:35:59.888] [mlr3] Applying learner 'classif.log_reg' on task 'nuis_m' (iter 1/1)
INFO [12:36:00.001] [mlr3] Applying learner 'regr.lm' on task 'nuis_g0' (iter 1/1)
INFO [12:36:00.062] [mlr3] Applying learner 'regr.lm' on task 'nuis_g1' (iter 1/1)
print(aipw_parametric_dml)
================= DoubleMLIRM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
Instrument(s):
No. Observations: 9915
------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml1
------------------ Machine learner ------------------
ml_g: regr.lm
ml_m: classif.log_reg
------------------ Resampling ------------------
No. folds: 1
No. repeated sample splits: 1
Apply cross-fitting: FALSE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 6976 2626 2.656 0.0079 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Again the handcoded and the DoubleML point estimates are identical:
all.equal(as.numeric(aipw_parametric_hand$coefficients),as.numeric(aipw_parametric_dml$coef))
[1] TRUE
Double ML for ATE estimation simply uses ML methods to estimate the nuisance parameters instead of OLS and logit. Like for PLR, we use random forest today:
# DoubleML random forest
aipw_ranger_dml = DoubleMLIRM$new(dml_data, ml_g = lrn_ranger, ml_m = lrn_ranger_prob)
aipw_ranger_dml$fit(store_predictions=TRUE)
INFO [12:36:00.188] [mlr3] Applying learner 'classif.ranger' on task 'nuis_m' (iter 1/5)
INFO [12:36:02.314] [mlr3] Applying learner 'classif.ranger' on task 'nuis_m' (iter 2/5)
INFO [12:36:04.328] [mlr3] Applying learner 'classif.ranger' on task 'nuis_m' (iter 3/5)
INFO [12:36:06.923] [mlr3] Applying learner 'classif.ranger' on task 'nuis_m' (iter 4/5)
INFO [12:36:09.093] [mlr3] Applying learner 'classif.ranger' on task 'nuis_m' (iter 5/5)
INFO [12:36:15.160] [mlr3] Applying learner 'regr.ranger' on task 'nuis_g0' (iter 1/5)
INFO [12:36:16.974] [mlr3] Applying learner 'regr.ranger' on task 'nuis_g0' (iter 2/5)
INFO [12:36:18.750] [mlr3] Applying learner 'regr.ranger' on task 'nuis_g0' (iter 3/5)
INFO [12:36:20.536] [mlr3] Applying learner 'regr.ranger' on task 'nuis_g0' (iter 4/5)
INFO [12:36:22.301] [mlr3] Applying learner 'regr.ranger' on task 'nuis_g0' (iter 5/5)
INFO [12:36:24.223] [mlr3] Applying learner 'regr.ranger' on task 'nuis_g1' (iter 1/5)
INFO [12:36:24.863] [mlr3] Applying learner 'regr.ranger' on task 'nuis_g1' (iter 2/5)
INFO [12:36:25.532] [mlr3] Applying learner 'regr.ranger' on task 'nuis_g1' (iter 3/5)
INFO [12:36:26.175] [mlr3] Applying learner 'regr.ranger' on task 'nuis_g1' (iter 4/5)
INFO [12:36:26.823] [mlr3] Applying learner 'regr.ranger' on task 'nuis_g1' (iter 5/5)
print(aipw_ranger_dml)
================= DoubleMLIRM Object ==================
------------------ Data summary ------------------
Outcome variable: y
Treatment variable(s): d
Covariates: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
Instrument(s):
No. Observations: 9915
------------------ Score & algorithm ------------------
Score function: ATE
DML algorithm: dml2
------------------ Machine learner ------------------
ml_g: regr.ranger
ml_m: classif.ranger
------------------ Resampling ------------------
No. folds: 5
No. repeated sample splits: 1
Apply cross-fitting: TRUE
------------------ Fit summary ------------------
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
d 10900 1159 9.408 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The rest is identical, as we can see by exactly replicating the DoubleML output once we extract the nuisance parameters from the DoubleML object:
# Extract nuisance parameters
Dhat_ranger = aipw_ranger_dml$predictions$ml_m
Y0hat_ranger = aipw_ranger_dml$predictions$ml_g0
Y1hat_ranger = aipw_ranger_dml$predictions$ml_g1
# Create pseudo outcome
Ytilde_ranger = Y1hat_ranger - Y0hat_ranger +
(D / Dhat_ranger) * (Y - Y1hat_ranger) -
(1 - D) / (1 - Dhat_ranger) * (Y - Y0hat_ranger)
# Run OLS with a constant
aipw_ranger_hand = lm(Ytilde_ranger ~ 1)
summary(aipw_ranger_hand)
Call:
lm(formula = Ytilde_ranger ~ 1)
Residuals:
Min 1Q Median 3Q Max
-2373530 -13163 -1835 13965 2920441
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10900 1159 9.407 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 115400 on 9914 degrees of freedom
The canonical check shows identical point estimates:
all.equal(as.numeric(aipw_ranger_hand$coefficients),as.numeric(aipw_ranger_dml$coef))
[1] TRUE
Double ML boils down to very simple recipes:
- We implemented Double ML by hand with standard parametric estimators and showed that calling the DoubleML functions with the same parametric methods produces identical results.
- We then used the DoubleML functions with random forests and showed that we could replicate their output by hand.
Either way required only a few lines of code and hopefully convinced you that everything boils down to relatively simple recipes.
Let’s collect what we achieved.
estimator_names = c("OLS","DoubleML PLR Parametric","DoubleML PLR Forest",
"DoubleML AIPW parametric","DoubleML AIPW Forest")
point_estimates = c(ols_dml$coef,plr_logit_dml$coef,plr_ranger_dml$coef,
aipw_parametric_dml$coef,aipw_ranger_dml$coef)
standard_errors = c(ols_dml$se,plr_logit_dml$se,plr_ranger_dml$se,
aipw_parametric_dml$se,aipw_ranger_dml$se)
# Graph powered by ChatGPT
# Create a data frame
data <- data.frame(
Estimator = factor(estimator_names, levels = rev(estimator_names)), # Reverse the order for correct plotting
Estimate = point_estimates,
SE = standard_errors
)
# Calculate confidence intervals
data$Lower <- data$Estimate - 1.96 * data$SE
data$Upper <- data$Estimate + 1.96 * data$SE
# Plot
ggplot(data, aes(x = Estimator, y = Estimate, color = Estimator)) +
geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2, color = "gray40", linewidth = 0.5) + # Custom error bars
geom_point(size = 4) + # Larger points
geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 0.5) + # More prominent horizontal line at zero
coord_flip() + # Flip coordinates to put estimators on the y-axis
labs(x = "Estimator", y = "Estimate", title = "Point Estimates with 95% Confidence Intervals") +
theme_light(base_size = 14) + # Lighter theme with larger base text size
theme(
plot.title = element_text(size = 16, face = "bold", color = "darkblue"), # Custom plot title
axis.title.x = element_text(size = 14, face = "bold"), # Custom x-axis title
axis.title.y = element_text(size = 14, face = "bold"), # Custom y-axis title
panel.grid.major = element_line(color = "gray80"), # Custom major grid lines
panel.grid.minor = element_blank(), # Remove minor grid lines
legend.position = "none" # Remove legend as color is mapped to Estimator but not needed
)
The differences in the point estimates are not striking. However, AIPW with random forest is by far the most precise estimator. This makes sense because AIPW is the efficient estimator in the likely case that effects are heterogeneous and random forest is expected to approximate the nuisance parameters much better than simple parametric models.
So far, we were interested in estimating an assumed homogeneous treatment effect \(\tau\) or the ATE. However, we might be interested in going beyond such summary measures and want to estimate conditional average treatment effects (CATEs)
\[ \tau(x) = E[Y(1) - Y(0)|X = x] \]
Causal forests as introduced by Athey et al. (2019) are arguably the most prominent estimator of such CATEs. They are implemented in the grf package and can be considered as just running a weighted residual-on-residual regression (this might not be obvious from the original paper, but for an accessible explanation see Athey and Wager (2019)).
It might feel counterintuitive that a method for homogeneous effects can suddenly be used to estimate heterogeneous effects. Unfortunately, we do not have the time to discuss why it actually makes sense. Instead, I show you that this is indeed what is going on.
First, run the default implementation of causal_forest() and plot the distribution of estimated CATEs:
cf = causal_forest(X,Y,D)
cate = predict(cf)$predictions
hist(cate)
We see that there is quite some heterogeneity, but we will not take a closer look today. However, the grf homepage hosts excellent tutorials on what to do with these kinds of results.
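Just as a pointer to what such a follow-up analysis could look like (we do not interpret the output here), grf ships with helpers like best_linear_projection(), which gives a doubly robust linear summary of how the CATEs vary with the covariates:
# Pointer: linear summary of effect heterogeneity provided by grf
best_linear_projection(cf, X)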
Today we are interested in understanding how these numbers come about. I told you that grf is actually running a weighted RoRR. To see this, first extract the outcome and treatment predictions that are stored in the causal_forest() object and use them to produce residuals:
Dhat_cf = cf$W.hat
Dres_cf = D - Dhat_cf
Yhat_cf = cf$Y.hat
Yres_cf = Y - Yhat_cf
They could be used to form the by now familiar PLR estimate:
plr_cf_hand = lm_robust(Yres_cf ~ 0 + Dres_cf)
summary(plr_cf_hand)
Call:
lm_robust(formula = Yres_cf ~ 0 + Dres_cf)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
Dres_cf 13783 1505 9.157 6.399e-20 10832 16733 9914
Multiple R-squared: 0.01111 , Adjusted R-squared: 0.01101
F-statistic: 83.85 on 1 and 9914 DF, p-value: < 2.2e-16
However, this is not what causal forests actually do. Causal forests produce an individualized estimate for every observation. To this end, they run a RoRR but give higher weights to observations that are similar to the one for which the CATE is estimated.
These weights, usually called \(\alpha\), can be extracted via the get_forest_weights() function:
alpha = get_forest_weights(cf)
Let’s illustrate this by replicating the CATE for the first observation in the sample as an example, which is
cate1 = cate[1]
cate1
[1] 5716.159
We try to replicate this number in a weighted RoRR:
first_try = lm_robust(Yres_cf ~ 0 + Dres_cf, weights = alpha[1, ])
summary(first_try)
Call:
lm_robust(formula = Yres_cf ~ 0 + Dres_cf, weights = alpha[1,
])
Weighted, Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
Dres_cf 5715 986.1 5.795 7.041e-09 3782 7648 9914
Multiple R-squared: 0.04641 , Adjusted R-squared: 0.04631
F-statistic: 33.58 on 1 and 9914 DF, p-value: 7.041e-09
and are slightly off :-(
This is because causal_forest() uses a constant in the RoRR. Once we have figured that out, we can perfectly replicate the output:
cate1_hand = lm_robust(Yres_cf ~ Dres_cf, weights = alpha[1, ])$coefficients[2]
cate1_hand
Dres_cf
5716.159
cat("\nEqual to grf output?",all.equal(as.numeric(cate1),as.numeric(cate1_hand)))
Equal to grf output? TRUE
You could repeat this for every observation by using the appropriate row of the alpha weight matrix.
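For example, a minimal sketch that replicates the first five CATEs:
# Replicate the first five CATEs via weighted RoRR with the corresponding rows of alpha
cate_hand = sapply(1:5, function(i) {
  lm_robust(Yres_cf ~ Dres_cf, weights = alpha[i, ])$coefficients[2]
})
cbind(grf = cate[1:5], by_hand = cate_hand)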
Some remarks:
- The results with and without a constant are usually very similar because the constant is zero asymptotically.
- Unlike with Double ML, we can NOT use the standard errors of this output. Inference for causal forest CATEs is much more complicated and beyond the scope of this notebook.
- My students in Tübingen programmed the Causal Forest Fun Shiny App to illustrate the different weighting in toy examples (have a look).
You see that things can be quite easy. If you want to take a deep dive, I collect here a variety of resources that can help you understand the theoretical underpinnings beyond the papers linked throughout the notebook:
- Causal ML book: brand-new textbook covering Double ML and much more
- Causal Inference for The Brave and True, especially Part II: accessible introduction
- Stats 361 lecture notes of Stefan Wager: more technical material
- The DoubleML and grf documentations explain the theory behind the approaches and provide great implementation examples
- Machine Learning for Economists, a list maintained by Dario Sansone, provides a nice overview of methodological work and applications
- My Causal ML teaching slides: if you like my style of teaching and notebooks like this, have a closer look