Goals:
Consider the following DGP with linear heterogeneous treatment effects:
\(p=10\) independent covariates \(X_1,\ldots,X_k,\ldots,X_{10}\) drawn from a uniform distribution: \(X_k \sim \text{Uniform}(-\pi,\pi)\)
The treatment model is \(W \sim \text{Bernoulli}(\underbrace{\Phi(\sin(X_1))}_{e(X)})\), where \(\Phi(\cdot)\) is the standard normal cumulative distribution function
The potential outcome model of the controls is \(Y(0) = \underbrace{\sin(X_1)}_{m_0(X)} + \varepsilon\), with \(\varepsilon \sim N(0,1)\)
The CATE function is a linear function of the first three covariates \(\tau(X) = \underbrace{0.3}_{\rho_1} X_1 + \underbrace{0.2}_{\rho_2} X_2 + \underbrace{0.1}_{\rho_3} X_3\)
The potential outcome model of the treated is \(Y(1) = m_0(X) + \tau(X) + \varepsilon\), with \(\varepsilon \sim N(0,1)\)
This leads to substantial effect heterogeneity driven by the first three covariates:
if (!require("grf")) install.packages("grf", dependencies = TRUE); library(grf)
if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE); library(tidyverse)
if (!require("patchwork")) install.packages("patchwork", dependencies = TRUE); library(patchwork)
if (!require("estimatr")) install.packages("estimatr", dependencies = TRUE); library(estimatr)
if (!require("np")) install.packages("np", dependencies = TRUE); library(np)
if (!require("crs")) install.packages("crs", dependencies = TRUE); library(crs)
if (!require("causalDML")) {
if (!require("devtools")) install.packages("devtools", dependencies = TRUE); library(devtools)
install_github(repo="MCKnaus/causalDML")
}; library(causalDML)
set.seed(1234)
# Set parameters
n = 1000
p = 10
rho = c(0.3,0.2,0.1,rep(0,7))
# Draw sample
x = matrix(runif(n*p,-pi,pi),ncol=p)
e = function(x){pnorm(sin(x))}   # propensity score e(X)
m0 = function(x){sin(x)}         # outcome model of the controls m0(X)
tau = x %*% rho                  # true CATE for each observation
w = rbinom(n,1,e(x[,1]))         # treatment assignment
y = m0(x[,1]) + w*tau + rnorm(n,0,1)  # observed outcome
hist(tau)
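Because the covariates are mean zero, the true ATE \(E[\tau(X)]\) is zero even though individual effects vary substantially. A quick check in the drawn sample:
# Sample mean and spread of the true CATEs
mean(tau)
sd(tau)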
Consider that we are interested in the heterogeneity with respect to the first five covariates, \(X_1\) to \(X_5\).
We draw a sample of \(N=1000\) and estimate the nuisance parameters \(e(X)=E[W|X]\), \(m(0,X)=E[Y|W=0,X]\) and \(m(1,X)=E[Y|W=1,X]\) using honest random forests with tuned hyperparameters and all ten covariates.
We first hand-code 2-fold cross-fitting and plug the resulting nuisance parameters into the pseudo-outcome \[\tilde{Y}_{ATE} = \underbrace{\hat{m}(1,X) - \hat{m}(0,X)}_{\text{outcome predictions}} + \underbrace{\frac{W (Y - \hat{m}(1,X))}{\hat{e}(X)} - \frac{(1-W) (Y - \hat{m}(0,X))}{1-\hat{e}(X)}}_{\text{weighted residuals}}\]
# 2-fold cross-fitting
m0hat = m1hat = ehat = rep(NA,n)
# Draw random indices for sample 1
index_s1 = sample(1:n,n/2)
# Create S1
x1 = x[index_s1,]
w1 = w[index_s1]
y1 = y[index_s1]
# Create sample 2 with those not in S1
x2 = x[-index_s1,]
w2 = w[-index_s1]
y2 = y[-index_s1]
# Model in S1, predict in S2
rf = regression_forest(x1,w1,tune.parameters = "all")
ehat[-index_s1] = predict(rf,newdata=x2)$predictions
rf = regression_forest(x1[w1==0,],y1[w1==0],tune.parameters = "all")
m0hat[-index_s1] = predict(rf,newdata=x2)$predictions
rf = regression_forest(x1[w1==1,],y1[w1==1],tune.parameters = "all")
m1hat[-index_s1] = predict(rf,newdata=x2)$predictions
# Model in S2, predict in S1
rf = regression_forest(x2,w2,tune.parameters = "all")
ehat[index_s1] = predict(rf,newdata=x1)$predictions
rf = regression_forest(x2[w2==0,],y2[w2==0],tune.parameters = "all")
m0hat[index_s1] = predict(rf,newdata=x1)$predictions
rf = regression_forest(x2[w2==1,],y2[w2==1],tune.parameters = "all")
m1hat[index_s1] = predict(rf,newdata=x1)$predictions
# Generate pseudo-outcome
pseudo_y = m1hat - m0hat +
w*(y-m1hat) / ehat - (1-w)*(y-m0hat) / (1-ehat)
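The unconditional mean of this pseudo-outcome is the AIPW estimate of the ATE. As a quick sanity check (a minimal sketch using the objects defined above), regressing it on a constant with lm_robust reproduces the point estimate together with its standard error:
# ATE estimate: mean of the pseudo-outcome with robust SE
summary(lm_robust(pseudo_y ~ 1))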
Now we use it as the outcome in a multivariate regression with the five heterogeneity variables as covariates:
lm_fit2 = lm_robust(pseudo_y~x[,1:5])
summary(lm_fit2)
Call:
lm_robust(formula = pseudo_y ~ x[, 1:5])
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.118936 0.06886 1.7271 8.445e-02 -0.016198 0.25407 994
x[, 1:5]1 0.234405 0.04041 5.8009 8.860e-09 0.155109 0.31370 994
x[, 1:5]2 0.213947 0.04082 5.2417 1.943e-07 0.133850 0.29404 994
x[, 1:5]3 0.077709 0.03862 2.0122 4.447e-02 0.001925 0.15349 994
x[, 1:5]4 0.007174 0.03726 0.1925 8.474e-01 -0.065948 0.08030 994
x[, 1:5]5 0.018603 0.03633 0.5121 6.087e-01 -0.052682 0.08989 994
Multiple R-squared: 0.06936 , Adjusted R-squared: 0.06468
F-statistic: 13.15 on 5 and 994 DF, p-value: 2.005e-12
se2 = lm_fit2$std.error
data.frame(Variable = c("Constant",paste0("X",1:5)),
Coefficient = lm_fit2$coefficients,
cil = lm_fit2$coefficients - 1.96*se2,
ciu = lm_fit2$coefficients + 1.96*se2,
truth = c(0,rho[1:5])) %>%
ggplot(aes(x=Variable,y=Coefficient,ymin=cil,ymax=ciu)) + geom_point(size=2.5,aes(colour="Estimate",shape="Estimate")) + geom_errorbar(width=0.15) +
geom_hline(yintercept=0) + geom_point(aes(x=Variable,y=truth,colour="Truth",shape="Truth"),size=2.5) +
scale_colour_manual(name="Legend", values = c("black","blue")) +
scale_shape_manual(name="Legend",values = c(19,8))
The estimated coefficients are close to the true ones, and the true coefficients are covered by the respective 95% confidence intervals.
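We can verify the coverage programmatically (a minimal check reusing the objects from above):
# Does each true coefficient lie in its 95% confidence interval?
truth = c(0, rho[1:5])
truth >= lm_fit2$coefficients - 1.96*se2 & truth <= lm_fit2$coefficients + 1.96*se2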
Thus, it is not surprising that the fitted values are strongly correlated with the true CATEs:
plot(tau,lm_fit2$fitted.values)
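To quantify this relationship, we can also compute the correlation directly:
# Correlation between true CATEs and fitted values
cor(tau, lm_fit2$fitted.values)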
causalDML package
Next we consider 5-fold cross-fitting using the causalDML package. This requires first estimating the standard average effects:
# 5-fold cross-fitting with causalDML package
# Create learner
forest = create_method("forest_grf",args=list(tune.parameters = "all"))
# Run
aipw = DML_aipw(y,w,x,ml_w=list(forest),ml_y=list(forest),cf=5)
summary(aipw$APO)
APO SE
0 -0.03810188 0.05205454
1 0.06029791 0.06423154
summary(aipw$ATE)
ATE SE t p
1 - 0 0.09840 0.07321 1.3441 0.1792
It stores the pseudo-outcome in the created object (object$ATE$delta) such that we can use it in the next step without rerunning the ML steps:
lm_fit5 = lm_robust(aipw$ATE$delta~x[,1:5])
summary(lm_fit5)
Call:
lm_robust(formula = aipw$ATE$delta ~ x[, 1:5])
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.100617 0.07027 1.4318 1.525e-01 -0.037284 0.23852 994
x[, 1:5]1 0.244793 0.04073 6.0102 2.599e-09 0.164867 0.32472 994
x[, 1:5]2 0.231625 0.04135 5.6014 2.751e-08 0.150479 0.31277 994
x[, 1:5]3 0.079803 0.03940 2.0256 4.308e-02 0.002491 0.15711 994
x[, 1:5]4 0.008015 0.03796 0.2112 8.328e-01 -0.066472 0.08250 994
x[, 1:5]5 0.023393 0.03736 0.6261 5.314e-01 -0.049922 0.09671 994
Multiple R-squared: 0.07435 , Adjusted R-squared: 0.0697
F-statistic: 14.41 on 5 and 994 DF, p-value: 1.215e-13
se5 = lm_fit5$std.error
data.frame(Variable = c("Constant",paste0("X",1:5)),
Coefficient = lm_fit5$coefficients,
cil = lm_fit5$coefficients - 1.96*se5,
ciu = lm_fit5$coefficients + 1.96*se5,
truth = c(0,rho[1:5])) %>%
ggplot(aes(x=Variable,y=Coefficient,ymin=cil,ymax=ciu)) + geom_point(size=2.5,aes(colour="Estimate",shape="Estimate")) + geom_errorbar(width=0.15) +
geom_hline(yintercept=0) + geom_point(aes(x=Variable,y=truth,colour="Truth",shape="Truth"),size=2.5) +
scale_colour_manual(name="Legend", values = c("black","blue")) +
scale_shape_manual(name="Legend",values = c(19,8))
Again, the predicted CATEs and the true ones are highly correlated:
plot(tau,lm_fit5$fitted.values)
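As a further check (reusing pseudo_y from the hand-coded 2-fold step), the two pseudo-outcomes target the same quantity and should therefore be strongly correlated:
# Correlation of hand-coded 2-fold and package-based 5-fold pseudo-outcomes
cor(pseudo_y, aipw$ATE$delta)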
Remark: In the (likely) case that the underlying CATE function is not truly linear, this procedure would estimate the Best Linear Predictor of the CATE with respect to the five variables.
Now we revisit the second DGP of notebook SNB_AIPW_DML with zero ATE but highly nonlinear effect heterogeneity:
\(p=10\) independent covariates \(X_1,\ldots,X_k,\ldots,X_{10}\) drawn from a uniform distribution: \(X_k \sim \text{Uniform}(-\pi,\pi)\)
The treatment model is \(W \sim \text{Bernoulli}(\underbrace{\Phi(\sin(X_1))}_{e(X)})\), where \(\Phi(\cdot)\) is the standard normal cumulative distribution function
The outcome model of the controls is \(Y(0) = \underbrace{\cos(X_1+\pi/2)}_{m_0(X)} + \varepsilon\), with \(\varepsilon \sim N(0,1)\)
The outcome model of the treated is \(Y(1) = \underbrace{\sin(X_1)}_{m_1(X)} + \varepsilon\), with \(\varepsilon \sim N(0,1)\)
The treatment effect function is \(\tau(X) = \sin(X_1) - \cos(X_1+\pi/2) = 2\sin(X_1)\), which averages to zero over the covariate distribution.
x = matrix(runif(n*p,-pi,pi),ncol=p)
e = function(x){pnorm(sin(x))}
m1 = function(x){sin(x)}
m0 = function(x){cos(x+1/2*pi)}
tau = function(x){m1(x) - m0(x)}
w = rbinom(n,1,e(x[,1]))
y = w*m1(x[,1]) + (1-w)*m0(x[,1]) + rnorm(n,0,1)
g1 = data.frame(x = c(-pi, pi)) %>% ggplot(aes(x)) + stat_function(fun=e,linewidth=1) + ylab("e") + xlab("X1")
g2 = data.frame(x = c(-pi, pi)) %>% ggplot(aes(x)) + stat_function(fun=m1,linewidth=1,aes(colour="Y1")) +
stat_function(fun=m0,linewidth=1,aes(colour="Y0")) + ylab("Y") + xlab("X1")
g3 = data.frame(x = c(-pi, pi)) %>% ggplot(aes(x)) + stat_function(fun=tau,linewidth=1) + ylab(expression(tau)) + xlab("X1")
g1 / g2 / g3
Consider that we are interested in the heterogeneity with respect to the first covariate \(X_1\).
We draw a sample of \(N=1000\) and estimate the nuisance parameters \(e(X)=E[W|X]\), \(m(0,X)=E[Y|W=0,X]\) and \(m(1,X)=E[Y|W=1,X]\) using honest random forests with tuned hyperparameters and all ten covariates.
We first hand-code 2-fold cross-fitting and plug the resulting nuisance parameters into the pseudo-outcome formula:
m0hat = m1hat = ehat = rep(NA,n)
# Draw random indices for sample 1
index_s1 = sample(1:n,n/2)
# Create S1
x1 = x[index_s1,]
w1 = w[index_s1]
y1 = y[index_s1]
# Create sample 2 with those not in S1
x2 = x[-index_s1,]
w2 = w[-index_s1]
y2 = y[-index_s1]
# Model in S1, predict in S2
rf = regression_forest(x1,w1,tune.parameters = "all")
ehat[-index_s1] = predict(rf,newdata=x2)$predictions
rf = regression_forest(x1[w1==0,],y1[w1==0],tune.parameters = "all")
m0hat[-index_s1] = predict(rf,newdata=x2)$predictions
rf = regression_forest(x1[w1==1,],y1[w1==1],tune.parameters = "all")
m1hat[-index_s1] = predict(rf,newdata=x2)$predictions
# Model in S2, predict in S1
rf = regression_forest(x2,w2,tune.parameters = "all")
ehat[index_s1] = predict(rf,newdata=x1)$predictions
rf = regression_forest(x2[w2==0,],y2[w2==0],tune.parameters = "all")
m0hat[index_s1] = predict(rf,newdata=x1)$predictions
rf = regression_forest(x2[w2==1,],y2[w2==1],tune.parameters = "all")
m1hat[index_s1] = predict(rf,newdata=x1)$predictions
# Generate pseudo-outcome
pseudo_y = m1hat - m0hat +
w*(y-m1hat) / ehat - (1-w)*(y-m0hat) / (1-ehat)
Now we use the pseudo-outcome in a kernel regression with \(X_1\) as the only covariate. We cross-validate the bandwidth, run the estimation with the np package, and plot the estimated curve:
z = as.data.frame(x[,1])
# Cross-validate bandwidth via least-squares cross-validation
bwobj = npregbw(ydat = pseudo_y, xdat = z, ckertype = 'gaussian', ckerorder = 2, regtype = 'lc', bwmethod = 'cv.ls')
# Undersmooth, i.e. choose a slightly smaller bandwidth than cross-validated
bw = bwobj$bw * 0.9
cate_model = npreg(tydat = pseudo_y, txdat = z, bws = bw, ckertype = 'gaussian', ckerorder = 2, regtype = 'lc')
plot(cate_model)
This looks very similar to the true function.
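To make the comparison explicit, we can overlay the estimate on the true function (a minimal sketch; it assumes the fitted values at the training points are stored in the npreg object's mean component):
# Overlay estimated and true CATE along X1
ord = order(x[,1])
plot(x[ord,1], cate_model$mean[ord], type="l", xlab="X1", ylab="CATE")
lines(x[ord,1], tau(x[ord,1]), lty=2, col="blue")
legend("topleft", legend=c("Estimate","Truth"), lty=c(1,2), col=c("black","blue"))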
causalDML package
Next we consider 5-fold cross-fitting using the causalDML package. This requires first estimating the standard average effects:
# 5-fold cross-fitting with causalDML package
# Create learner
forest = create_method("forest_grf",args=list(tune.parameters = "all"))
# Run
aipw = DML_aipw(y,w,x,ml_w=list(forest),ml_y=list(forest),cf=5)
summary(aipw$APO)
APO SE
0 -0.01052922 0.05749346
1 0.08131356 0.05128989
summary(aipw$ATE)
ATE SE t p
1 - 0 0.091843 0.082444 1.114 0.2655
It stores the pseudo-outcome in the created object (object$ATE$delta) such that we can use it in the next step without rerunning the ML steps. The full procedure is implemented in the kr_cate function:
kernel_reg_x1 = kr_cate(aipw$ATE$delta,x[, 1])
plot(kernel_reg_x1)
We observe that the estimated curve fits the true function quite well and that the zero line (x-axis) lies within the 95% confidence intervals where it should \(\Rightarrow\) we find what we should find.
Finally, let's check whether we find something that is not there by testing for heterogeneity with respect to \(X_2\).
kernel_reg_x2 = kr_cate(aipw$ATE$delta,x[, 2])
plot(kernel_reg_x2)
We find no evidence of heterogeneous effects along \(X_2\), as we should.
Consider again that we are interested in the heterogeneity with respect to the first covariate \(X_1\). We first reuse the hand-coded 2-fold cross-fitted pseudo-outcome in a series regression with B-splines using the crs package:
spline_gate = crs(pseudo_y ~ as.matrix(z))
plot(spline_gate,mean=T)
The spline regression also approximates the true function nicely.
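As with the kernel regression, we can quantify the fit against the truth (a small sketch; it assumes the crs object exposes its fitted values via fitted()):
# Correlation between true CATE and the spline fit
cor(tau(x[,1]), fitted(spline_gate))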
Again, we can reuse the pseudo-outcome (object$ATE$delta), this time with the spline_cate function of the causalDML package:
spline_reg_x1 = spline_cate(aipw$ATE$delta,x[, 1])
plot(spline_reg_x1)
Again, this closely resembles the true function. Most importantly, the zero line (x-axis) lies within the 95% confidence intervals where it should \(\Rightarrow\) we find what we should find.
Finally, let's check whether we find something that is not there by testing for heterogeneity with respect to \(X_2\).
spline_reg_x2 = spline_cate(aipw$ATE$delta,x[, 2])
plot(spline_reg_x2)
We find no evidence of heterogeneous effects along \(X_2\), as we should.
Estimating heterogeneous effects with pre-specified heterogeneity variables requires only a few lines of additional code, reusing the pseudo-outcome from the average effect estimation
By using the right pseudo-outcome, we can model effect sizes the same way we are used to modeling outcome levels
Some suggestions:
Increase/decrease the number of observations
Create a non-linear CATE in the first part (a sketch follows below)
Change the treatment shares
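For the non-linear CATE suggestion, a minimal sketch (the functional form is purely illustrative): rerun the first part's sample draw, replacing the linear CATE before generating the outcome:
# Hypothetical non-linear CATE replacing tau = x %*% rho
tau = sin(x[,1]) * x[,2]
y = m0(x[,1]) + w*tau + rnorm(n,0,1)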