“Double Machine Learning based Program Evaluation under Unconfoundedness”

Michael C. Knaus

Version 2

This notebook replicates the main results of the paper (https://arxiv.org/abs/2003.03191) using a different seed (the results in the paper are obtained with seed 1234; below, 12345 is specified). The notebook should help to …

  • show the underlying code for researchers interested in replicating the analysis using the same data or adapting it to new datasets

  • show that most of the results reported in the paper are not sensitive to the seed used to initialize the random number generators that are called when defining the cross-fitting folds and within the random forests

The explanations throughout the notebook are kept short. Please see the paper (https://arxiv.org/abs/2003.03191) for more details regarding the parameters of interest and their Double Machine Learning based estimation.

Click on the “Code” button on the top right of this notebook to “Hide All Code” and see only the results or “Download Rmd” to extract the underlying code.

Running the analyses in this notebook took roughly two days on a SWITCHengine with eight cores and 32 GB RAM.



Preparation

The analysis is mostly based on the causalDML R package (https://github.com/MCKnaus/causalDML) that was developed in parallel to the paper. The preparation of the “Database_SALMP_MCK2020.RData” file that is loaded can be replicated by requesting the data via FORSbase (https://forsbase.unil.ch/project/study-public-overview/17035/1/) and running the data preparation file (https://github.com/MCKnaus/mcknaus.github.io/blob/master/assets/code/Data_preparation_MCK2020.R).

## Define and set seed
seed = 12345
set.seed(seed)

## Load relevant packages; if you have not yet installed causalDML, uncomment the next two lines
# library(devtools)
# install_github(repo="MCKnaus/causalDML")
library(causalDML)
library(policytree)
library(tidyverse)
library(lmtest)
library(sandwich)
library(matrixStats)

## Set to path where data are stored for replication
# setwd("C:/Users/Administrator/switchdrive/Papers/Multi Treat Labor/Paper/Data")
load("Database_SALMP_MCK2020.RData")



Average effects

The causalDML command first estimates the cross-fitted nuisance parameters and uses them to calculate the doubly robust scores for the APO and ATE, which are then used to estimate the APO and ATE(T).

# Create the ML method to be used for nuisance parameter prediction
forest = create_method("forest_grf",args=list(tune.parameters = "all",seed=seed))

# Run the main function that outputs nuisance parameters, APO and ATE
cDML = causalDML(y,w,x,ml_w=list(forest),ml_y=list(forest))
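Schematically, the doubly robust (AIPW) score for the APO of a program \(w\) combines the outcome regression \(\hat m_w(x)\) and the propensity score \(\hat e_w(x)\). The following self-contained sketch illustrates the construction on simulated data with known (oracle) nuisance functions; all names are illustrative, and causalDML itself estimates the nuisances with cross-fitted forests:

```r
## Illustrative sketch of the doubly robust (AIPW) scores on simulated data
set.seed(1)
n = 1000
x1 = rnorm(n)
e1 = plogis(x1)                       # true propensity of treatment 1
w_sim = rbinom(n, 1, e1)
m0 = x1; m1 = x1 + 2                  # true outcome regressions, so ATE = 2
y_sim = ifelse(w_sim == 1, m1, m0) + rnorm(n)

# Doubly robust score for each potential outcome:
# gamma_w = m_w(x) + 1{W = w} / e_w(x) * (Y - m_w(x))
gamma1 = m1 + (w_sim == 1) / e1 * (y_sim - m1)
gamma0 = m0 + (w_sim == 0) / (1 - e1) * (y_sim - m0)

# Averaging the scores gives the APOs; their difference is the ATE score (delta)
apo1 = mean(gamma1); apo0 = mean(gamma0)
delta = gamma1 - gamma0
c(APO1 = apo1, APO0 = apo0, ATE = mean(delta))
```

In the package, the analogous score matrices are stored as gamma (APO scores) and delta (ATE scores) and reused throughout the notebook.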



Average potential outcome (APO)

The following table and plot show the DML estimated APOs.

summary(cDML$APO)
                APO         SE
no program 14.79627 0.05607573
job search 13.75765 0.11484011
vocational 18.01463 0.42204324
computer   18.03404 0.42602710
language   17.28424 0.37482108
plot(cDML$APO,label_w)



Average treatment effects (ATE and ATET)

ATE

The following table shows the results of all possible ATEs given the five different treatments:

summary(cDML$ATE)
                              ATE        SE       t         p    
job search - no program -1.038615  0.125267 -8.2912 < 2.2e-16 ***
vocational - no program  3.218358  0.425095  7.5709 3.757e-14 ***
computer - no program    3.237774  0.429024  7.5468 4.520e-14 ***
language - no program    2.487976  0.377771  6.5859 4.556e-11 ***
vocational - job search  4.256973  0.436936  9.7428 < 2.2e-16 ***
computer - job search    4.276389  0.440728  9.7030 < 2.2e-16 ***
language - job search    3.526591  0.391096  9.0172 < 2.2e-16 ***
computer - vocational    0.019416  0.599454  0.0324    0.9742    
language - vocational   -0.730382  0.563997 -1.2950    0.1953    
language - computer     -0.749798  0.566864 -1.3227    0.1859    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


ATET

The following tables show the results for all possible ATETs given the five different treatments. The first line, for example, shows an effect of -1.06 of being in a job search program vs. no program for those who were actually observed in no program, or, formally adapting the notation of the paper, \(E[Y(job~search) - Y(no~program) \mid W = no~program]\). As we can condition on all five programs, we get five tables:

# Estimate ATETs for the programs
atets = matrix(NA,4,2)
for (i in 1:5) {
  print(label_w[i])
  APO_atet = APO_dml_atet(y,cDML$APO$m_mat,cDML$APO$w_mat,cDML$APO$e_mat,cDML$APO$cf_mat,treated=i)
  ATET = ATE_dml(APO_atet)
  summary(ATET)
  # Store the "program vs. no program" contrast for those observed in program i
  # (for i = 1, i.e. no program, the zero index makes this a no-op)
  atets[i-1,] = ATET$results[i-1,1:2]
}
[1] "No program"
                             ATET        SE       t         p    
job search - no program -1.061727  0.130743 -8.1207 4.717e-16 ***
vocational - no program  3.250641  0.425757  7.6350 2.291e-14 ***
computer - no program    3.195185  0.432797  7.3826 1.571e-13 ***
language - no program    2.449338  0.383402  6.3884 1.688e-10 ***
vocational - job search  4.312368  0.438981  9.8236 < 2.2e-16 ***
computer - job search    4.256912  0.445830  9.5483 < 2.2e-16 ***
language - job search    3.511065  0.398179  8.8178 < 2.2e-16 ***
computer - vocational   -0.055456  0.602127 -0.0921    0.9266    
language - vocational   -0.801303  0.568047 -1.4106    0.1584    
language - computer     -0.745847  0.573260 -1.3011    0.1932    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "Job search"
                            ATET       SE       t         p    
job search - no program -0.98007  0.12116 -8.0891 6.116e-16 ***
vocational - no program  3.02888  0.47650  6.3565 2.078e-10 ***
computer - no program    3.43079  0.46849  7.3231 2.452e-13 ***
language - no program    2.75514  0.42315  6.5110 7.520e-11 ***
vocational - job search  4.00895  0.48375  8.2873 < 2.2e-16 ***
computer - job search    4.41086  0.47582  9.2699 < 2.2e-16 ***
language - job search    3.73521  0.43112  8.6639 < 2.2e-16 ***
computer - vocational    0.40191  0.66017  0.6088    0.5427    
language - vocational   -0.27374  0.62994 -0.4345    0.6639    
language - computer     -0.67565  0.62361 -1.0834    0.2786    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "Vocational"
                             ATET        SE       t         p    
job search - no program -1.382461  0.155460 -8.8927 < 2.2e-16 ***
vocational - no program  2.787897  0.392552  7.1020 1.243e-12 ***
computer - no program    2.575450  0.467496  5.5090 3.622e-08 ***
language - no program    2.867676  0.402898  7.1176 1.110e-12 ***
vocational - job search  4.170358  0.423084  9.8570 < 2.2e-16 ***
computer - job search    3.957911  0.485932  8.1450 3.863e-16 ***
language - job search    4.250137  0.426540  9.9642 < 2.2e-16 ***
computer - vocational   -0.212447  0.587166 -0.3618    0.7175    
language - vocational    0.079779  0.545000  0.1464    0.8836    
language - computer      0.292226  0.591884  0.4937    0.6215    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "Computer"
                            ATET       SE       t         p    
job search - no program -1.26680  0.15684 -8.0769 6.761e-16 ***
vocational - no program  3.10856  0.48660  6.3884 1.688e-10 ***
computer - no program    3.56250  0.39968  8.9134 < 2.2e-16 ***
language - no program    3.37081  0.47934  7.0322 2.053e-12 ***
vocational - job search  4.37536  0.50556  8.6545 < 2.2e-16 ***
computer - job search    4.82930  0.42993 11.2329 < 2.2e-16 ***
language - job search    4.63761  0.50019  9.2717 < 2.2e-16 ***
computer - vocational    0.45394  0.60194  0.7541    0.4508    
language - vocational    0.26225  0.66088  0.3968    0.6915    
language - computer     -0.19169  0.59775 -0.3207    0.7485    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "Language"
                            ATET       SE       t         p    
job search - no program -0.60883  0.19432 -3.1332 0.0017301 ** 
vocational - no program  4.02450  0.79810  5.0426 4.605e-07 ***
computer - no program    3.36646  0.73306  4.5924 4.391e-06 ***
language - no program    1.04701  0.29307  3.5725 0.0003538 ***
vocational - job search  4.63333  0.80990  5.7208 1.065e-08 ***
computer - job search    3.97529  0.74590  5.3296 9.880e-08 ***
language - job search    1.65585  0.33816  4.8966 9.776e-07 ***
computer - vocational   -0.65804  1.05962 -0.6210 0.5345920    
language - vocational   -2.97748  0.84299 -3.5320 0.0004127 ***
language - computer     -2.31944  0.78089 -2.9703 0.0029766 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



As in the paper, we focus here and in what follows on the comparisons of programs vs. no program and ignore the comparisons between programs. The average effects are very similar to those reported in the paper, which shows that they are robust to the alternative seed.

# Plot ATE and ATET together
ates = cDML$ATE$results[1:4,1:2]
df = data.frame(Estimand = c(rep("ATE",4),rep("ATET",4)),
                program = factor(rep(label_w[-1],2),label_w[-1]),
                effect = c(ates[,1],atets[,1]))
df$low = df$effect - 1.96 * c(ates[,2],atets[,2])
df$up = df$effect + 1.96 * c(ates[,2],atets[,2])

ggplot(data = df,mapping = aes(x = program, y = effect,ymin = low,
                                   ymax = up,group = Estimand, linetype = Estimand)) +
          geom_point(size=2,position = position_dodge(width = 0.4)) +  
          geom_errorbar(width=0.2,position = position_dodge(width = 0.4))  + 
          theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1),axis.title.x=element_blank()) +
          ylab("Average treatment effects") + geom_hline(yintercept = 0) +
          theme(text=element_text(family="serif",size = 14, colour="black")) 




Heterogeneous effects

The output of the causalDML command contains an APO_dml object and an ATE_dml object that carry a gamma and delta matrix storing the doubly robust APO and ATE scores, respectively. The delta matrix of the ATE_dml object can be used to conveniently estimate heterogeneous effects.

GATEs

Group average treatment effects (GATEs) represent the average effects in different subgroups. Such subgroup analyses are an integral part of most program evaluations and usually require rerunning the preferred method in subsamples. Here, it suffices to regress the doubly robust ATE score, used as a pseudo-outcome, on a group indicator in a standard OLS regression to estimate GATEs.

The results shown in the following are very similar to those presented in Table 4 of the paper, suggesting that the GATEs, too, are not sensitive to the choice of the seed.
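To see why this works, note that with a single dummy regressor the OLS coefficients are exactly the group means of the pseudo-outcome and their difference. A hypothetical toy example with simulated scores (not the study data):

```r
## Toy illustration: OLS of a doubly robust score on a group dummy recovers GATEs
set.seed(1)
n = 500
group = rbinom(n, 1, 0.5)              # e.g. a female dummy
score = 2 + 1.5 * group + rnorm(n)     # simulated DR ATE score (pseudo-outcome)
fit = lm(score ~ group)
coef(fit)
# The intercept equals the mean score (GATE) in the reference group and the
# slope equals the difference in mean scores between the two groups:
all.equal(unname(coef(fit)[1]), mean(score[group == 0]))
all.equal(unname(coef(fit)[2]),
          mean(score[group == 1]) - mean(score[group == 0]))
```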

Gender

## GATEs gender
for (i in 1:4) {
  print(label_w[i+1])
  temp_ols = lm(cDML$ATE$delta[,i] ~ z_blp$female)
  print(coeftest(temp_ols, vcov = vcovHC(temp_ols)))
}
[1] "Job search"

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  -1.30196    0.16520 -7.8811 3.296e-15 ***
z_blp$female  0.59799    0.25324  2.3613   0.01821 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Vocational"

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)   3.76423    0.54987  6.8457 7.679e-12 ***
z_blp$female -1.23955    0.86366 -1.4352    0.1512    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Computer"

t test of coefficients:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.15071    0.59528  3.6129 0.000303 ***
z_blp$female  2.46847    0.85507  2.8869 0.003892 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Language"

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)   3.24428    0.46511  6.9753 3.082e-12 ***
z_blp$female -1.71740    0.77643 -2.2119   0.02698 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Foreigner

## GATEs foreigner
for (i in 1:4) {
  print(label_w[i+1])
  temp_ols = lm(cDML$ATE$delta[,i] ~ z_blp$foreigner)
  print(coeftest(temp_ols, vcov = vcovHC(temp_ols)))
}
[1] "Job search"

t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
(Intercept)     -1.29340    0.15933 -8.1175 4.843e-16 ***
z_blp$foreigner  0.70197    0.25746  2.7265  0.006403 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Vocational"

t test of coefficients:

                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      2.41846    0.53053  4.5586 5.16e-06 ***
z_blp$foreigner  2.20381    0.88661  2.4857  0.01293 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Computer"

t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
(Intercept)      3.66128    0.49838  7.3464 2.061e-13 ***
z_blp$foreigner -1.16680    0.93833 -1.2435    0.2137    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Language"

t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
(Intercept)      3.47048    0.52174  6.6518 2.919e-11 ***
z_blp$foreigner -2.70691    0.71890 -3.7654 0.0001665 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Employability

## GATEs employability
for (i in 1:4) {
  print(label_w[i+1])
  temp_ols = lm(cDML$ATE$delta[,i] ~ z_blp$employability)
  print(coeftest(temp_ols, vcov = vcovHC(temp_ols)))
}
[1] "Job search"

t test of coefficients:

                     Estimate Std. Error t value Pr(>|t|)   
(Intercept)          -0.20813    0.32970 -0.6313  0.52786   
z_blp$employability2 -0.90582    0.36020 -2.5148  0.01191 * 
z_blp$employability3 -1.62687    0.49524 -3.2850  0.00102 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Vocational"

t test of coefficients:

                     Estimate Std. Error t value  Pr(>|t|)    
(Intercept)            5.6555     1.0312  5.4845 4.164e-08 ***
z_blp$employability2  -2.6602     1.1489 -2.3156  0.020585 *  
z_blp$employability3  -4.7576     1.5087 -3.1534  0.001614 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Computer"

t test of coefficients:

                     Estimate Std. Error t value  Pr(>|t|)    
(Intercept)            5.3963     1.0863  4.9676 6.796e-07 ***
z_blp$employability2  -2.4153     1.1966 -2.0186   0.04353 *  
z_blp$employability3  -3.7462     1.6566 -2.2613   0.02374 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Language"

t test of coefficients:

                     Estimate Std. Error t value  Pr(>|t|)    
(Intercept)           2.86716    0.84162  3.4067 0.0006579 ***
z_blp$employability2 -0.49344    0.95351 -0.5175 0.6048096    
z_blp$employability3 -0.11146    1.46490 -0.0761 0.9393486    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



Best linear prediction of GATE

We can also model effect heterogeneity in a multivariate OLS. Again, the ATE score is reused as a pseudo-outcome, now in a plain-vanilla OLS regression with heteroscedasticity-robust standard errors. The estimated coefficients are again mostly in line with the ones reported in Table 5 of the paper.

## BLP CATEs
for (i in 1:4) {
  print(label_w[i+1])
  temp_df = data.frame(y = cDML$ATE$delta[,i])
  temp_df = cbind(temp_df,z_blp)
  temp_ols = lm("y ~ .",data = temp_df)
  print(coeftest(temp_ols, vcov = vcovHC(temp_ols)))
}
[1] "Job search"

t test of coefficients:

                    Estimate Std. Error t value  Pr(>|t|)    
(Intercept)        -0.427987   0.703382 -0.6085   0.54288    
female              0.236418   0.266986  0.8855   0.37589    
age                 0.022365   0.014863  1.5048   0.13238    
foreigner           0.489428   0.264692  1.8491   0.06446 .  
employability2     -0.617320   0.369749 -1.6696   0.09501 .  
employability3     -1.148397   0.510107 -2.2513   0.02437 *  
past_incomein10000 -0.264551   0.064185 -4.1217 3.766e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Vocational"

t test of coefficients:

                    Estimate Std. Error t value Pr(>|t|)   
(Intercept)         4.573663   2.386915  1.9161 0.055352 . 
female             -2.169673   0.921052 -2.3556 0.018494 * 
age                 0.100046   0.049381  2.0260 0.042770 * 
foreigner           1.583702   0.899843  1.7600 0.078417 . 
employability2     -1.821302   1.169122 -1.5578 0.119277   
employability3     -3.365911   1.538792 -2.1874 0.028719 * 
past_incomein10000 -0.687101   0.230396 -2.9823 0.002862 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Computer"

t test of coefficients:

                    Estimate Std. Error t value Pr(>|t|)  
(Intercept)         4.958203   2.447231  2.0260  0.04276 *
female              1.837452   0.911824  2.0151  0.04389 *
age                 0.045730   0.050252  0.9100  0.36281  
foreigner          -1.507276   0.962790 -1.5655  0.11746  
employability2     -2.129763   1.219728 -1.7461  0.08080 .
employability3     -3.350109   1.684136 -1.9892  0.04668 *
past_incomein10000 -0.405734   0.187346 -2.1657  0.03034 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[1] "Language"

t test of coefficients:

                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         6.026019   2.091576  2.8811 0.003964 ** 
female             -1.344921   0.809590 -1.6612 0.096671 .  
age                -0.067032   0.044410 -1.5094 0.131202    
foreigner          -2.649330   0.741761 -3.5717 0.000355 ***
employability2     -1.092292   0.978464 -1.1163 0.264284    
employability3     -1.160974   1.502992 -0.7724 0.439856    
past_incomein10000  0.324794   0.179486  1.8096 0.070366 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
rm(temp_df,temp_ols)



Kernel regression GATEs

Next, we reuse the ATE score as pseudo outcome in a one-dimensional kernel regression. The causalDML package implements this based on the np package.

The figures show the results in the following order: job search, vocational, computer and language. They are all very similar to the ones provided in the paper.
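As a minimal stand-in for what kr_cate does internally, the following toy example smooths a simulated pseudo-outcome over a single covariate with base R's ksmooth (Nadaraya-Watson regression). The real implementation relies on the np package with data-driven bandwidth selection, so this only illustrates the principle:

```r
## Toy stand-in: kernel regression of a simulated DR score on one covariate
set.seed(1)
n = 1000
income = runif(n, 0, 10)                    # simulated "past income"
score = sin(income) + rnorm(n, sd = 2)      # pseudo-outcome with nonlinear GATE
fit = ksmooth(income, score, kernel = "normal", bandwidth = 1.5,
              x.points = seq(0, 10, by = 0.1))
plot(fit, type = "l", xlab = "Past income (simulated)", ylab = "GATE estimate")
lines(fit$x, sin(fit$x), lty = 2)           # true function for comparison
```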

## Kernel regression CATEs
list_kr_cates_inc = vector("list",4)
for (i in 1:4) {
  list_kr_cates_inc[[i]] = kr_cate(cDML$ATE$delta[,i],x[,"past_income"])
}
# Plot Kernel regression CATEs
for (i in 1:4) {
  print(plot(list_kr_cates_inc[[i]],z_label="Past income",yrange=c(-5,10)))
}



Series regression GATEs

Next, we reuse the ATE score as pseudo outcome in a one-dimensional series regression. The causalDML package implements this based on the crs package.

The figures show the results in the following order: job search, vocational, computer and language. The only notable difference occurs for the language programs. The estimated function in the paper is S-shaped, while here it is linear. However, the observation that job seekers with lower previous income benefit less from language training is by and large confirmed.
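The principle of a series estimator can likewise be sketched with a fixed B-spline basis in plain OLS; spline_cate itself uses the crs package with data-driven basis selection, so the following simulated toy example is only illustrative:

```r
## Toy stand-in: series (B-spline) regression of a simulated DR score
library(splines)
set.seed(1)
n = 1000
income = runif(n, 0, 10)                    # simulated "past income"
score = sin(income) + rnorm(n, sd = 2)      # pseudo-outcome with nonlinear GATE
fit = lm(score ~ bs(income, df = 6))        # fixed cubic B-spline basis
grid = data.frame(income = seq(0, 10, by = 0.1))
pred = predict(fit, newdata = grid)
plot(grid$income, pred, type = "l", xlab = "Past income (simulated)",
     ylab = "GATE estimate")
```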

## Series regression GATEs
list_sr_cates_inc = vector("list",4)
for (i in 1:4) {
  list_sr_cates_inc[[i]] = spline_cate(cDML$ATE$delta[,i],x[,"past_income"])
}
# Plot Series regression GATEs
for (i in 1:4) {
  print(plot(list_sr_cates_inc[[i]],z_label="Past income",yrange=c(-5,10)))
}



Individualized average treatment effects

We further predict individualized average treatment effects (IATEs) using the DR- and the NDR-learner. As in the paper, we focus on the out-of-sample variants described in Algorithms 1 and 2 of the paper, which are implemented in the ndr_learner function of the causalDML package.

The results confirm the potentially erratic behavior of the DR-learner. As in the paper, predicted IATEs can take values outside of the possible bounds. However, the outliers here also occur for vocational training. Most importantly, the NDR-learner substantially stabilizes the estimates in this replication as well.
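The DR-learner's second stage can be sketched as a cross-fitted regression of the doubly robust score on covariates, with predictions always made out-of-fold. In this simulated toy version, lm stands in for the honest random forest that ndr_learner actually uses:

```r
## Toy sketch of the DR-learner second stage with two cross-fitting folds
set.seed(1)
n = 1000
x1 = rnorm(n); x2 = rnorm(n)
delta_score = 1 + 0.5 * x1 + rnorm(n, sd = 3)   # simulated DR ATE score
fold = sample(rep(1:2, n / 2))                  # random fold assignment
iate = numeric(n)
for (k in 1:2) {
  # Fit on the other fold, predict out-of-sample on fold k
  fit = lm(delta_score ~ x1 + x2, subset = fold != k)
  iate[fold == k] = predict(fit, data.frame(x1 = x1, x2 = x2)[fold == k, ])
}
summary(iate)   # out-of-sample IATE predictions
```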

## (N)DR-learner
ndr = ndr_learner(y,w,x,ml_w = list(forest),ml_y = list(forest),ml_tau = list(forest),compare_all = FALSE)
# Plot the results
df_box = NULL
for (i in 1:4) {
  df = data.frame("DRL" = ndr$cates[i,,1], "NDRL" = ndr$cates[i,,2])
  df = gather(df)
  df = cbind(label_w[i+1],df)
  colnames(df)[1] = "label"
  df_box = rbind(df_box,df)
}
ggplot(data=df_box) + geom_boxplot( aes(x=factor(label,label_w[-1]),y=value,fill=key)) +
  theme_bw() + theme(axis.title.x=element_blank(),legend.title = element_blank()) +
  ylab("Individualized average treatment effect") + geom_hline(yintercept = 0) + geom_hline(yintercept = -31,linetype="dashed") +
  geom_hline(yintercept = 31,linetype="dashed") +
  theme(text=element_text(family="serif",size = 16, colour="black"),axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_fill_grey(start = 0.9,end=0.4)


The classification analysis also shows that the same eight variables as in Table 6 of the paper are most predictive of being in the highest or lowest quintile of the IATE distribution:

# Classification analysis
diff = matrix(NA,ncol(x),4)
rownames(diff) = label_x
colnames(diff) = label_w[-1]
for (i in 1:4) {
  clan_i = clan(ndr$cates[i,,2],scale(x))
  diff[,i] = clan_i[,5][-1] - clan_i[,1][-1]
}
round(diff[order(rowMaxs(abs(diff)),decreasing=T),][1:8,],digits=2)
                                                 Job search Vocational Computer Language
Previous job: unskilled worker                         1.03       0.58     0.34    -1.32
Mother tongue other than German, French, Italian       0.74       0.57     0.00    -1.30
Past income                                           -1.29      -0.92    -1.14     1.17
Swiss citizen                                         -0.73      -0.49     0.07     1.25
Qualification: some degree                            -0.96      -0.54    -0.40     1.24
Qualification: unskilled                               0.84       0.34     0.31    -1.14
Previous job: skilled worker                          -0.81      -0.31    -0.12     0.96
Fraction of months employed last 2 years              -0.92      -0.34    -0.45     0.31
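The quintile comparison underlying clan can be sketched in a few lines of base R. The example below uses simulated, standardized covariates (v1 drives the IATEs, v2 does not) rather than the study data:

```r
## Toy sketch of the classification analysis (CLAN): compare standardized
## covariate means in the top vs. bottom IATE quintile
set.seed(1)
n = 1000
xs = scale(matrix(rnorm(2 * n), n, 2, dimnames = list(NULL, c("v1", "v2"))))
cate = 2 * xs[, 1] + rnorm(n)              # IATEs driven by v1, not v2
q = cut(cate, quantile(cate, 0:5 / 5), include.lowest = TRUE, labels = FALSE)
top_vs_bottom = colMeans(xs[q == 5, ]) - colMeans(xs[q == 1, ])
round(top_vs_bottom, 2)   # large for v1, close to zero for v2
```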




Optimal assignment rules

Finally, the gamma matrix of the APO_dml object can be used as input for the policytree R package. Figure 5 in the paper displayed the chosen program in the final leaf, which required modifying the code of the package. Here, we apply the standard plot function of the package, which numbers the programs: (1) no program, (2) job search, (3) vocational, (4) computer and (5) language.

We observe that the depth-one tree with five covariates (first graph below) is identical to Figure 5a. However, the other trees chose different splits, indicating that the choice of the seed might matter for the estimated assignment rules.
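Conceptually, a depth-one policy tree searches over all covariates and split points for the single split that maximizes the sum of the chosen arms' doubly robust scores in the two resulting leaves. A brute-force toy version on simulated scores for one covariate and two arms (policy_tree performs this exhaustive search efficiently over all covariates and depths):

```r
## Toy sketch of a depth-one policy tree: exhaustive single-split search
set.seed(1)
n = 500
z = runif(n)                               # one covariate
# Simulated score matrix: arm 1 is better for z < 0.5, arm 2 otherwise
gamma_toy = cbind(ifelse(z < 0.5, 1, 0), ifelse(z < 0.5, 0, 1)) +
  matrix(rnorm(2 * n, sd = 0.1), n, 2)
best = list(value = -Inf, cut = NA, arms = c(NA, NA))
for (s in sort(unique(z))[-n]) {           # candidate split points
  left = z <= s
  vals_l = colSums(gamma_toy[left, , drop = FALSE])
  vals_r = colSums(gamma_toy[!left, , drop = FALSE])
  v = max(vals_l) + max(vals_r)            # value of the best arm per leaf
  if (v > best$value) {
    best = list(value = v, cut = s,
                arms = c(left = which.max(vals_l), right = which.max(vals_r)))
  }
}
best   # split near 0.5, arm 1 in the left leaf, arm 2 in the right leaf
```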

## Estimate optimal assignment rules
# Depth one
pt_low1 = policy_tree(x_pt_low,cDML$APO$gamma,depth = 1)
pt_high1 = policy_tree(x_pt_high,cDML$APO$gamma,depth = 1)

# Depth two
pt_low2 = policy_tree(x_pt_low,cDML$APO$gamma,depth = 2)
pt_high2 = policy_tree(x_pt_high,cDML$APO$gamma,depth = 2)
## Plot estimated optimal assignment rules
plot(pt_low1)

plot(pt_high1)

plot(pt_low2)

plot(pt_high2)


# save.image("Notebook_image.RData")