Goals:

  • Estimate subgroup effects

  • Estimate best linear predictor of heterogeneity

  • Estimate nonparametric heterogeneity with kernel and spline regression


Introducing the data

The Application Notebook builds on the dataset that is kindly provided in the hdm package. The data was used in Chernozhukov and Hansen (2004). Their paper investigates the effect of participation in the employer-sponsored 401(k) retirement savings plan (p401) on net assets (net_tfa). Since then the data was used to showcase many new methods. It is not the most comprehensive dataset with basically ten covariates/regressors/predictors:

  • age: age

  • db: defined benefit pension

  • educ: education (in years)

  • fsize: family size

  • hown: home owner

  • inc: income (in US $)

  • male: male

  • marr: married

  • pira: participation in individual retirement account (IRA)

  • twoearn: two earners

However, it is publicly available and the relatively few covariates ensure that the programs do not run too long.

# To install the causalDML package uncomment the following two lines
# library(devtools)
# install_github(repo="MCKnaus/causalDML")

# Load the packages required for later
library(hdm)
library(tidyverse)
library(causalDML)
library(grf)
library(estimatr)


set.seed(1234) # for replicability
options(scipen = 10) # Switch off scientific notation

data(pension)
# Outcome
Y = pension$net_tfa
# Treatment
W = pension$p401
# Create main effects matrix
X = model.matrix(~ 0 + age + db + educ + fsize + hown + inc + male + marr + pira + twoearn, data = pension)



Double ML for AIPW with causalDML package

Like previous notebooks, we create the pseudo-outcome \[\tilde{Y}_{ATE} = \underbrace{\hat{m}(1,X) - \hat{m}(0,X)}_{\text{outcome predictions}} + \underbrace{\frac{W (Y - \hat{m}(1,X))}{\hat{e}(X)} - \frac{(1-W) (Y - \hat{m}(0,X))}{1-\hat{e}(X)}}_{\text{weighted residuals}}\]

by running the DML_aipw function:

# 5-fold cross-fitting with causalDML package
aipw = DML_aipw(Y,W,X)
summary(aipw$ATE)
          ATE      SE      t         p    
1 - 0 11288.4  1166.9 9.6736 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# If you have more time, tune the forest
# forest = create_method("forest_grf",args=list(tune.parameters = "all"))
# aipw = DML_aipw(Y,W,X,ml_w=list(forest),ml_y=list(forest),cf=5)



GATE estimation

The pseudo-outcome can now be used to estimate different heterogeneous effects. We use standard regression models but by using the pseudo-outcome instead of a real outcome we model effect size and not outcome level.

Subgroup effect

First, let’s check a classic. Gender differences. This would usually be implemented by splitting the sample by gender and rerunning the whole analysis in the subsamples separately.

With the pseudo-outcome stored in aipw$ATE$delta this boils down to running an OLS regression with the male indicator as single regressor.

male = X[,7]
blp_male = lm_robust(aipw$ATE$delta ~ male)
summary(blp_male)

Call:
lm_robust(formula = aipw$ATE$delta ~ male)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
(Intercept)    10691       1303  8.2076 2.537e-16     8138    13245 9913
male            2899       2929  0.9895 3.224e-01    -2843     8641 9913

Multiple R-squared:  0.0001018 ,    Adjusted R-squared:  8.978e-07 
F-statistic: 0.9791 on 1 and 9913 DF,  p-value: 0.3224

We can interpret this outcome as we are used to, only that we model now effect size. This means the intercept gives us the average of the reference group (women) and the coefficient tells us how much higher the effect is for men. In this case, we find no significant gender differences in the effect of 401(k) participation on net wealth.

If you are interested in the gender specific effect instead of differences between groups, just run an OLS regression without constant and all group indicators:

female = 1-male
blp_male1 = lm_robust(aipw$ATE$delta ~ 0 + female + male)
summary(blp_male1)

Call:
lm_robust(formula = aipw$ATE$delta ~ 0 + female + male)

Standard error type:  HC2 

Coefficients:
       Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
female    10691       1303   8.208 2.537e-16     8138    13245 9913
male      13590       2624   5.180 2.267e-07     8447    18733 9913

Multiple R-squared:  0.009451 , Adjusted R-squared:  0.009251 
F-statistic:  47.1 on 2 and 9913 DF,  p-value: < 2.2e-16

You see that we can transfer all the strategies that we know about modeling outcomes with OLS for modelling causal effects.


Best linear prediction

Maybe we do not want to focus on subgroup analyses but to model the effect using all main effects at our disposal. In standard OLS this would mean to include a lot of interaction effects while completely relying on correct specification of the outcome model.

Using the pseudo-outcome allows us to be completely agnostic about the outcome model and to receive a nice summary of the underlying effect heterogeneity in a familiar format, an OLS output:

blp = lm_robust(aipw$ATE$delta ~ X)
summary(blp)

Call:
lm_robust(formula = aipw$ATE$delta ~ X)

Standard error type:  HC2 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    CI Lower   CI Upper   DF
(Intercept) -16173.07643  8508.1126 -1.9009 0.057344 -32850.7088   504.5559 9904
Xage           302.55404   105.7100  2.8621 0.004217     95.3409   509.7672 9904
Xdb           5655.01827  2873.3302  1.9681 0.049084     22.7063 11287.3302 9904
Xeduc          438.08990   614.0359  0.7135 0.475578   -765.5455  1641.7253 9904
Xfsize         121.76803   803.8349  0.1515 0.879597  -1453.9119  1697.4480 9904
Xhown         5737.88595  2162.7697  2.6530 0.007990   1498.4171  9977.3548 9904
Xinc             0.04715     0.1977  0.2385 0.811532     -0.3404     0.4347 9904
Xmale         5347.95842  3097.7852  1.7264 0.084310   -724.3310 11420.2479 9904
Xmarr         1277.05730  3572.4932  0.3575 0.720748  -5725.7565  8279.8711 9904
Xpira        -1316.68641  3773.1158 -0.3490 0.727123  -8712.7613  6079.3885 9904
Xtwoearn      1097.38011  4751.4033  0.2310 0.817351  -8216.3374 10411.0976 9904

Multiple R-squared:  0.003032 , Adjusted R-squared:  0.002025 
F-statistic: 5.272 on 10 and 9904 DF,  p-value: 0.00000008826

For example, we see that, all other regressors held constant, being one year older increases the effect of 401(k) participation of wealth on average by $ 303.

Again we realize that everything we learned about OLS for modelling outcomes directly translates to modeling effect sizes.


Non-parametric heterogeneity

The imho coolest thing about having the pseudo-outcome is that we can also estimate heterogeneous effects with nonparametric regressions. This means we are not only agnostic about the outcome and propensity score models but also about the functional of effect heterogeneity.

This is especially useful if we have some continuous variable like age for which we want to understand effect heterogeneity.


Spline regression

The spline_cate function implements spline regression reusing the pseudo-outcome:

age = X[,1]
sr_age = spline_cate(aipw$ATE$delta,age)
plot(sr_age,z_label = "Age")


Kernel regression

The kr_cate function implements kernel regression reusing the pseudo-outcome:

kr_age = kr_cate(aipw$ATE$delta,age)
plot(kr_age,z_label = "Age")

Both nonparametric approaches document the same pattern. The effect of 401(k) participation on net wealth increases more or less linearly until the age of 50 and then slightly drops.

For me this is an incredible new option for our policy evaluation toolbox.



---
title: "Causal ML: Double ML for group average treatment effects"
subtitle: "Application notebook"
author: "Michael Knaus"
date: "`r format(Sys.time(), '%m/%y')`"
output: 
  html_notebook:
    toc: true
    toc_float: true
    code_folding: show
---


Goals:

- Estimate subgroup effects

- Estimate best linear predictor of heterogeneity 

- Estimate nonparametric heterogeneity with kernel and spline regression

<br>

# Introducing the data

The Application Notebook builds on the dataset that is kindly provided in the `hdm` package. The data was used in [Chernozhukov and Hansen (2004)](https://direct.mit.edu/rest/article/86/3/735/57586/The-Effects-of-401-K-Participation-on-the-Wealth). Their paper investigates the effect of participation in the employer-sponsored 401(k) retirement savings plan (`p401`) on net assets (`net_tfa`). Since then the data was used to showcase many new methods. It is not the most comprehensive dataset with basically ten covariates/regressors/predictors:

- *age*: age

- *db*: defined benefit pension

- *educ*: education (in years)

- *fsize*: family size

- *hown*: home owner

- *inc*: income (in US $)

- *male*: male

- *marr*: married

- *pira*: participation in individual retirement account (IRA)

- *twoearn*: two earners

However, it is publicly available and the relatively few covariates ensure that the programs do not run too long.

```{r, warning = F, message = F}
# To install the causalDML package uncomment the following two lines
# library(devtools)
# install_github(repo="MCKnaus/causalDML")

# Load the packages required for later
library(hdm)
library(tidyverse)
library(causalDML)
library(grf)
library(estimatr)


set.seed(1234) # for replicability
options(scipen = 10) # Switch off scientific notation

data(pension)
# Outcome
Y = pension$net_tfa
# Treatment
W = pension$p401
# Create main effects matrix
X = model.matrix(~ 0 + age + db + educ + fsize + hown + inc + male + marr + pira + twoearn, data = pension)
```

<br>
<br>

# Double ML for AIPW with `causalDML` package

Like previous notebooks, we create the pseudo-outcome
$$\tilde{Y}_{ATE} = \underbrace{\hat{m}(1,X) - \hat{m}(0,X)}_{\text{outcome predictions}} + \underbrace{\frac{W (Y - \hat{m}(1,X))}{\hat{e}(X)} - \frac{(1-W) (Y - \hat{m}(0,X))}{1-\hat{e}(X)}}_{\text{weighted residuals}}$$

by running the `DML_aipw` function:

```{r}
# 5-fold cross-fitting with causalDML package
aipw = DML_aipw(Y,W,X)
summary(aipw$ATE)

# If you have more time, tune the forest
# forest = create_method("forest_grf",args=list(tune.parameters = "all"))
# aipw = DML_aipw(Y,W,X,ml_w=list(forest),ml_y=list(forest),cf=5)
```

<br>
<br>


# GATE estimation

The pseudo-outcome can now be used to estimate different heterogeneous effects. We use standard regression models but by using the pseudo-outcome instead of a real outcome we model effect size and not outcome level.

## Subgroup effect

First, let's check a classic. Gender differences. This would usually be implemented by splitting the sample by gender and rerunning the whole analysis in the subsamples separately.

With the pseudo-outcome stored in `aipw$ATE$delta` this boils down to running an OLS regression with the `male` indicator as single regressor.

```{r}
male = X[,7]
blp_male = lm_robust(aipw$ATE$delta ~ male)
summary(blp_male)
```

We can interpret this outcome as we are used to, only that we model now effect size. This means the intercept gives us the average of the reference group (women) and the coefficient tells us how much higher the effect is for men. In this case, we find no significant gender differences in the effect of 401(k) participation on net wealth.

If you are interested in the gender specific effect instead of differences between groups, just run an OLS regression without constant and all group indicators:

```{r}
female = 1-male
blp_male1 = lm_robust(aipw$ATE$delta ~ 0 + female + male)
summary(blp_male1)
```

You see that we can transfer all the strategies that we know about modeling outcomes with OLS for modelling causal effects.

<br>

## Best linear prediction

Maybe we do not want to focus on subgroup analyses but to model the effect using all main effects at our disposal. In standard OLS this would mean to include a lot of interaction effects while completely relying on correct specification of the outcome model.

Using the pseudo-outcome allows us to be completely agnostic about the outcome model and to receive a nice summary of the underlying effect heterogeneity in a familiar format, an OLS output:

```{r}
blp = lm_robust(aipw$ATE$delta ~ X)
summary(blp)
```

For example, we see that, all other regressors held constant, being one year older increases the effect of 401(k) participation of wealth on average by $ `r round(blp$coefficients[2])`.

Again we realize that everything we learned about OLS for modelling outcomes directly translates to modeling effect sizes.


<br>


## Non-parametric heterogeneity

The imho coolest thing about having the pseudo-outcome is that we can also estimate heterogeneous effects with nonparametric regressions. This means we are not only agnostic about the outcome and propensity score models but also about the functional of effect heterogeneity. 

This is especially useful if we have some continuous variable like age for which we want to understand effect heterogeneity.

<br>

### Spline regression

The `spline_cate` function implements spline regression reusing the pseudo-outcome:

```{r, results='hide'}
age = X[,1]
sr_age = spline_cate(aipw$ATE$delta,age)
```

```{r}
plot(sr_age,z_label = "Age")
```

<br>

### Kernel regression

The `kr_cate` function implements kernel regression reusing the pseudo-outcome:

```{r, results='hide'}
kr_age = kr_cate(aipw$ATE$delta,age)
```

```{r}
plot(kr_age,z_label = "Age")
```

Both nonparametric approaches document the same pattern. The effect of 401(k) participation on net wealth increases more or less linearly until the age of 50 and then slightly drops.

For me this is an incredible new option for our policy evaluation toolbox.

<br>
<br>