These are the goals of today: implement and compare different CATE estimators (S-learner, T-learner, Causal Tree, Causal Forest, R-learner, DR-learner) and validate the estimated effect heterogeneity.
We again use the data of the hdm package. The data was used in Chernozhukov and Hansen (2004). Their paper investigates the effect of participation in the employer-sponsored 401(k) retirement savings plan (p401) on net assets (net_tfa). Since then, the data has been used to showcase many new methods. It is not the most comprehensive dataset, with basically ten covariates/regressors/predictors:
age: age
db: defined benefit pension
educ: education (in years)
fsize: family size
hown: home owner
inc: income (in US $)
male: male
marr: married
pira: participation in individual retirement account (IRA)
twoearn: two earners
However, it is publicly available, and the small number of controls ensures that the programs do not run as long as they would with the richer datasets you hopefully have for your own applications.
library(hdm)
library(grf)
library(tidyverse)
library(psych)
library(estimatr)
# library(devtools)
# install_github("susanathey/causalTree")
library(causalTree)
# library(devtools)
# install_github(repo="MCKnaus/causalDML")
library(causalDML)
set.seed(1234) # for replicability
# Get data
data(pension)
# Outcome
Y = pension$net_tfa
# Treatment
W = pension$p401
# Instrument (401(k) eligibility)
Z = pension$e401
# Create main effects matrix
X = model.matrix(~ 0 + age + db + educ + fsize + hown + inc + male + marr + pira + twoearn, data = pension)
We want to predict effects of 401(k) participation on net assets using the following estimators:
S-learner (to be defined)
T-learner
Causal Tree
Causal Forest
R-learner
DR-learner
We did not discuss it in class, but another straightforward way of getting CATEs is to model \(E[Y|W,X]\) and then evaluate the fitted model at \(W=1\) and \(W=0\) to form \(\hat{\tau}(x) = \hat{E}[Y|1,x] - \hat{E}[Y|0,x]\). This is the S-learner:
# Stack treatment and covariates and fit a single regression forest (S-learner)
WX = cbind(W,X)
rf = regression_forest(WX,Y)
# Predict with the treatment set to 0 and to 1 for everybody and take the difference
W0X = cbind(rep(0,length(Y)),X)
W1X = cbind(rep(1,length(Y)),X)
cate_sl = predict(rf,W1X)$predictions - predict(rf,W0X)$predictions
hist(cate_sl)
For the T-learner, we implement the following procedure:
Use an ML estimator of your choice to fit the model \(\hat{m}(1,X)\) in the treated
Use an ML estimator of your choice to fit the model \(\hat{m}(0,X)\) in the controls
Estimate the CATE as \(\hat{\tau}(x) = \hat{m}(1,x) - \hat{m}(0,x)\)
# Fit separate regression forests in the treated and in the control sample
rfm1 = regression_forest(X[W==1,],Y[W==1])
rfm0 = regression_forest(X[W==0,],Y[W==0])
# Predict both potential outcomes for everybody and take the difference
cate_tl = predict(rfm1, X)$predictions - predict(rfm0, X)$predictions
hist(cate_tl)
Let’s move on with the Causal Tree, which is implemented in the causalTree package and function. The package is built for experiments, so we do not control for confounding here. However, if the data were experimental, this is what you would do to implement the method:
# Prepare data frame
df = data.frame(X,Y)
# Implement causalTree adapting the specification from the R example
ctree = causalTree(Y~X, data = df, treatment = W,
split.Rule = "CT", cv.option = "CT", split.Honest = T,split.Bucket = F, xval = 5,
cp = 0, minsize = 20)
# Select the cp value with the lowest cross-validated error and prune the tree accordingly
opcp = ctree$cptable[,1][which.min(ctree$cptable[,4])]
opfit = prune(ctree, opcp)
rpart.plot(opfit)
We see in the parent node that the “treatment effect” of $27k is much larger than what we get when controlling for confounding, where the effects ranged from $11k to $15k. We also do not know whether the splits are driven by effect heterogeneity or by heterogeneous selection bias.
However, if we ignore confounding for the sake of the argument, the tree shows substantial heterogeneity regarding age and some other variables.
Given the obvious selection bias, we do not form predictions for causal trees.
Causal Forests are implemented most conveniently with the causal_forest function of the grf package. If we use the predict function on the created object, we get out-of-bag estimated IATEs for every individual in our sample:
cf = causal_forest(X, Y, W)
cate_cf = predict(cf)$predictions
hist(cate_cf)
Now let’s see what the Causal Forest does in the background. To this end, we want to “manually” replicate the predicted effect of individual one, who has a predicted effect of 5292.817192 and the following characteristics:
X[1,]
age db educ fsize hown inc male marr pira twoearn
31 0 12 5 1 28146 0 1 0 0
Recall that the causal forest estimates CATEs using this formula: \[ \begin{equation} \hat{\tau}^{cf}(x) = argmin_{\tau} \left\{ \sum_{i=1}^N \alpha_i(x) \left[(Y_i - \hat{m}(X_i)) - \tau(x)(W_i - \hat{e}(X_i)) \right]^2 \right\}, \end{equation} \] where \(\alpha_i(x)\) are \(x\)-specific weights.
We can implement this ourselves because the causal forest saves all components that were used in the process.
# Out-of-bag nuisance predictions stored by the causal forest
mhat = cf$Y.hat
ehat = cf$W.hat
# Extract the forest weights of individual one with get_forest_weights
alphax = get_forest_weights(cf)[1,]
hist(as.numeric(alphax))
Yres = Y - mhat
Wres = W - ehat
# Weighted residual-on-residual regression without a constant
manual_cate = lm(Yres ~ 0 + Wres,weights = as.numeric(alphax))$coefficients
manual_cate
Wres
5306.271
Let’s check whether this is equal to the causal_forest
prediction:
all.equal(as.numeric(cate_cf[1]),as.numeric(manual_cate))
[1] "Mean relative difference: 0.002541852"
Nope :-(
What did we miss?
The causal_forest runs the weighted RORR (residual-on-residual regression) with a constant. This constant should asymptotically be zero, but in finite samples it is not:
Yres = Y - mhat
Wres = W - ehat
manual_cate_const = lm(Yres ~ Wres,weights = as.numeric(alphax))
summary(manual_cate_const)
Call:
lm(formula = Yres ~ Wres, weights = as.numeric(alphax))
Weighted Residuals:
Min 1Q Median 3Q Max
-4662 0 0 0 3353
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -604.9 187.4 -3.228 0.00126 **
Wres 5292.8 479.2 11.046 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 187.4 on 3151 degrees of freedom
Multiple R-squared: 0.03728, Adjusted R-squared: 0.03697
F-statistic: 122 on 1 and 3151 DF, p-value: < 2.2e-16
With the constant included, the results from the package and from our “manually” coded version coincide:
all.equal(as.numeric(cate_cf[1]),as.numeric(manual_cate_const$coefficients[2]))
[1] TRUE
After all, it boils down to running OLS…
To illustrate how the R-learner can be implemented in a similar way, we assume a linear CATE model and solve \[ \begin{equation} \hat{\beta}^{rl} = argmin_{\beta} \sum_{i=1}^N \big(Y_i - \hat{m}(X_i) - X_i^* \beta \big)^2 \end{equation} \] To do so, we recycle the nuisance parameters from above and just need to define the modified/pseudo-covariates \(X^* = X(W - \hat{e}(X))\):
# Create residuals
res_y = Y-mhat
res_w = W-ehat
# Modify covariates (multiply each column including constant with residual)
n = length(Y)
X_wc = cbind(rep(1,n),X)
Xstar = X_wc * res_w
# Regress outcome residual on modified covariates
rl_ols = lm(res_y ~ 0 + Xstar)
summary(rl_ols)
Call:
lm(formula = res_y ~ 0 + Xstar)
Residuals:
Min 1Q Median 3Q Max
-507194 -10540 -1235 2272 1439820
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Xstar -1.614e+04 1.081e+04 -1.493 0.1354
Xstarage 3.173e+02 1.469e+02 2.160 0.0308 *
Xstardb 5.268e+03 2.904e+03 1.814 0.0697 .
Xstareduc 4.398e+02 5.561e+02 0.791 0.4291
Xstarfsize -8.555e+01 1.098e+03 -0.078 0.9379
Xstarhown 6.090e+03 3.259e+03 1.868 0.0617 .
Xstarinc 6.168e-02 6.567e-02 0.939 0.3476
Xstarmale 5.816e+03 3.631e+03 1.602 0.1092
Xstarmarr 3.476e+03 4.600e+03 0.756 0.4499
Xstarpira -9.279e+02 3.128e+03 -0.297 0.7667
Xstartwoearn -2.412e+03 3.658e+03 -0.659 0.5097
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 53830 on 9904 degrees of freedom
Multiple R-squared: 0.01336, Adjusted R-squared: 0.01226
F-statistic: 12.19 on 11 and 9904 DF, p-value: < 2.2e-16
Now we take the coefficients of this model to get the fitted values of the CATEs as \(X \beta\) (note that fitted values need to be calculated using \(X\), not \(X^*\); do not repeat a mistake I made in the beginning!):
cate_rl_ols = X_wc %*% rl_ols$coefficients
hist(cate_rl_ols)
The more generic alternative is to use the unmodified covariates in a weighted regression with pseudo outcomes: \[\hat{\tau}^{rl}(x) = argmin_{\tau} \sum_{i=1}^N \underbrace{(W_i - \hat{e}(X_i))^2}_{\text{weight}} \left(\underbrace{\frac{Y_i - \hat{m}(X_i)}{W_i - \hat{e}(X_i)}}_{\text{pseudo-outcome}} - X_i\beta \right)^2\]
For the sake of the argument, we run it again with OLS:
# Create pseudo-outcome (outcome res divided by treatment res)
pseudo_rl = res_y / res_w
# Create weights
weights_rl = res_w^2
# Weighted regression of pseudo-outcome on covariates
rols_fit = lm(pseudo_rl ~ X, weights=weights_rl)
summary(rols_fit)
Call:
lm(formula = pseudo_rl ~ X, weights = weights_rl)
Weighted Residuals:
Min 1Q Median 3Q Max
-1217226 -6108 -72 6672 1439820
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.614e+04 1.081e+04 -1.493 0.1354
Xage 3.173e+02 1.469e+02 2.160 0.0308 *
Xdb 5.268e+03 2.904e+03 1.814 0.0697 .
Xeduc 4.398e+02 5.561e+02 0.791 0.4291
Xfsize -8.555e+01 1.098e+03 -0.078 0.9379
Xhown 6.090e+03 3.259e+03 1.868 0.0617 .
Xinc 6.168e-02 6.567e-02 0.939 0.3476
Xmale 5.816e+03 3.631e+03 1.602 0.1092
Xmarr 3.476e+03 4.600e+03 0.756 0.4499
Xpira -9.279e+02 3.128e+03 -0.297 0.7667
Xtwoearn -2.412e+03 3.658e+03 -0.659 0.5097
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 53830 on 9904 degrees of freedom
Multiple R-squared: 0.00233, Adjusted R-squared: 0.001322
F-statistic: 2.313 on 10 and 9904 DF, p-value: 0.01034
and observe that both implementations deliver identical results:
all.equal(as.numeric(rl_ols$coefficients),as.numeric(rols_fit$coefficients))
[1] TRUE
This was just to illustrate that both representations provide identical results when solved with a method without random components. In practice, we can apply anything that solves the (weighted) minimization problem.
Going beyond pure illustration, we now implement the R-learner using a random forest:
# Weighted regression with RF
rrf_fit = regression_forest(X,pseudo_rl, sample.weights = weights_rl)
cate_rl_rf = predict(rrf_fit)$predictions
hist(cate_rl_rf)
Finally, run the DR-learner that creates the pseudo-outcome in a first step \[ \begin{align} \tilde{Y}_{ATE} = \underbrace{\hat{m}(1,X) - \hat{m}(0,X)}_{\text{outcome predictions}} + \underbrace{\frac{W (Y - \hat{m}(1,X))}{\hat{e}(X)} - \frac{(1-W) (Y - \hat{m}(0,X))}{1-\hat{e}(X)}}_{\text{weighted residuals}} \nonumber \end{align} \]
and uses it in a final regression step.
# Containers for the cross-fitted outcome predictions under control and treatment
mwhat0 = mwhat1 = rep(NA,length(Y))
# Outcome model in the control sample: out-of-bag predictions for controls, regular predictions for treated
rfm0 = regression_forest(X[W==0,],Y[W==0])
mwhat0[W==0] = predict(rfm0)$predictions
mwhat0[W==1] = predict(rfm0,X[W==1,])$predictions
# Outcome model in the treated sample: out-of-bag predictions for treated, regular predictions for controls
rfm1 = regression_forest(X[W==1,],Y[W==1])
mwhat1[W==1] = predict(rfm1)$predictions
mwhat1[W==0] = predict(rfm1,X[W==0,])$predictions
# Construct the doubly robust pseudo-outcome
Y_tilde = mwhat1 - mwhat0 + W * (Y - mwhat1) / ehat - (1 - W) * (Y - mwhat0) / (1-ehat)
cate_dr = predict(regression_forest(X,Y_tilde))$predictions
hist(cate_dr)
Plot the results against each other to see that they are correlated, but far from producing the same predictions.
# Store and plot predictions
results = cbind(cate_sl,cate_tl,cate_cf,cate_rl_rf,cate_dr)
colnames(results) = c("S-learner","T-learner","Causal Forest","R-learner","DR-learner")
pairs.panels(results,method = "pearson")
describe(results)
Implement the R-learner with Lasso and different dictionaries, or with any other method that solves a weighted least squares problem (a starting-point sketch follows below).
Implement the DR-learner with the dr_learner() function.
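As a starting point for the first exercise, here is a minimal sketch of the R-learner with Lasso via the glmnet package (an assumption, since glmnet is not loaded above). It reuses the pseudo-outcome and weights created earlier and uses the main effects as dictionary:
# Sketch only: assumes the glmnet package is installed
library(glmnet)
# Weighted Lasso of the R-learner pseudo-outcome on the (unmodified) covariates
rl_lasso = cv.glmnet(X, pseudo_rl, weights = weights_rl)
# CATE predictions are the fitted values at the original covariates X (not X*)
cate_rl_lasso = predict(rl_lasso, newx = X, s = "lambda.min")
hist(as.numeric(cate_rl_lasso))
A richer dictionary could be passed instead of X, e.g. model.matrix(~ 0 + .^2, data = as.data.frame(X)) to add all first-order interactions.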
To illustrate how the ideas for validating experiments generalize to the unconfoundedness setting, create a 50:50 sample split:
# Determine the number of rows in X
n_rows <- nrow(X)
# Generate a random vector of indices
indices <- sample(1:n_rows, size = 0.5*n_rows)
# Split the data
X_train <- X[indices,]
X_test <- X[-indices,]
W_train <- W[indices]
W_test <- W[-indices]
Y_train <- Y[indices]
Y_test <- Y[-indices]
Run causal forest in the training sample:
CF = causal_forest(X_train,Y_train,W_train,tune.parameters = "all")
cates = predict(CF,X_test)$predictions
hist(cates)
Disclaimer: Note that the following validation of heterogeneous effect estimates with observational data results from my current reading of the Generic ML paper but has not yet been rigorously assessed in a scientific study.
Get the pseudo-outcome in the test sample:
# Estimate the doubly robust (AIPW) scores in the test sample
aipw_test = DML_aipw(Y_test,W_test,X_test)
pseudoY = aipw_test$ATE$delta
and run the BLP as described in the slides:
demeaned_cates = cates - mean(cates)
lm_blp = lm_robust(pseudoY ~ demeaned_cates)
summary(lm_blp)
Call:
lm_robust(formula = pseudoY ~ demeaned_cates)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 11535.700 1490.5102 7.739 1.202e-14 8613.6404 14457.760 4956
demeaned_cates 1.243 0.5167 2.406 1.614e-02 0.2305 2.256 4956
Multiple R-squared: 0.002278 , Adjusted R-squared: 0.002077
F-statistic: 5.791 on 1 and 4956 DF, p-value: 0.01614
This particular training and test split seems to detect systematic effect heterogeneity.
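The result can depend on the particular split. In the spirit of the Generic ML paper, one could repeat the split several times and look at, e.g., the median slope coefficient. The following is only a rough sketch of this idea (not the formal aggregation of the paper), reusing causal_forest, DML_aipw, and lm_robust from above; it is computationally demanding, so only a handful of repetitions are used:
# Rough robustness sketch: repeat the 50:50 split and the BLP regression a few times
n_rep = 5  # small number of repetitions for illustration
blp_slopes = rep(NA, n_rep)
for (r in 1:n_rep) {
  idx = sample(1:nrow(X), size = floor(0.5 * nrow(X)))
  cf_r = causal_forest(X[idx,], Y[idx], W[idx])                      # train causal forest on one half
  cates_r = predict(cf_r, X[-idx,])$predictions                      # predict CATEs in the other half
  aipw_r = DML_aipw(Y[-idx], W[-idx], X[-idx,])                      # pseudo-outcomes in the test half
  blp_r = lm_robust(aipw_r$ATE$delta ~ I(cates_r - mean(cates_r)))   # BLP regression
  blp_slopes[r] = blp_r$coefficients[2]
}
median(blp_slopes)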
Additionally, we run a GATES analysis as described in the slides with K=5. First we create the slices, then we regress the pseudo-outcome on the group indicators:
K = 5
# Assign each observation to one of K groups according to the quintile of its predicted CATE
slices = factor(as.numeric(cut(cates, breaks=quantile(cates, probs=seq(0,1, length = K+1)),include.lowest=TRUE)))
G_ind = model.matrix(~0+slices)
# GATES regression without constant: coefficients are the group average effects
GATES_woc = lm_robust(pseudoY ~ 0 + G_ind)
# Print results
summary(GATES_woc)
Call:
lm_robust(formula = pseudoY ~ 0 + G_ind)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
G_indslices1 3093 1783 1.735 8.282e-02 -402.1 6588 4953
G_indslices2 6921 1700 4.070 4.770e-05 3587.5 10255 4953
G_indslices3 9511 2144 4.436 9.360e-06 5307.9 13714 4953
G_indslices4 17196 4126 4.168 3.123e-05 9108.3 25284 4953
G_indslices5 20958 5271 3.976 7.095e-05 10625.5 31291 4953
Multiple R-squared: 0.01581 , Adjusted R-squared: 0.01482
F-statistic: 14.49 on 5 and 4953 DF, p-value: 4.046e-14
# Plot results
se = GATES_woc$std.error
data.frame(Variable = paste("Group",1:K),
Coefficient = GATES_woc$coefficients,
cil = GATES_woc$coefficients - 1.96*se,
ciu = GATES_woc$coefficients + 1.96*se) %>%
ggplot(aes(x=Variable,y=Coefficient,ymin=cil,ymax=ciu)) + geom_point(size=2.5) +
geom_errorbar(width=0.15) + geom_hline(yintercept=0) + geom_hline(yintercept = lm_blp$coefficients[1], linetype = "dashed")
We see a clear monotonic relationship that adds to the BLP evidence that the causal forest detected systematic heterogeneity.
Finally, we test \(H_0:\gamma_5 - \gamma_1 = 0\) by adding the constant to the GATES regression such that Group 1 becomes the reference group:
GATES_wc = lm_robust(pseudoY ~ G_ind[,-1])
summary(GATES_wc)
Call:
lm_robust(formula = pseudoY ~ G_ind[, -1])
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 3093 1783 1.735 0.082824 -402.1 6588 4953
G_ind[, -1]slices2 3828 2464 1.554 0.120274 -1001.6 8658 4953
G_ind[, -1]slices3 6418 2788 2.302 0.021387 951.9 11885 4953
G_ind[, -1]slices4 14103 4494 3.138 0.001711 5292.6 22914 4953
G_ind[, -1]slices5 17865 5564 3.211 0.001332 6957.6 28773 4953
Multiple R-squared: 0.003943 , Adjusted R-squared: 0.003139
F-statistic: 4.785 on 4 and 4953 DF, p-value: 0.0007482
Given the clear pattern in the graph, it is not surprising that we can reject the null, suggesting once more that the causal forest finds systematic heterogeneity.
After finding evidence that the heterogeneity is systematic, we want to understand which covariates are most predictive of the effect sizes. This is done by comparing covariate means across the sorted groups:
# Regress each covariate on the group dummies to compare means across the sorted groups
for (i in 1:ncol(X_test)) {
print(colnames(X_test)[i])
print(summary(lm_robust(X_test[,i] ~ slices)))
}
[1] "age"
Call:
lm_robust(formula = X_test[, i] ~ slices)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 31.685 0.1570 201.79 0.000e+00 31.378 31.99 4953
slices2 5.082 0.3098 16.40 6.201e-59 4.475 5.69 4953
slices3 11.175 0.3385 33.02 2.806e-216 10.512 11.84 4953
slices4 15.437 0.3446 44.79 0.000e+00 14.761 16.11 4953
slices5 15.578 0.3166 49.20 0.000e+00 14.957 16.20 4953
Multiple R-squared: 0.3441 , Adjusted R-squared: 0.3436
F-statistic: 980.1 on 4 and 4953 DF, p-value: < 2.2e-16
[1] "db"
Call:
lm_robust(formula = X_test[, i] ~ slices)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.04234 0.006396 6.619 3.996e-11 0.02980 0.05488 4953
slices2 0.09086 0.012551 7.239 5.213e-13 0.06625 0.11547 4953
slices3 0.14617 0.013974 10.460 2.418e-25 0.11877 0.17356 4953
slices4 0.31185 0.016491 18.910 4.632e-77 0.27952 0.34418 4953
slices5 0.60081 0.016508 36.395 3.126e-257 0.56844 0.63317 4953
Multiple R-squared: 0.2256 , Adjusted R-squared: 0.225
F-statistic: 380.2 on 4 and 4953 DF, p-value: < 2.2e-16
[1] "educ"
Call:
lm_robust(formula = X_test[, i] ~ slices)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 13.4768 0.08795 153.241 0.000e+00 13.3044 13.6492 4953
slices2 -0.7341 0.12446 -5.898 3.912e-09 -0.9781 -0.4901 4953
slices3 -1.0544 0.12691 -8.309 1.238e-16 -1.3032 -0.8056 4953
slices4 -0.7503 0.12120 -6.191 6.480e-10 -0.9879 -0.5127 4953
slices5 0.9970 0.11835 8.424 4.705e-17 0.7650 1.2290 4953
Multiple R-squared: 0.06924 , Adjusted R-squared: 0.06848
F-statistic: 99.88 on 4 and 4953 DF, p-value: < 2.2e-16
[1] "fsize"
Call:
lm_robust(formula = X_test[, i] ~ slices)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 2.61190 0.04955 52.7130 0.000e+00 2.5148 2.7090 4953
slices2 0.46984 0.07185 6.5390 6.817e-11 0.3290 0.6107 4953
slices3 0.54738 0.07207 7.5949 3.659e-14 0.4061 0.6887 4953
slices4 0.24078 0.06710 3.5884 3.359e-04 0.1092 0.3723 4953
slices5 0.02823 0.06713 0.4205 6.742e-01 -0.1034 0.1598 4953
Multiple R-squared: 0.02046 , Adjusted R-squared: 0.01967
F-statistic: 24.83 on 4 and 4953 DF, p-value: < 2.2e-16
[1] "hown"
Call:
lm_robust(formula = X_test[, i] ~ slices)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.1512 0.01138 13.29 1.303e-39 0.1289 0.1735 4953
slices2 0.3553 0.01954 18.18 1.471e-71 0.3170 0.3937 4953
slices3 0.5867 0.01802 32.56 6.608e-211 0.5514 0.6220 4953
slices4 0.7126 0.01576 45.21 0.000e+00 0.6817 0.7435 4953
slices5 0.7641 0.01441 53.02 0.000e+00 0.7359 0.7924 4953
Multiple R-squared: 0.3384 , Adjusted R-squared: 0.3378
F-statistic: 821.8 on 4 and 4953 DF, p-value: < 2.2e-16
[1] "inc"
Call:
lm_robust(formula = X_test[, i] ~ slices)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 27442.3 792.4 34.630 1.626e-235 25889 28996 4953
slices2 689.8 975.8 0.707 4.796e-01 -1223 2603 4953
slices3 4561.9 1012.7 4.505 6.804e-06 2576 6547 4953
slices4 10919.7 1022.8 10.676 2.543e-26 8915 12925 4953
slices5 32895.9 1141.8 28.810 6.919e-169 30657 35134 4953
Multiple R-squared: 0.2344 , Adjusted R-squared: 0.2338
F-statistic: 308.5 on 4 and 4953 DF, p-value: < 2.2e-16
[1] "male"
Call:
lm_robust(formula = X_test[, i] ~ slices)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.235887 0.01349 17.4908 1.622e-66 0.20945 0.2623263 4953
slices2 -0.036089 0.01853 -1.9476 5.152e-02 -0.07242 0.0002388 4953
slices3 -0.030242 0.01862 -1.6241 1.044e-01 -0.06675 0.0062623 4953
slices4 -0.081498 0.01771 -4.6010 4.310e-06 -0.11622 -0.0467721 4953
slices5 -0.006048 0.01899 -0.3186 7.501e-01 -0.04327 0.0311743 4953
Multiple R-squared: 0.005101 , Adjusted R-squared: 0.004298
F-statistic: 6.987 on 4 and 4953 DF, p-value: 1.322e-05
[1] "marr"
Call:
lm_robust(formula = X_test[, i] ~ slices)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.4617 0.01584 29.154 1.363e-172 0.43065 0.4927 4953
slices2 0.1377 0.02221 6.200 6.119e-10 0.09416 0.1812 4953
slices3 0.1704 0.02203 7.732 1.273e-14 0.12717 0.2136 4953
slices4 0.2215 0.02167 10.221 2.775e-24 0.17898 0.2639 4953
slices5 0.1935 0.02188 8.846 1.243e-18 0.15065 0.2364 4953
Multiple R-squared: 0.02508 , Adjusted R-squared: 0.02429
F-statistic: 30.92 on 4 and 4953 DF, p-value: < 2.2e-16
[1] "pira"
Call:
lm_robust(formula = X_test[, i] ~ slices)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.13810 0.01096 12.6013 7.339e-36 0.11662 0.15959 4953
slices2 -0.01399 0.01516 -0.9225 3.563e-01 -0.04371 0.01574 4953
slices3 0.01815 0.01591 1.1404 2.542e-01 -0.01305 0.04934 4953
slices4 0.15251 0.01812 8.4165 5.022e-17 0.11699 0.18803 4953
slices5 0.34375 0.01929 17.8214 6.573e-69 0.30594 0.38156 4953
Multiple R-squared: 0.1013 , Adjusted R-squared: 0.1005
F-statistic: 113.4 on 4 and 4953 DF, p-value: < 2.2e-16
[1] "twoearn"
Call:
lm_robust(formula = X_test[, i] ~ slices)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.2651 0.01402 18.908 4.769e-77 0.23763 0.2926 4953
slices2 0.0931 0.02071 4.496 7.083e-06 0.05251 0.1337 4953
slices3 0.1149 0.02084 5.514 3.684e-08 0.07406 0.1558 4953
slices4 0.1819 0.02113 8.611 9.644e-18 0.14049 0.2233 4953
slices5 0.1956 0.02115 9.247 3.364e-20 0.15410 0.2370 4953
Multiple R-squared: 0.02088 , Adjusted R-squared: 0.02009
F-statistic: 28.19 on 4 and 4953 DF, p-value: < 2.2e-16
The most and the least affected groups differ substantially along most dimensions (besides male and family size). A general pattern emerges indicating that older and higher-income persons have larger effects.
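For a more compact overview of these differences, here is a minimal sketch that tabulates the covariate means by sorted group using base R’s aggregate() (point estimates only, without the standard errors from the regressions above):
# Covariate means within each sorted effect group (columns = groups 1 to 5)
clan_means = aggregate(as.data.frame(X_test), by = list(group = slices), FUN = mean)
round(t(clan_means[,-1]), 2)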