Goals:
Illustrate Double Selection
Illustrate how single selection produces biased estimates and invalid inference
Illustrate how even Double Selection fails with dense DGPs
Consider the following DGP:
\(p=100\) covariates drawn from a multivariate normal distribution: \(X \sim N(0,\Sigma)\), where \(\Sigma\) is a matrix with entries \(\Sigma_{jk}=0.7^{|j-k|}\)
The treatment model is \(W = 1 X_1 + 0.9 X_2 + ... + 0.2 X_{9} + 0.1 X_{10} + \epsilon\), where \(X_j\) is the \(j\)-th column of \(X\) and \(\epsilon \sim N(0,1)\)
The outcome model is \(Y = \theta W + 1 X_1 + 0.9 X_2 + ... + 0.2 X_{9} + 0.1 X_{10} + \varepsilon\), where \(X_j\) is the \(j\)-th column of \(X\) and \(\varepsilon \sim N(0,4^2)\)
This means we are in a sparse setting where the same few variables have the same impact on treatment and outcome, but the treatment equation is easier to estimate (lower noise level).
Furthermore, we set the treatment effect \(\theta = 0\), so the treatment is not effective. Plugging the treatment equation into the outcome equation therefore produces the reduced form \(Y = 1 X_1 + 0.9 X_2 + ... + 0.2 X_{9} + 0.1 X_{10} + \varepsilon\). In the notation of the slides we therefore have \(\beta = \pi\).
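To make the reduced-form claim concrete, here is a minimal numerical check (a sketch, not part of the original code): substituting the treatment equation into the outcome equation gives reduced-form coefficients \(\pi = \beta + \theta \delta\), which equal \(\beta\) when \(\theta = 0\).
# Minimal check of the reduced-form claim (sketch)
theta = 0
delta = c(seq(1, 0.1, -0.1), rep(0, 90))   # treatment coefficients
beta  = delta                              # outcome coefficients
pi_rf = beta + theta * delta               # reduced-form coefficients of Y on X
all.equal(pi_rf, beta)                     # TRUE because theta = 0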
if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE); library(tidyverse)
if (!require("hdm")) install.packages("hdm", dependencies = TRUE); library(hdm)
if (!require("mvtnorm")) install.packages("mvtnorm", dependencies = TRUE); library(mvtnorm)
if (!require("estimatr")) install.packages("estimatr", dependencies = TRUE); library(estimatr)
set.seed(1234)
n = 100
p = 100
n_rep = 1000
# Define and plot parameters
theta = 0
delta = c(seq(1,0.1,-0.1),rep(0,p-10))
beta = delta
cov_mat = toeplitz(0.7^(0:(p - 1)))
plot(delta)
abline(h = 0)
plot(beta)
abline(h = 0)
Now, we get a first draw and check which variables are selected when running Post-Lasso on the outcome…
# Generate one draw
x = rmvnorm(n = n, mean = rep(0, p), sigma = cov_mat)
w = x %*% delta + rnorm(n,0,1)
y = theta*w + x %*% beta + rnorm(n,0,4)
# Select variables in outcome regression
sel_y = rlasso(x,y)
# Which variables are selected?
which(sel_y$beta != 0)
V1 V3 V4 V5 V6
1 3 4 5 6
and run the single-selection OLS:
# Run single-selection OLS
x_sel_y = x[,sel_y$beta != 0]
summary(lm_robust(y ~ w + x_sel_y))
Call:
lm_robust(formula = y ~ w + x_sel_y)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.2619 0.4036 0.6488 0.51805 -0.53960 1.0633 93
w 0.3503 0.2523 1.3882 0.16838 -0.15077 0.8513 93
x_sel_y1 1.2159 0.5842 2.0812 0.04016 0.05575 2.3760 93
x_sel_y2 -0.4641 0.8097 -0.5731 0.56793 -2.07201 1.1438 93
x_sel_y3 0.5912 0.6499 0.9098 0.36530 -0.69928 1.8817 93
x_sel_y4 0.7715 0.7379 1.0456 0.29845 -0.69375 2.2368 93
x_sel_y5 0.9052 0.4989 1.8143 0.07286 -0.08557 1.8959 93
Multiple R-squared: 0.4852 , Adjusted R-squared: 0.452
F-statistic: 18.17 on 6 and 93 DF, p-value: 7.541e-14
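A side note on the selection step above: rlasso() runs Post-Lasso by default (post = TRUE), i.e. the Lasso with its data-driven penalty is only used to select variables and the reported coefficients come from OLS on the selected set. For comparison with the plain (shrunken) Lasso fit, a sketch not contained in the original code:
# Plain Lasso instead of Post-Lasso; the selected set may differ slightly
sel_y_lasso = rlasso(x, y, post = FALSE)
which(sel_y_lasso$beta != 0)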
Next, let’s implement Double Selection manually by also selecting variables in the treatment regression…
# Select variables in treatment regression
sel_w = rlasso(x,w)
which(sel_w$beta != 0)
V1 V2 V3 V4 V5 V6 V8
1 2 3 4 5 6 8
and using the union of the two selected sets in an OLS (Double Selection). Note that the treatment regression additionally selected \(X_2\) and \(X_8\), which were missed by the outcome regression. The final estimate, however, is only marginally affected in this draw (not a general result):
# Double selection
x_sel_union = x[,sel_y$beta != 0 | sel_w$beta != 0]
summary(lm_robust(y ~ w + x_sel_union))
Call:
lm_robust(formula = y ~ w + x_sel_union)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.2618 0.4067 0.64377 0.52134 -0.54597 1.0696 91
w 0.3418 0.3176 1.07630 0.28464 -0.28900 0.9726 91
x_sel_union1 1.1982 0.6106 1.96229 0.05278 -0.01471 2.4110 91
x_sel_union2 0.0719 0.8634 0.08327 0.93382 -1.64320 1.7870 91
x_sel_union3 -0.4909 0.8927 -0.54991 0.58373 -2.26415 1.2823 91
x_sel_union4 0.6007 0.6984 0.86010 0.39200 -0.78654 1.9878 91
x_sel_union5 0.7708 0.7475 1.03121 0.30518 -0.71398 2.2556 91
x_sel_union6 0.9264 0.6037 1.53456 0.12836 -0.27277 2.1256 91
x_sel_union7 -0.0142 0.4533 -0.03132 0.97508 -0.91462 0.8862 91
Multiple R-squared: 0.4853 , Adjusted R-squared: 0.4401
F-statistic: 13.38 on 8 and 91 DF, p-value: 1.273e-12
Note that the coefficient on w is exactly the result that the rlassoEffect command of the hdm package provides. The only difference is that it uses slightly different robust standard errors:
ds = rlassoEffect(x,y,w)
summary(ds)
[1] "Estimates and significance testing of the effect of target variables"
Estimate. Std. Error t value Pr(>|t|)
d1 0.3418 0.3156 1.083 0.279
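As a quick sanity check (a sketch, not part of the original code), the two point estimates can be compared directly:
# Should return TRUE (up to numerical tolerance); only the standard errors differ
all.equal(unname(ds$alpha),
          unname(coef(lm_robust(y ~ w + x_sel_union))["w"]))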
Let’s now repeatedly draw 1000 samples and check the distribution of the resulting coefficients for single and Double Selection:
results = matrix(NA,n_rep,2)
colnames(results) = c("Single","Double")
for (i in 1:n_rep) {
x = rmvnorm(n = n, mean = rep(0, p), sigma = cov_mat)
w = x %*% delta + rnorm(n,0,1)
y = theta*w + x %*% beta + rnorm(n,0,4)
sel_y = rlasso(x,y)
x_sel_y = x[,sel_y$beta != 0]
results[i,1] = lm(y ~ w + x_sel_y)$coefficients[2]
results[i,2] = rlassoEffect(x,y,w)$alpha
}
as.data.frame(results) %>% pivot_longer(cols=everything(),names_to = "Selection",values_to = "coef") %>%
ggplot(aes(x = coef, fill = Selection)) + geom_density(alpha = 0.5) + geom_vline(xintercept = theta)
cat("Bias:\n")
Bias:
round(colMeans(results)-theta,4)
Single Double
0.2501 0.0133
cat("\nMSE:\n")
MSE:
round(colMeans((results-theta)^2),4)
Single Double
0.1386 0.1624
The single selection estimator is clearly biased, but its variance is smaller. This is reflected in the smaller MSE \(\Rightarrow\) the unbiasedness of Double Selection comes at the cost of more variance. However, given that the same variables are important in both equations, the probability of missing variables that are very important in the treatment equation is still relatively small. This is going to change in the next round.
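To see the bias/variance split behind these numbers, a small decomposition sketch (not part of the original code; var() uses the n-1 denominator, so MSE = bias^2 + variance holds only approximately):
# Decompose the simulated MSE into squared bias and variance
bias2 = (colMeans(results) - theta)^2
vars  = apply(results, 2, var)
round(rbind(bias2 = bias2, variance = vars, approx_mse = bias2 + vars), 4)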
We keep everything the same but reverse the order of the non-zero coefficients in the treatment equation:
The treatment model is \(W = 0.1 X_1 + 0.2 X_2 + ... + 0.9 X_{9} + 1 X_{10} + \epsilon\), where \(X_j\) is the \(j\)-th column of \(X\) and \(\epsilon \sim N(0,1)\)
The outcome model is \(Y = 1 X_1 + 0.9 X_2 + ... + 0.2 X_{9} + 0.1 X_{10} + \varepsilon\), where \(X_j\) is the \(j\)-th column of \(X\) and \(\varepsilon \sim N(0,4^2)\)
Now missing, e.g., \(X_{10}\) because of its small coefficient in the outcome equation should have more severe consequences, since \(X_{10}\) has the largest coefficient in the treatment equation.
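To see the selection problem directly, consider a single illustrative draw under the new DGP (a sketch, not part of the original code; the exact selected sets vary across draws):
# The outcome Lasso tends to drop X_10 (small outcome coefficient),
# while the treatment Lasso tends to keep it (large treatment coefficient)
delta_rev = c(seq(0.1, 1, 0.1), rep(0, p - 10))
beta_new  = c(seq(1, 0.1, -0.1), rep(0, p - 10))
x = rmvnorm(n = n, mean = rep(0, p), sigma = cov_mat)
w = x %*% delta_rev + rnorm(n, 0, 1)
y = theta * w + x %*% beta_new + rnorm(n, 0, 4)
which(rlasso(x, y)$beta != 0)  # often misses X_10
which(rlasso(x, w)$beta != 0)  # usually includes X_10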
# Define and plot parameters
delta = c(seq(0.1,1,0.1),rep(0,p-10))
beta = c(seq(1,0.1,-0.1),rep(0,p-10))
plot(delta)
abline(h = 0)
plot(beta)
abline(h = 0)
results = matrix(NA,n_rep,2)
colnames(results) = c("Single","Double")
for (i in 1:n_rep) {
x = rmvnorm(n = n, mean = rep(0, p), sigma = cov_mat)
w = x %*% delta + rnorm(n,0,1)
y = theta*w + x %*% beta + rnorm(n,0,4)
sel_y = rlasso(x,y)
x_sel_y = x[,sel_y$beta != 0]
results[i,1] = lm(y ~ w + x_sel_y)$coefficients[2]
results[i,2] = rlassoEffect(x,y,w)$alpha
}
as.data.frame(results) %>% pivot_longer(cols=everything(),names_to = "Selection",values_to = "coef") %>%
ggplot(aes(x = coef, fill = Selection)) + geom_density(alpha = 0.5) + geom_vline(xintercept = theta)
cat("Bias:\n")
Bias:
round(colMeans(results)-theta,4)
Single Double
0.1567 0.0343
cat("\nMSE:\n")
MSE:
round(colMeans((results-theta)^2),4)
Single Double
0.0528 0.1981
The MSE of single selection is only about one fourth of the MSE of double selection, driven by its visibly smaller variance. However, single selection is also clearly biased, while double selection shows negligible bias, and for our causal target parameter we care more about unbiasedness than about a small MSE.
This illustrates a crucial difference between supervised ML and causal inference. In the former, we are happy if something is predicted with a lower MSE, even if this means accepting some bias. In contrast, for causal inference we usually strive for an unbiased estimate of the causal target parameter, even if this means accepting a much higher variance \(\Rightarrow\) higher MSE.
Now let’s make life hard even for double selection. We keep the above DGP but now all variables have at least some impact \(\Rightarrow\) dense setting:
The treatment model is \(W = 0.1 X_1 + 0.2 X_2 + ... + 0.9 X_{9} + 1 X_{10} + 0.1 X_{11} + ... + 0.1 X_{100} + \epsilon\), where \(X_j\) is the \(j\)-th column of \(X\) and \(\epsilon \sim N(0,1)\)
The outcome model is \(Y = 1 X_1 + 0.9 X_2 + ... + 0.2 X_{9} + 0.1 X_{10} + 0.1 X_{11} + ... + 0.1 X_{100} + \varepsilon\), where \(X_j\) is the \(j\)-th column of \(X\) and \(\varepsilon \sim N(0,4^2)\)
This is a setting where approximate sparsity does not hold: the regression functions can no longer be approximated well by a small number of covariates relative to the sample size.
# Define and plot parameters
delta = c(seq(0.1,1,0.1),rep(0.1,p-10))
beta = c(seq(1,0.1,-0.1),rep(0.1,p-10))
plot(delta, ylim = c(0, 1))
abline(h = 0)
plot(beta, ylim = c(0, 1))
abline(h = 0)
results = matrix(NA,n_rep,2)
colnames(results) = c("Single","Double")
for (i in 1:n_rep) {
x = rmvnorm(n = n, mean = rep(0, p), sigma = cov_mat)
w = x %*% delta + rnorm(n,0,1)
y = theta*w + x %*% beta + rnorm(n,0,4)
sel_y = rlasso(x,y)
x_sel_y = x[,sel_y$beta != 0]
results[i,1] = lm(y ~ w + x_sel_y)$coefficients[2]
results[i,2] = rlassoEffect(x,y,w)$alpha
}
as.data.frame(results) %>% pivot_longer(cols=everything(),names_to = "Selection",values_to = "coef") %>%
ggplot(aes(x = coef, fill = Selection)) + geom_density(alpha = 0.5) + geom_vline(xintercept = theta)
cat("Bias:\n")
Bias:
round(colMeans(results)-theta,4)
Single Double
0.4543 0.5184
cat("\nMSE:\n")
MSE:
round(colMeans((results-theta)^2),4)
Single Double
0.2277 0.4234
Double Selection seems to break in the dense setting and becomes worse than single selection even in terms of bias. So should we go for single selection if we suspect that we are in a dense setting? Not if we are interested in doing statistical inference, as we will see in the next section.
The previous section looked at bias and MSE. Depending on how convinced you are that bias is bad, you might come to the conclusion that Double Selection is overkill and not really required. However, in the following we will see that statistical inference after single selection is very problematic.
To make the point, we run a simulation to check the coverage rates of the confidence intervals for single and double selection. The coverage rate calculates how often the true value is included in the confidence interval (see the ring toss analogy as an intuitive refresher).
The coverage rate can be used to evaluate the quality of the different standard errors. We would like to have nominal coverage, i.e. a 95% confidence interval should contain the true value in 95% of the replications, a 90% confidence interval in 90% of the replications, and so on.
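In symbols, with \(R\) replications the simulated coverage rate of a 95% confidence interval is
\[ \widehat{CR} = \frac{1}{R}\sum_{r=1}^{R} \mathbf{1}\left\{\theta \in \left[\widehat{CI}_r^{lower},\ \widehat{CI}_r^{upper}\right]\right\}, \]
which should be close to 0.95 if the standard errors are valid.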
set.seed(1234)
# Simulation settings
n = 100 # Sample size
p = 100 # Number of covariates
n_rep = 1000 # Number of replications
theta = 0 # True treatment effect
sign_flip = rep(c(1,-1),p/2)
# DGP specifications
cov_mat = toeplitz(0.7^(0:(p - 1)))
sparsity_patterns = list(
same = rbind(c(seq(1, 0.1, -0.1), rep(0, p-10)),
c(seq(1, 0.1, -0.1), rep(0, p-10))),
asymmetric = rbind(c(seq(0.1, 1, 0.1), rep(0, p-10)),
c(seq(1, 0.1, -0.1), rep(0, p-10))),
dense = rbind(c(seq(0.1, 1, 0.1), rep(0.1, p-10)),
c(seq(1, 0.1, -0.1), rep(0.1, p-10)))
)
# Function to simulate data and return coverage
simulate_coverage = function(delta, beta) {
effect = coverage = matrix(NA, n_rep, 2)
colnames(effect) = colnames(coverage) = c("Single", "Double")
for (i in 1:n_rep) {
x = rmvnorm(n, mean = rep(0, p), sigma = cov_mat)
w = x %*% delta + rnorm(n, 0, 1)
y = theta * w + x %*% beta + rnorm(n, 0, 4)
# Single selection
sel_y = rlasso(x, y)
model_single = lm_robust(y ~ w + x[, sel_y$beta != 0])
effect[i,1] = model_single$coefficients[2]
ci_single = confint(model_single)["w", ]
# Double selection
sel_w = rlasso(x, w)
union_selection = sel_y$beta != 0 | sel_w$beta != 0
model_double = lm_robust(y ~ w + x[, union_selection])
effect[i,2] = model_double$coefficients[2]
ci_double = confint(model_double)["w", ]
# Check coverage
coverage[i, 1] = theta >= ci_single[1] & theta <= ci_single[2]
coverage[i, 2] = theta >= ci_double[1] & theta <= ci_double[2]
}
list(bias = colMeans(effect)-theta, mse = colMeans((effect-theta)^2), coverage = colMeans(coverage))
}
# Run simulations for each DGP
results_bias = results_mse = results_cr = tibble(
DGP = c("Same", "Asymmetric", "Dense"),
Single = numeric(length(sparsity_patterns)),
Double = numeric(length(sparsity_patterns))
)
for (i in seq_along(sparsity_patterns)) {
delta = sparsity_patterns[[i]][1,]
beta = sparsity_patterns[[i]][2,]
run = simulate_coverage(delta, beta)
results_bias[i, 2:3] = t(run$bias)
results_mse[i, 2:3] = t(run$mse)
results_cr[i, 2:3] = t(run$coverage)
}
print(results_cr)
# Ensure the order of DGP is maintained in the plot
results_cr$DGP = factor(results_cr$DGP, levels = c("Same", "Asymmetric", "Dense"))
# Convert data from wide to long format for ggplot2
results_long = pivot_longer(results_cr, cols = c(Single, Double), names_to = "Selection", values_to = "CoverageRate")
# Create the bar plot
ggplot(results_long, aes(x = DGP, y = CoverageRate, fill = Selection)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.7)) +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_brewer(palette = "Pastel1", direction = -1) +
labs(title = "Coverage Rates for Single vs Double Selection",
y = "Coverage Rate (%)",
x = "DGP Type",
fill = "Selection Method") +
theme_minimal() +
theme(legend.position = "top") +
geom_hline(yintercept = c(0,0.95,1), linetype = c("solid","dashed","solid"), color = c("black","red","black"), linewidth = 0.7) +
geom_text(aes(label=scales::percent(CoverageRate), group=Selection),
position=position_dodge(width = 0.7), vjust=-0.25)
Double selection shows nominal coverage in the two sparse settings. However, confidence intervals of single selection undercover substantially in all settings. This means we would find significant effects more often than we should. We don’t like that. Even in the dense DGP where double selection showed higher bias, its coverage rate is closer to the nominal rate:
Single selection is wrong and relatively confident about it.
Double selection is even more wrong, but at least it reflects its uncertainty better.
With double selection we wrongly reject the correct null hypothesis of no effect in roughly 35% of the replications, whereas we would expect 5%. Single selection, on the other hand, rejects in over 90% of the replications. This means we would nearly always find a significant effect that is not there. We don’t like this.
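These rejection rates are just one minus the coverage rates computed above; a quick derived summary (a sketch, not part of the original code):
# Rejection rate of the true null H0: theta = 0 at the 5% level
results_cr %>% mutate(across(c(Single, Double), ~ 1 - .x))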
Single selection produces biased estimates even in sparse settings.
Double Selection successfully removes omitted variable bias by variables that would be missed by single selection.
The price to pay is a higher variance compared to the single selection estimator.
Double Selection blows up if the world is not sparse \(\Rightarrow\) we have to “bet on sparsity”, which is harder to swallow the smaller the sample size is
However, if we are interested in statistical inference about the causal target parameter, double selection remains crucial, especially because its statistical inference is more reliable even in cases where approximate sparsity does not hold.
Feel free to play around with the code. This is useful to sharpen and challenge your understanding of the methods. Think about the consequences of a modification before you run it and check whether the results are in line with your expectations. Some suggestions:
Modify DGP (correlation of covariates, coefficients, noise term, …)
Vary the number of observations
Implement variable selection mimicking what researchers might do, e.g. based on t-tests (see the sketch below)
Change the value of theta to, e.g., 1 or -1, and explain your observations (hint: think about the reduced form)
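For the t-test suggestion, a minimal sketch of what such a naive pre-selection could look like (the helper select_by_ttest is hypothetical and not part of the original code):
# Keep covariates whose univariate t-statistic in a regression of y on that
# covariate alone exceeds a critical value, then run OLS on the kept set
select_by_ttest = function(x, y, crit = 1.96) {
  tstats = apply(x, 2, function(xj) abs(coef(summary(lm(y ~ xj)))["xj", "t value"]))
  which(tstats > crit)
}
# Example use with one draw from the DGP above:
# x_sel = x[, select_by_ttest(x, y)]
# summary(lm_robust(y ~ w + x_sel))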