Goals:
Illustrate how OLS overfits
Illustrate why we should evaluate predictions out-of-sample
Consider the following linear data generating process (DGP):
\(p\geq10\) independent and standard normal covariates: \(X \sim N(0,I_p)\), where \(I_p\) is the \(p\)-dimensional identity matrix
The outcome model is \(Y = \underbrace{\beta_0 + \beta_1 X_1 + ... + \beta_{10} X_{10}}_{\text{conditional expectation function }m(X)} + e\), where \(X_j\) is the \(j\)-th column of \(X\) and \(e \sim N(0,1)\)
We consider the following parameters \(\beta_0 = 0\), \(\beta_1 = 1\), \(\beta_2 = 0.9\), …, \(\beta_9 = 0.2\), \(\beta_{10} = 0.1\) such that the first 10 variables have a decreasing impact on the outcome and any further variables are just irrelevant noise with no true predictive power
We set \(p=99\) and draw a training sample with \(N_{tr}=100\) and a test sample with \(N_{te}=10,000\)
# Load the packages required for later
if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE); library(tidyverse)
if (!require("skimr")) install.packages("skimr", dependencies = TRUE); library(skimr)
set.seed(1234) # For replicability
# Define the important parameters
n_tr = 100
n_te = 10000
p = 99
beta = c(0,seq(1,0.1,-0.1),rep(0,p-10))
# Combine constant and randomly drawn covariates
x_tr = cbind(rep(1,n_tr),matrix(rnorm(n_tr*p),ncol=p))
x_te = cbind(rep(1,n_te),matrix(rnorm(n_te*p),ncol=p))
# Create the CEF using matrix multiplication for compactness
cfe_tr = x_tr %*% beta
cfe_te = x_te %*% beta
# Create the "observed" outcomes by adding noise
y_tr = cfe_tr + rnorm(n_tr,0,1)
y_te = cfe_te + rnorm(n_te,0,1)
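As a benchmark for the fits below: with independent standard normal covariates, the variance of the CEF equals the sum of the squared slope coefficients, so the population \(R^2\) of the true model is \(3.85/(3.85+1) \approx 0.79\). We can compute this directly:
# Population R-squared implied by the DGP: Var(m(X)) / (Var(m(X)) + Var(e))
sum(beta^2) / (sum(beta^2) + 1)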
Summarize the covariates:
skim(x_tr)
── Data Summary ────────────────────────
Values
Name x_tr
Number of rows 100
Number of columns 100
_______________________
Column type frequency:
numeric 100
________________________
Group variables None
Plot the resulting distribution of the conditional expectation function in the test sample:
hist(cfe_te)
Imagine that we were told that the true model consists only of the first ten variables. Then we could specify the correct model:
perfect_ols = lm(y_tr ~ x_tr[,2:11])
summary(perfect_ols)
Call:
lm(formula = y_tr ~ x_tr[, 2:11])
Residuals:
Min 1Q Median 3Q Max
-2.4716 -0.6772 -0.1431 0.6921 3.3305
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.08222 0.11159 -0.737 0.4632
x_tr[, 2:11]1 0.99921 0.11132 8.976 4.21e-14 ***
x_tr[, 2:11]2 0.88997 0.11319 7.862 8.31e-12 ***
x_tr[, 2:11]3 0.90388 0.11773 7.678 1.98e-11 ***
x_tr[, 2:11]4 0.72184 0.10824 6.669 2.10e-09 ***
x_tr[, 2:11]5 0.66229 0.10169 6.513 4.27e-09 ***
x_tr[, 2:11]6 0.54058 0.12074 4.477 2.23e-05 ***
x_tr[, 2:11]7 0.25639 0.11920 2.151 0.0342 *
x_tr[, 2:11]8 0.09273 0.11552 0.803 0.4243
x_tr[, 2:11]9 0.24757 0.12604 1.964 0.0526 .
x_tr[, 2:11]10 0.03388 0.11159 0.304 0.7621
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.073 on 89 degrees of freedom
Multiple R-squared: 0.8153, Adjusted R-squared: 0.7945
F-statistic: 39.28 on 10 and 89 DF, p-value: < 2.2e-16
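We can also place the OLS estimates next to the true coefficients to see how well the correct specification recovers them with only 100 observations:
# Compare the true coefficients to the OLS estimates of the correct model
cbind(true = beta[1:11], estimate = coef(perfect_ols))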
We can use it to produce fitted values in the training sample and predicted values in the test sample and plot them against the observed outcomes:
yhat_pols_tr = predict(perfect_ols)
plot(yhat_pols_tr,y_tr)
yhat_pols_te = x_te[,1:11] %*% perfect_ols$coefficients
plot(yhat_pols_te,y_te)
We observe a clear positive correlation between predicted and actual outcomes \(\Rightarrow\) the true model already does a good job with only 100 training observations.
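To back up the visual impression numerically, we can also compute the correlation between predictions and observed outcomes in both samples:
# Correlation between predictions and observed outcomes
cor(yhat_pols_tr, y_tr) # training sample
cor(yhat_pols_te, y_te) # test sample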
Next, let’s check the in-sample and the out-of-sample MSE:
paste("In-sample MSE:", round( mean( (y_tr - yhat_pols_tr)^2),3 ) )
[1] "In-sample MSE: 1.024"
paste("Out-of-sample MSE:", round( mean( (y_te - yhat_pols_te)^2),3 ) )
[1] "Out-of-sample MSE: 1.124"
The in-sample MSE understates the error we face on new data \(\Rightarrow\) a first sign of overfitting, although with a single draw this could also be due to chance.
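A rough way to check whether this gap is systematic rather than a fluke of this particular draw is to redraw the training sample a few times, refit the correct model, and look at the average difference between test and training MSE (a small sketch that reuses the fixed test sample):
# Redraw the training data several times and record test-MSE minus training-MSE of the correct model
gaps = sapply(1:20, function(s) {
  set.seed(s)
  x = cbind(rep(1, n_tr), matrix(rnorm(n_tr * p), ncol = p))
  y = x %*% beta + rnorm(n_tr, 0, 1)
  fit = lm(y ~ x[, 2:11])
  mean((y_te - x_te[, 1:11] %*% fit$coefficients)^2) - mean((y - predict(fit))^2)
})
mean(gaps) # positive on average: the in-sample MSE systematically understates the test error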
Now consider the case where nobody tells us the correct model and we gradually add the variables to the model. We start with a model including a constant and \(X_1\), then add \(X_2\), then \(X_3\), …, and end up with a model with a constant and 99 regressors.
We calculate and store four quantities: the training and test sample MSE that would be observable in a real application, i.e. the average of \((Y-\hat{m}(X))^2\), and the respective oracle MSEs, which combine the sample variance of the noise \(e\) with the average of \((m(X)-\hat{m}(X))^2\) and which we can only compute because we defined and thus know the true CEF \(m(X)\).
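For a fit \(\hat{m}\) that is held fixed and evaluated on an independent sample, the expected squared error decomposes as
\[
E\big[(Y-\hat{m}(X))^2\big] = \underbrace{E\big[(Y-m(X))^2\big]}_{\text{irreducible noise } \sigma^2} + E\big[(m(X)-\hat{m}(X))^2\big],
\]
because the cross term vanishes under \(E[e \mid X]=0\). The oracle MSEs below approximate exactly this right-hand side; for the training sample this is only an approximation, since \(\hat{m}\) is estimated on the same observations.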
# Container of the results
results_ols = matrix(NA,p,4)
colnames(results_ols) = c("Obs MSE train","Obs MSE test","Oracle MSE train","Oracle MSE test")
# Loop that gradually adds variables
for (i in 1:p) {
temp_ols = lm(y_tr ~ x_tr[,2:(i+1)])
temp_yhat_tr = predict(temp_ols)
temp_yhat_te = x_te[,1:(i+1)] %*% temp_ols$coefficients
# Calculate the observable MSEs in training and test sample
results_ols[i,1] = mean((y_tr - temp_yhat_tr)^2) # in-sample MSE
results_ols[i,2] = mean((y_te - temp_yhat_te)^2) # out-of-sample MSE
# Calculate the oracle MSEs that are only observable because we know the true CEF
results_ols[i,3] = var(y_tr - cfe_tr) + mean((cfe_tr - temp_yhat_tr)^2)
results_ols[i,4] = var(y_te - cfe_te) + mean((cfe_te - temp_yhat_te)^2)
}
Let’s first check how the in-sample MSE develops:
# prepare for plotting
df = data.frame("Number of variables" = 1:p,results_ols)
# Plot training MSE
ggplot(df, aes(Number.of.variables)) +
geom_line(aes(y = Obs.MSE.train, colour = "Obs.MSE.train"),size=1) +
ylab("MSE") + geom_hline(yintercept = 0)
The in-sample MSE drops substantially at the beginning when the relevant variables are added. However, it continues to decrease gradually as we add more and more noise variables. Finally, the MSE is exactly zero once we estimate as many parameters as we have observations.
paste("In-sample MSE with constant and ",p,"predictors:", round(results_ols[p,1]))
[1] "In-sample MSE with constant and 99 predictors: 0"
Remark: This also implies an \(R^2\) of 1 because \(R^2 = 1 - MSE/\widehat{Var}(Y) = 1 - 0/\widehat{Var}(Y) = 1\). This reminds us that optimizing the in-sample \(R^2\) when specifying OLS models is a bad guide for predictive performance.
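We can confirm this directly for the largest model:
# R-squared of the regression with a constant and all 99 regressors
summary(lm(y_tr ~ x_tr[, 2:(p + 1)]))$r.squared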
Now let’s plot the training sample MSE from before and the test sample MSE:
# Comparison of observable MSEs
ggplot(df, aes(Number.of.variables)) +
geom_line(aes(y = Obs.MSE.train, colour = "Obs.MSE.train"),size=1) +
geom_line(aes(y = Obs.MSE.test, colour = "Obs.MSE.test"),size=1) + ylab("MSE") +
geom_hline(yintercept = 0)
While the training MSE drops to zero, the test sample MSE explodes as the ratio of covariates to observations approaches one.
Let’s cut off the extreme values from the right:
ggplot(df[1:80,], aes(Number.of.variables)) +
geom_line(aes(y = Obs.MSE.train, colour = "Obs.MSE.train"),size=1) +
geom_line(aes(y = Obs.MSE.test, colour = "Obs.MSE.test"),size=1) + ylab("MSE") +
geom_hline(yintercept = 0)
Training and test MSE follow completely different trajectories once the relevant variables have been added.
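We can also read off from the results matrix which model size achieves the lowest test MSE and compare it to the largest model:
# Model size with the lowest observable test MSE, compared to the model with all 99 regressors
best_size = which.min(results_ols[, "Obs MSE test"])
best_size
results_ols[c(best_size, p), ]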
Finally, let’s check how the feasible test sample MSE compares to the infeasible oracle MSE:
ggplot(df[1:80,], aes(Number.of.variables)) +
geom_line(aes(y = Obs.MSE.test, colour = "Obs.MSE.test"),size=1) +
geom_line(aes(y = Oracle.MSE.test, colour = "Oracle.MSE.test"),size=1) + ylab("MSE") +
geom_hline(yintercept = 0)
The observable and the oracle MSE are nearly identical. This illustrates why using the observable out-of-sample MSE as a proxy for predictive performance is so powerful.
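We can quantify this over the plotted range, for example via the largest absolute gap between the two curves:
# Largest absolute difference between observable and oracle test MSE for the first 80 model sizes
max(abs(results_ols[1:80, "Obs MSE test"] - results_ols[1:80, "Oracle MSE test"]))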
It is easy to optimize the in-sample fit of OLS by adding noise variables, but this does not reflect the true predictive performance.
Test samples are crucial to understand the true predictive performance of models.
The observable MSE is nearly identical to the (in applications not observable) oracle MSE.
Feel free to play around with the code. This is useful to sharpen and challenge your understanding of the methods. Think about the consequences of a modification before you run it and check whether the results are in line with your expectation. Some suggestions:
Modify the DGP (change the betas, change the level of noise, introduce correlation between the covariates as in the sketch after this list, …)
Modify seed
Change training and test sample size
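For the correlated-covariates suggestion, a minimal sketch would be to multiply the standard normal draws by the Cholesky factor of a chosen correlation matrix (here an assumed equicorrelation of 0.5) and then rebuild the outcomes and re-run the analysis:
# Sketch: replace the independent draws by covariates with an assumed pairwise correlation of 0.5
rho = 0.5
Sigma = matrix(rho, p, p)
diag(Sigma) = 1                  # equicorrelation matrix
L = chol(Sigma)                  # upper-triangular factor with t(L) %*% L = Sigma
x_tr = cbind(rep(1, n_tr), matrix(rnorm(n_tr * p), ncol = p) %*% L)
x_te = cbind(rep(1, n_te), matrix(rnorm(n_te * p), ncol = p) %*% L)
# Afterwards, rebuild cfe_tr, cfe_te, y_tr, y_te as above and re-run the loop and plots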