Solved – How is adding noise to training data equivalent to regularization

I've noticed that some people argue that adding noise to training data is equivalent to regularizing our predictor parameters. How is this the case?

  1. Some of the examples listed on SE discussing this topic focus more on e.g. LSTMs and SVMs, but can we do this for simpler models like a multiple linear regression?

  2. How will it affect our parameters' confidence intervals?

  3. Does the effect differ depending on the type of white noise we choose, e.g. Gaussian vs. uniform white noise?

Adding noise to the regressors in the training data is similar to regularization because it shrinks the estimated coefficients toward zero, just as a shrinkage penalty does.

Linear regression is an interesting example. Suppose $(Y_i,X_i)_{i=1}^n$ is a set of i.i.d. observations and that $$ Y_i = \beta_0 + \beta_1 X_i + U_i \qquad \mathbb{E}[U_i \mid X_i] = 0 $$ The population coefficient $\beta_1$ is equal to $$ \beta_1 = \frac{Cov(Y_i,X_i)}{Var(X_i)} $$ The estimated OLS coefficient $\hat{\beta}_1$ can be written as the sample analog of $\beta_1$. Now suppose that we add white noise, $Z_i = X_i + \varepsilon_i$, and assume that $\mathbb{E}[\varepsilon_i] = 0$, $Var(\varepsilon_i) = \sigma^2$, and that $\varepsilon_i$ is independent of $(Y_i,X_i)$. I have made no other assumption about the distribution of $\varepsilon_i$.

Then the population coefficient for a regression of $Y_i$ on $Z_i$ (the noisy regressor) is equal to $$ \tilde{\beta}_1 = \frac{Cov(Y_i,Z_i)}{Var(Z_i)} = \frac{Cov(Y_i,X_i + \varepsilon_i)}{Var(X_i + \varepsilon_i)} = \frac{Cov(Y_i,X_i)}{Var(X_i) + \sigma^2} = \frac{Var(X_i)}{Var(X_i)+\sigma^2} \times \beta_1 $$ The third equality uses the independence of $\varepsilon_i$ from $(Y_i,X_i)$, which gives $Cov(Y_i,\varepsilon_i) = 0$ and $Var(X_i + \varepsilon_i) = Var(X_i) + \sigma^2$. Therefore, $\tilde{\beta}_1$ shrinks toward zero as $\sigma^2$ increases, and its estimator shrinks toward zero as well. We can use held-out data to choose a sequence $\sigma_n^2 \to 0$ that achieves the optimal bias-variance trade-off via cross-validation.
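The shrinkage factor can be checked numerically. The following is a minimal simulation sketch, not part of the original derivation: the sample size, the coefficient values, $Var(X_i) = 1$, $\sigma^2 = 0.5$, and the helper names `ols_slope` and `simulate_noisy_slopes` are all assumptions chosen for illustration. It compares Gaussian and uniform noise with the same variance, since the derivation only uses the mean and variance of $\varepsilon_i$.

```python
import numpy as np

def ols_slope(x, y):
    """Sample analog of Cov(y, x) / Var(x)."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

def simulate_noisy_slopes(n=200_000, beta0=1.0, beta1=2.0, sigma2=0.5, seed=0):
    """Fit the slope on clean and noise-perturbed regressors (illustrative values)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=n)      # Var(X) = 1 by construction
    u = rng.normal(0.0, 1.0, size=n)      # error term with E[U | X] = 0
    y = beta0 + beta1 * x + u

    # Two white-noise distributions with the same variance sigma2.
    eps_gauss = rng.normal(0.0, np.sqrt(sigma2), size=n)
    half_width = np.sqrt(3.0 * sigma2)    # Uniform(-a, a) has variance a^2 / 3
    eps_unif = rng.uniform(-half_width, half_width, size=n)

    return {
        "clean": ols_slope(x, y),
        "gaussian": ols_slope(x + eps_gauss, y),
        "uniform": ols_slope(x + eps_unif, y),
        # Var(X) / (Var(X) + sigma2) * beta1, with Var(X) = 1
        "predicted": beta1 * 1.0 / (1.0 + sigma2),
    }

slopes = simulate_noisy_slopes()
for name, value in slopes.items():
    print(f"{name:>9}: {value:.3f}")
```

Under these assumptions, both noisy slopes should land close to the predicted value and close to each other, which suggests that in this simple model the shrinkage depends on $\sigma^2$ rather than on the shape of the noise distribution.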

If you want to do inference, you clearly need some form of adjustment, both because the estimator is biased and because its variance depends on $\sigma^2$. The process for selecting $\sigma^2$ can also distort the confidence intervals.
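As one possible illustration of the bias part of that adjustment, here is a hedged sketch that inverts the shrinkage factor using the fact that $\sigma^2$ is known when we add the noise ourselves. The function name `debiased_slope` and the simulated values are assumptions for this example. It does not address the variance of the estimator or the effect of selecting $\sigma^2$ from the data.

```python
import numpy as np

def debiased_slope(z, y, sigma2):
    """Undo the factor Var(X) / (Var(X) + sigma2), assuming sigma2 is known."""
    var_z = np.var(z, ddof=1)
    naive = np.cov(z, y, ddof=1)[0, 1] / var_z
    # Var(X) = Var(Z) - sigma2, so beta_1 = naive * Var(Z) / (Var(Z) - sigma2).
    return naive * var_z / (var_z - sigma2)

# Simulated data with illustrative values: beta_1 = 2, Var(X) = 1, sigma2 = 0.5.
rng = np.random.default_rng(1)
n, sigma2 = 200_000, 0.5
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
z = x + rng.normal(0.0, np.sqrt(sigma2), size=n)

naive_slope = np.cov(z, y, ddof=1)[0, 1] / np.var(z, ddof=1)
corrected_slope = debiased_slope(z, y, sigma2)
print(f"naive: {naive_slope:.3f}, corrected: {corrected_slope:.3f}")
```

Note that removing the bias this way also removes the shrinkage, so it is only of interest when the goal is inference on $\beta_1$ rather than prediction.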
