While reading a paper, I came across the following statement:

> This prediction function will be parameterized by a parameter vector $\theta$ in a parameter space $\Theta$. Often, this prediction function will be over-parameterized, and two parameters $(\theta, \theta') \in \Theta^2$ that yield the same prediction function everywhere, $\forall x \in \mathscr{X},\ f_\theta(x) = f_{\theta'}(x)$, are called observationally equivalent.
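For concreteness, here is a small toy sketch of my own (not from the paper) of what I understand "observationally equivalent" to mean: the prediction function below is over-parameterized, and two different parameter vectors produce identical predictions for every input.

```python
# Toy example (mine, not the paper's): an over-parameterized prediction
# function f_theta(x) = (theta_1 * theta_2) * x with Theta = R^2.
# The two parameter vectors below differ, yet give the same prediction
# for every x, so they are observationally equivalent.
import numpy as np

def f(theta, x):
    """Prediction function parameterized by theta = (theta_1, theta_2)."""
    return theta[0] * theta[1] * x

theta = np.array([2.0, 3.0])
theta_prime = np.array([1.0, 6.0])   # different parameters, same product

x_grid = np.linspace(-5, 5, 101)
print(np.allclose(f(theta, x_grid), f(theta_prime, x_grid)))  # True
```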
I thought the general notion of over-parameterization, at least within deep learning, referred to overfitting the parameters. However, the authors of the paper seem to be talking about a different idea. Are these two notions the same thing in different words, or are they describing two completely different concepts?
Best Answer
The two concepts are related. Over-parametrization, having more model parameters than necessary, means that we are fitting a richer model than the data-generating process requires.
For example, given a true model $Y = X + \epsilon$, we might try the following two models to explain/predict $Y$ using $X$:
$Y = \theta_1 X + \epsilon$
and
$Y = \theta_1 X + \theta_2 X^2 + \epsilon$
The second model is over-parametrized.
In practice, the observed $Y$ and $X$ data will be noisy (either because of measurement error or because the true quantities are unobservable and we use proxies for them). The second model will therefore tend to show a better in-sample fit, because the squared term helps it fit the sample noise. But this leads to worse performance out of sample, since the noise is (most likely) independent of $X$ in the population. So, in general, over-parametrization will lead to overfitting.
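To make this concrete, here is a minimal simulation sketch (the sample sizes, noise level, and random seed are arbitrary choices of mine): data are drawn from the true model $Y = X + \epsilon$, both candidate models are fit by ordinary least squares, and in-sample versus out-of-sample mean squared error are compared. Because the first model's design is nested in the second, the over-parameterized model never fits the training sample worse, but it typically does somewhat worse out of sample.

```python
# Simulation sketch: compare the correct model Y = t1*X with the
# over-parameterized model Y = t1*X + t2*X^2 on data generated from
# the true model Y = X + eps.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, sigma = 20, 10_000, 1.0

def simulate(n):
    x = rng.uniform(-2, 2, size=n)
    y = x + sigma * rng.normal(size=n)   # true model: Y = X + eps
    return x, y

x_tr, y_tr = simulate(n_train)
x_te, y_te = simulate(n_test)

# Design matrices: [X] for the correct model, [X, X^2] for the richer one.
X1_tr, X1_te = x_tr[:, None], x_te[:, None]
X2_tr, X2_te = np.column_stack([x_tr, x_tr**2]), np.column_stack([x_te, x_te**2])

for name, Xtr, Xte in [("Y = t1*X", X1_tr, X1_te),
                       ("Y = t1*X + t2*X^2", X2_tr, X2_te)]:
    theta, *_ = np.linalg.lstsq(Xtr, y_tr, rcond=None)   # least-squares fit
    mse_in = np.mean((y_tr - Xtr @ theta) ** 2)
    mse_out = np.mean((y_te - Xte @ theta) ** 2)
    print(f"{name:20s}  in-sample MSE = {mse_in:.3f}  out-of-sample MSE = {mse_out:.3f}")
```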