Solved – Different notions of over-parameterization

While reading a paper, I came across the following statement:

This prediction function will be parameterized by a parameter vector $\theta$ in a parameter space $\Theta$. Often, this prediction function will be over-parameterized, and two parameters $(\theta, \theta') \in \Theta^2$ that yield the same prediction function everywhere, $\forall x \in \mathscr{X}, f_\theta(x) = f_{\theta'}(x)$, are called observationally equivalent.
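For concreteness, here is a toy instance of that definition (my own, not from the paper): take $f_\theta(x) = \theta_1 \theta_2 x$ with $\theta = (\theta_1, \theta_2) \in \Theta = \mathbb{R}^2$. Then $\theta = (2, 3)$ and $\theta' = (6, 1)$ give $f_\theta(x) = f_{\theta'}(x) = 6x$ for every $x$, so they are observationally equivalent even though $\theta \neq \theta'$.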

I thought the general notion of over-parameterization, at least within deep learning, was essentially that of overfitting the parameters. However, the authors of the paper seem to be talking about a different idea. Are these two notions the same thing in different words, or are they two completely different concepts?

The two concepts are related. Over-parameterization, having more model parameters than necessary, means that we are fitting a richer model than the data-generating process requires.

For example, given a true model $Y = X + \epsilon$, we might try the following two models to explain/predict $Y$ using $X$:

$Y = \theta_1 X + \epsilon$

and

$Y = \theta_1 X + \theta_2 X^2 + \epsilon$

The second model is over-parameterized.

In practice, the $Y$ and $X$ data will be noisy (because of measurement error, or because the true quantities are unobservable and we use proxies for them). The second model is therefore very likely to achieve a better in-sample fit, because the squared term helps it fit the sample noise. But that same flexibility leads to worse performance out of sample (since the noise is, in the population, independent of $X$). So, in general, over-parameterization will lead to overfitting.
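To see this concretely, here is a minimal simulation sketch (my own illustration, not from the original answer; the sample sizes and noise level are arbitrary choices). It fits both models above on data generated from the true model $Y = X + \epsilon$ and compares in-sample and out-of-sample mean squared error; the over-parameterized model typically fits the training sample slightly better but predicts slightly worse on fresh data.

```python
# Sketch: compare the correctly specified model Y = theta_1*X with the
# over-parameterized model Y = theta_1*X + theta_2*X^2 on noisy data
# generated from the true model Y = X + eps.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    # True data-generating process: Y = X + eps (noise level 0.5 is arbitrary).
    x = rng.uniform(-1.0, 1.0, size=n)
    y = x + rng.normal(scale=0.5, size=n)
    return x, y

x_train, y_train = simulate(30)      # small training sample
x_test, y_test = simulate(10_000)    # large fresh sample for out-of-sample error

def design(x, degree):
    # Columns X, X^2, ..., X^degree (no intercept, matching the models above).
    return np.column_stack([x**d for d in range(1, degree + 1)])

for degree, label in [(1, "Y = theta_1*X"), (2, "Y = theta_1*X + theta_2*X^2")]:
    X_tr, X_te = design(x_train, degree), design(x_test, degree)
    theta, *_ = np.linalg.lstsq(X_tr, y_train, rcond=None)  # least-squares fit
    mse_in = np.mean((y_train - X_tr @ theta) ** 2)
    mse_out = np.mean((y_test - X_te @ theta) ** 2)
    print(f"{label}: in-sample MSE = {mse_in:.4f}, out-of-sample MSE = {mse_out:.4f}")
```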
