Solved – Can we find bounds on R-squared

We know that as the number of independent variables increases, the coefficient of determination $R^2$ will increase, but the adjusted $R^2$ may or may not increase. In the following question, for the sake of simplicity, I shall write only $R^2$, but it should be understood that the question applies to both $R^2$ and adjusted $R^2$. Further, we shall make life easier by assuming that all the conditions and assumptions of multiple regression are satisfied.

Question: Consider a multiple regression where the dependent variable $y$ depends on at most $n$ independent variables $x_1, x_2, \ldots, x_n$. For a given $k$, $1 \le k \le n$, we find the best $k$-variable linear fit for $y$. Let us denote the coefficient of determination of this best fit by $R_{max}^2(k)$.

Similarly, we find the worst possible $k$-variable fit and denote its coefficient of determination by $R_{min}^2(k)$.

Trivially, we have the following bounds:

$$R_{min}^2(k) \ge R_{min}^2(1),$$
$$R_{max}^2(k) \le R_{max}^2(n) = R^2(n).$$

My question is: can we improve the above bounds and express them as non-trivial functions of $n$, $k$, $R_{min}^2(1)$, and $R_{max}^2(n)$? Are any additional assumptions required to obtain such non-trivial bounds?

Motivation: I am currently working on linear modeling where I have a large number of independent variables, and I need a way to determine how small or large the coefficient of determination will be for a given $k$. Currently I follow various algorithmic approaches and write programs that compute the above bounds. This method is not very useful because, even with the best known algorithms such as leaps, the computation takes a long time as the number of variables increases. Therefore I want to see whether a theoretical bound is possible.
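To make the computational burden concrete, here is a minimal brute-force sketch (hypothetical Python, not the branch-and-bound method that leaps uses) that computes $R_{min}^2(k)$ and $R_{max}^2(k)$ by scanning all $\binom{n}{k}$ subsets; that subset count is what makes exhaustive search explode as $n$ grows.

```python
# Hypothetical sketch: brute-force R^2_min(k) and R^2_max(k) over all
# k-variable subsets. The number of subsets is C(n, k), which is why
# exhaustive search becomes infeasible for large n.
from itertools import combinations

import numpy as np


def r_squared(X, y):
    """R^2 of an OLS fit of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


def r2_bounds(X, y, k):
    """Return (R^2_min(k), R^2_max(k)) by exhaustive subset search."""
    p = X.shape[1]
    scores = [r_squared(X[:, list(s)], y) for s in combinations(range(p), k)]
    return min(scores), max(scores)


# Illustrative synthetic data only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ rng.normal(size=8) + rng.normal(size=100)
r2_lo, r2_hi = r2_bounds(X, y, 3)  # C(8, 3) = 56 subsets to scan
```

Even here, moving from $n=8$ to $n=40$ at $k=10$ raises the subset count from 56 to roughly $8.5 \times 10^8$, which is the scaling problem described above.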

My progress so far: Based on heuristic data I have generated with a computer program, I find that $R_{max}^2(k)$ approximately follows a logistic model

$$R_{max}^2(k) \approx \frac{R^2(n)}{1+ae^{-bk}},$$

where $a$ and $b$ are local constants that depend on the data being analyzed.
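For illustration, the conjectured logistic form can be fitted to a handful of $(k, R_{max}^2(k))$ pairs with `scipy.optimize.curve_fit`; the numbers below are made up for the sketch, not results from any real data set.

```python
# Hypothetical sketch: fit the conjectured logistic model
#   R^2_max(k) ≈ R^2(n) / (1 + a * exp(-b * k))
# to (k, R^2_max(k)) pairs from a subset-selection run.
import numpy as np
from scipy.optimize import curve_fit

r2_full = 0.90  # R^2(n) of the full model (assumed known)


def logistic(k, a, b):
    return r2_full / (1.0 + a * np.exp(-b * k))


# Purely illustrative values of R^2_max(k) for k = 1..8.
k_vals = np.arange(1, 9)
r2_max = np.array([0.42, 0.61, 0.74, 0.82, 0.86, 0.88, 0.89, 0.90])

(a_hat, b_hat), _ = curve_fit(logistic, k_vals, r2_max, p0=(1.0, 0.5))
```

If the logistic description holds, the fitted $a$ and $b$ could then be used to predict roughly how many variables are needed before $R_{max}^2(k)$ saturates near $R^2(n)$.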

I think I understand: your problem is one of model building, and of how to gauge whether a difference in the $R^2$ value before and after adjusting for a certain variable is of notable magnitude rather than a consequence of spurious associations in the data. This calls to mind information criteria (AIC and BIC), which encourage parsimonious models by penalizing the likelihood by the number of parameters. In OLS models, $R^2$ is related to the likelihood, so if your approach stems from this problem, there have been other developments along these lines.

To address your specific question: one minimal assumption is that any subsequent covariate must not be collinear with the covariates in your provisional model ($k$ taking any one of $1, 2, \ldots, p-1$). This assumption allows the inequality to be strict. Nothing further may be said in general, because the change $R_{k+1}^2 - R_{k}^2$ is a function of the residuals $r_k$ (or the design $\mathbf{X}_k$) of the provisional model as well as of the distribution of $x_{k+1}$. I suspect it is possible to choose sequences of either $\{x_{k+1}\}_i$ or $\{r_k\}_i$ such that $R_{k+1}^2 - R_{k}^2 \rightarrow_p 0$. To see this, consider $x_{k+1}$ an indicator for the first observation, and take $y_1 \rightarrow x_{k+1} \mathbf{X}_k \beta$. Then $R_{k+1}^2 - R_{k}^2 \rightarrow_p 0$. Furthermore, conditioning on all these factors is trivial, since it is equivalent to conducting the very inference you are interested in outright.
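The intuition behind the indicator-covariate example can be checked numerically: adding a dummy for a single observation zeroes that observation's residual, so the $R^2$ gain is roughly $r_1^2 / SS_{tot}$ and shrinks as the sample size grows. A hypothetical sketch:

```python
# Hypothetical sketch: the R^2 gain from adding an indicator covariate
# for one observation is roughly r_1^2 / SS_tot, which vanishes as n grows.
import numpy as np


def r_squared(X1, y):
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) ** 2).sum()


rng = np.random.default_rng(2)
gains = {}
for n in (50, 500, 5000):
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)
    X_base = np.column_stack([np.ones(n), x])
    dummy = np.zeros(n)
    dummy[0] = 1.0  # indicator for the first observation
    X_aug = np.column_stack([X_base, dummy])
    gains[n] = r_squared(X_aug, y) - r_squared(X_base, y)
```

The gains are nonnegative (adding a column can never worsen the fit) and of order $1/n$, consistent with the convergence-in-probability claim above.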

It can be said, however, that with a full-rank design matrix of rank $n$ you will have $R^2=1$ exactly, but that too is trivial and sheds no light on any of the previous covariates. If you were interested in the random process of including subsequent, unrelated vectors in a model, there might be some specific derivations under a number of highly sensitive assumptions such as multivariate normality, independence, or orthogonality.
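The full-rank remark is easy to verify by simulation: with $n$ observations and a design of rank $n$, OLS interpolates $y$ exactly even when every covariate is pure noise. A hypothetical sketch:

```python
# Hypothetical sketch: adding pure-noise covariates one at a time to a
# model with n observations drives R^2 monotonically up to exactly 1
# once the design matrix reaches rank n (OLS interpolates y).
import numpy as np

rng = np.random.default_rng(1)
n = 20
y = rng.normal(size=n)  # no true relationship to any covariate

r2_path = []
X1 = np.ones((n, 1))  # intercept-only design
for _ in range(n - 1):
    X1 = np.column_stack([X1, rng.normal(size=n)])  # add an unrelated covariate
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2_path.append(1.0 - resid @ resid / ((y - y.mean()) ** 2).sum())
```

Here `r2_path` is nondecreasing and its last entry is 1 up to floating-point error, illustrating why the unadjusted $R^2$ alone cannot distinguish signal from saturation.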
