I'm running a multiple linear regression on a set of sports data. When I run the regression on one season, which has 380 data points and which I thought was a fair amount, I get quite a high p-value on one of my independent variables. However, when I run the regression on all my data points (I have more than 3000 data points in total), the p-value decreases from .97 to .02. As I add more data points, the p-value decreases further. My question is: is my variable really significant or am I just decreasing the p-value by adding more data points?
Best Answer
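Let's say that your independent variable is $x_i$ and its estimated regression coefficient is $\hat\beta_i$. The p-value for $\hat\beta_i$ is $P(t < -|t^*|) + P(t > |t^*|)$, where $t$ follows a Student's $t$ distribution with $n-q$ degrees of freedom and $t^* = \frac{\hat\beta_i}{\sqrt{(X'X)^{-1}_{ii}\frac{RSS}{n-q}}}$. Here $RSS$ is the residual sum of squares, $n$ is the number of observations and $q$ is the number of estimated coefficients.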
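To make the formula concrete, here is a minimal sketch that computes $t^*$ and the two-sided p-value by hand with NumPy and SciPy. The function name `coef_p_values`, the simulated data, and the coefficient values are assumptions made up for illustration; they are not the sports data from the question.

```python
import numpy as np
from scipy import stats

def coef_p_values(X, y):
    """Return OLS estimates, t statistics and two-sided p-values.

    X is an (n, q) design matrix that already contains an intercept column;
    y is the (n,) response vector.
    """
    n, q = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    rss = resid @ resid                              # residual sum of squares
    se = np.sqrt(np.diag(XtX_inv) * rss / (n - q))   # denominator of t*
    t_star = beta_hat / se
    # P(t < -|t*|) + P(t > |t*|) = 2 * P(t > |t*|) by symmetry
    p = 2.0 * stats.t.sf(np.abs(t_star), df=n - q)
    return beta_hat, t_star, p

# Hypothetical data: one season's worth of rows (n = 380), two predictors.
rng = np.random.default_rng(0)
n = 380
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 0.5 * x1 + 0.0 * x2 + rng.normal(size=n)  # x2 has no true effect

beta_hat, t_star, p = coef_p_values(X, y)
for name, b, t, pv in zip(["const", "x1", "x2"], beta_hat, t_star, p):
    print(f"{name:>5}  beta={b: .3f}  t*={t: .2f}  p={pv:.4f}")
```

The same numbers should come out of any standard regression routine; writing it out only shows where $n$ enters the calculation.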
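The p-value is large when $|t^*|$ is small and small when $|t^*|$ is large. As $n$ grows, $RSS/(n-q)$ settles near the error variance, but $(X'X)^{-1}_{ii}$ shrinks roughly like $1/n$, so the denominator of $t^*$ (the standard error) gets smaller. Unless the true coefficient is exactly zero, $|t^*|$ therefore tends to get larger and the p-value decreases just because $n$ grows.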
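A small simulation can illustrate this. The sketch below assumes a tiny but nonzero true effect (0.05, chosen arbitrarily) and refits the regression on the first 380, 3,000 and 30,000 rows of the same simulated data set; the helper `t_test_for_coef` and all the numbers are hypothetical.

```python
import numpy as np
from scipy import stats

def t_test_for_coef(X, y, i):
    """Standard error, t statistic and two-sided p-value for coefficient i."""
    n, q = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    se = np.sqrt(XtX_inv[i, i] * (resid @ resid) / (n - q))
    t_star = beta_hat[i] / se
    p = 2.0 * stats.t.sf(abs(t_star), df=n - q)
    return se, t_star, p

# Assumed data-generating process: small true slope, unit-variance noise.
rng = np.random.default_rng(1)
n_max = 30_000
x = rng.normal(size=n_max)
y = 0.05 * x + rng.normal(size=n_max)
X = np.column_stack([np.ones(n_max), x])

# Refit on growing prefixes, mimicking "adding more data points".
results = {}
for n in (380, 3_000, 30_000):
    results[n] = t_test_for_coef(X[:n], y[:n], i=1)
    se, t_star, p = results[n]
    print(f"n={n:>6}  se={se:.4f}  t*={t_star:6.2f}  p={p:.4f}")
```

The standard error should shrink roughly in proportion to $1/\sqrt{n}$, so an effect that is hard to detect in one season can become clearly detectable with all the data, even though the effect itself has not changed.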
This is why "in large samples is more appropriate to choose a size of 1% or less rather than the 'traditional' 5%." (M. Verbeek, A Guide to Modern Econometrics, 3rd edition, §2.5.7, p. 32). If you choose 1%, your coefficient is not statistically significant when $p=0.02$.