Why do linear regression and generalized linear models have seemingly inconsistent assumptions?
- In linear regression, we assume the residual comes from a Gaussian distribution.
- In other regressions (logistic regression, Poisson regression), we assume the response comes from some distribution (binomial, Poisson, etc.).
Why do we sometimes place the assumption on the residual and other times on the response? Is it because we want to derive different properties?
EDIT: I think mark999's answer shows the two forms are equivalent. However, I have one additional doubt about i.i.d.:
My other question, Is there an i.i.d. assumption in logistic regression?, shows that generalized linear models do not have an i.i.d. assumption (the samples are independent but not identically distributed).
Is it true that for linear regression, if we place the assumption on the residuals, we have i.i.d. samples, but if we place the assumption on the response, we have independent but not identically distributed samples (different Gaussians with different $\mu_i$)?
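To make that second reading concrete, here is a minimal R sketch (parameter values are my own, purely illustrative): each $y_i$ is drawn from its own Gaussian, so the responses are independent but not identically distributed, while the errors $y_i - \mu_i$ are i.i.d.

```r
set.seed(1)
n <- 1000
x <- runif(n)                       # fixed covariate values
mu <- 1 + 2 * x                     # each observation has its own mean mu_i
y <- rnorm(n, mean = mu, sd = 0.5)  # independent, but not identically distributed
eps <- y - mu                       # the errors are i.i.d. N(0, 0.5)
c(mean(eps), sd(eps))               # approximately 0 and 0.5
```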
Best Answer
Simple linear regression having Gaussian errors is a very nice attribute that does not generalize to generalized linear models.
In generalized linear models, the response follows a given distribution conditional on its mean. Linear regression follows this pattern: if we have
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
with $\epsilon_i \sim N(0, \sigma)$
then we also have
$y_i \sim N(\beta_0 + \beta_1 x_i, \sigma)$
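The two formulations describe the same data-generating process. As a quick sanity check (a sketch with arbitrary parameter values of my own), simulating $y$ both ways gives empirically indistinguishable distributions:

```r
set.seed(42)
n <- 1e5
x <- runif(n)
beta0 <- 1; beta1 <- 2; sigma <- 0.5
y_error_form    <- beta0 + beta1 * x + rnorm(n, sd = sigma)        # error formulation
y_response_form <- rnorm(n, mean = beta0 + beta1 * x, sd = sigma)  # response formulation
ks.test(y_error_form, y_response_form)  # large p-value: no detectable difference
```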
Okay, so the response follows the given distribution for generalized linear models, but for linear regression we also have that the residuals follow a Gaussian distribution. Why is it emphasized that the residuals are normal when that's not the generalized rule? Because it's the much more useful rule: normality of the residuals is much easier to examine. If we subtract out the estimated means, the residuals should all have roughly the same mean (0) and roughly the same variance, and should be roughly normally distributed. (I say "roughly" because we don't have perfect estimates of the regression parameters, so the estimates of $\epsilon_i$ will have different variances depending on the values of $x$. But hopefully there's enough precision in the estimates that this is ignorable!)
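In practice, one might examine this with a normal Q-Q plot of the fitted residuals; here is a minimal sketch on simulated data (values of my own choosing):

```r
set.seed(7)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.3)  # data that satisfy the linear model
fit <- lm(y ~ x)
qqnorm(residuals(fit))  # points should lie close to a straight line
qqline(residuals(fit))
```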
On the other hand, looking at the unadjusted $y_i$'s, we can't really tell whether they are normal when they all have different means. For example, consider the following model:
$y_i = 0 + 2 \times x_i + \epsilon_i$
with $\epsilon_i \sim N(0, 0.2)$ and $x_i \sim \text{Bernoulli}(p = 0.5)$
Then the $y_i$ will be highly bimodal, but this does not violate the assumptions of linear regression! The residuals, on the other hand, will follow a roughly normal distribution.
Here's some R code to illustrate:

```r
x <- rbinom(1000, size = 1, prob = 0.5)  # binary covariate
y <- 2 * x + rnorm(1000, sd = 0.2)       # bimodal response from a valid linear model
fit <- lm(y ~ x)
resids <- residuals(fit)
par(mfrow = c(1, 2))
hist(y, main = 'Distribution of Responses')       # two well-separated modes near 0 and 2
hist(resids, main = 'Distribution of Residuals')  # roughly normal, centered at 0
```
Similar Posts:
- Solved – Linear regression and assumptions about response variable
- Solved – Relationship between noise term ($\epsilon$) and MLE solution for Linear Regression Models