Which statistical model is appropriate when the response is continuous and the predictors are a mix of continuous and categorical? What is the disadvantage in using GLM combined with gaussian family?
Here is my dataset and model in R:
df <- structure(list(as.factor.pred. = structure(c(1L, 1L, 5L, 3L, 2L, 8L, 3L, 5L, 2L, 2L,
    3L, 2L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 4L, 1L, 1L, 1L, 1L, 3L, 3L, 2L,
    2L, 2L, 1L, 7L, 3L, 3L, 6L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 2L, 2L, 1L, 1L, 2L,
    1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("A", "B", "C", "D", "E", "F", "G", "H"),
    class = "factor"),
    res = c(33, 33, 37, 32, 32, 26, 33, 28, 25, 34, 29, 35, 26, 20, 27, 19, 30, 33, 27, 24,
    26, 28, 27, 23, 26, 25, 24, 26, 24, 25, 21, 21, 23, 24, 23, 27, 23, 20, 21, 22,
    22, 22, 22, 23, 23, 21, 22, 21, 21, 23, 23, 18, 20, 18, 18, 18, 19)),
    .Names = c("as.factor.pred.", "res"), row.names = c(NA, -57L), class = "data.frame")
names(df)[1] <- "pred" ## fix up the names to match formula
model <- glm(res ~ pred, data = df)
summary(model)

Call:
glm(formula = res ~ pred, data = df)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
 -6.625  -3.000  -0.625   2.000  10.000

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.300e+01  9.080e-01  25.331   <2e-16 ***
predB        2.625e+00  1.471e+00   1.784   0.0806 .
predC        4.714e+00  1.971e+00   2.391   0.0207 *
predD        3.500e+00  3.397e+00   1.030   0.3080
predE        6.333e+00  2.823e+00   2.243   0.0294 *
predF       -8.283e-15  4.718e+00   0.000   1.0000
predG        1.000e+00  4.718e+00   0.212   0.8330
predH        3.000e+00  4.718e+00   0.636   0.5278
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 21.43562)

    Null deviance: 1277.6 on 56 degrees of freedom
Residual deviance: 1050.3 on 49 degrees of freedom
AIC: 345.85
Best Answer
If the model you wish to fit is linear in its parameters and the errors are Gaussian with constant variance, then a linear model would be a reasonable start, for example via the lm() function in R.
As the linear model is a special case of the GLM, there is no real difference in the model being fitted, but minor differences may show up in practice because the two implementations use different algorithms. Fitting that same model in R via glm() should give the same fit (coefficients) up to machine precision, or with some small differences in the last few decimal places. However, fitting via glm(...., family = gaussian) would be inefficient compared to fitting via lm(), because glm() uses iteratively reweighted least squares where lm() solves the least-squares problem directly.
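As a minimal sketch of this point, assuming R's built-in iris data rather than the OP's data (the names fit_lm and fit_glm are purely illustrative), the two functions can be compared directly. The iter component of a glm object records how many iteratively reweighted least squares passes were run, which is the extra work that lm() avoids.

```r
# Same Gaussian linear model fitted two ways on a built-in dataset
fit_lm  <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
fit_glm <- glm(Sepal.Length ~ Petal.Length + Species,
               family = gaussian, data = iris)

# Coefficients should agree up to numerical tolerance
same_coef <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
print(same_coef)

# Number of iteratively reweighted least squares passes glm() performed
print(fit_glm$iter)
```

On a dataset this small the cost is negligible; it only starts to matter for large problems or when many models are fitted in a loop.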
Note that the lm() function in R fits the so-called General Linear Model, the fusion of "regression" and ANOVA. Hence it is fully capable of dealing with continuous and factor variables.
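To illustrate how lm() handles a mix of continuous and factor predictors, here is a small sketch on the built-in iris data (an assumption for illustration; the OP's data has only the single factor pred). With R's default treatment contrasts, the factor is expanded into indicator columns, with the first level absorbed into the intercept.

```r
# Continuous response, one continuous and one factor predictor
fit <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# The factor is expanded into treatment-contrast dummy columns;
# the first level (setosa) is absorbed into the intercept
X <- model.matrix(fit)
print(colnames(X))
print(head(X, 3))

# One slope for the continuous predictor,
# one offset for each non-reference level of the factor
print(coef(fit))
```

The design matrix is what makes "regression" and ANOVA the same model: both reduce to least squares on the columns of X.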
The above is conditional upon the distribution of the errors and hence the response. You'll need to specify more about the exact problem for a more informed response.
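One rough way to probe those assumptions, sketched here on the built-in PlantGrowth data (an illustrative stand-in, not the OP's data), is to look at the residuals of the fitted model. The two tests below are just one possible choice; they are not prescribed by anything above.

```r
# One-factor linear model on a built-in dataset
fit <- lm(weight ~ group, data = PlantGrowth)

# Normality of the residuals (Shapiro-Wilk test)
sw <- shapiro.test(residuals(fit))
print(sw$p.value)

# Constant variance across the factor levels (Bartlett test)
bt <- bartlett.test(weight ~ group, data = PlantGrowth)
print(bt$p.value)
```

Graphical checks are often more informative than formal tests; plot(fit, which = 1:2) draws the residuals-versus-fitted and normal Q-Q plots for the same model.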
Update
In light of the OP posting data, we can show the equivalence. In the below, model is as per the OP's question, whilst model2 is the same model fitted via lm() instead of glm():
> anova(model, test = "F")
Analysis of Deviance Table

Model: gaussian, link: identity

Response: res

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev      F Pr(>F)
NULL                    56     1277.6
pred  7   227.23        49     1050.3 1.5144 0.1846
> anova(model2)
Analysis of Variance Table

Response: res
          Df  Sum Sq Mean Sq F value Pr(>F)
pred       7  227.23  32.462  1.5144 0.1846
Residuals 49 1050.35  21.436
Notice that the Deviance of the model is the same as the sums of squares in model2, and the rest of the important numbers, F and its p-value, are the same. Likewise, the estimated values of the model coefficients are the same:
> coef(model)
  (Intercept)         predB         predC         predD         predE
 2.300000e+01  2.625000e+00  4.714286e+00  3.500000e+00  6.333333e+00
        predF         predG         predH
-8.283393e-15  1.000000e+00  3.000000e+00
> coef(model2)
  (Intercept)         predB         predC         predD         predE
 2.300000e+01  2.625000e+00  4.714286e+00  3.500000e+00  6.333333e+00
        predF         predG         predH
-8.283393e-15  1.000000e+00  3.000000e+00
> all.equal(coef(model), coef(model2))
[1] TRUE
There appear to be some small differences between groups A and C and between groups A and E, but overall pred does not explain a significant amount of variance in the response.
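The comparison in this update can be reproduced with something like the following. The call that created model2 is not shown above, so the lm() call here is an assumption based on its description; the data frame is rebuilt from the values in the question, with codes and labs as illustrative helper names.

```r
# Rebuild the OP's data: integer level codes and the response values
codes <- c(1, 1, 5, 3, 2, 8, 3, 5, 2, 2, 3, 2, 4, 1, 1, 1, 1, 2, 2, 2,
           1, 2, 4, 1, 1, 1, 1, 3, 3, 2, 2, 2, 1, 7, 3, 3, 6, 2, 1, 1,
           1, 1, 1, 1, 5, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2)
labs  <- c("A", "B", "C", "D", "E", "F", "G", "H")
res   <- c(33, 33, 37, 32, 32, 26, 33, 28, 25, 34, 29, 35, 26, 20, 27, 19,
           30, 33, 27, 24, 26, 28, 27, 23, 26, 25, 24, 26, 24, 25, 21, 21,
           23, 24, 23, 27, 23, 20, 21, 22, 22, 22, 22, 23, 23, 21, 22, 21,
           21, 23, 23, 18, 20, 18, 18, 18, 19)
df <- data.frame(pred = factor(labs[codes], levels = labs), res = res)

model  <- glm(res ~ pred, data = df)  # as in the question (gaussian is the default family)
model2 <- lm(res ~ pred, data = df)   # assumed form of the lm() fit

# Sequential F tests: analysis of deviance versus analysis of variance
print(anova(model, test = "F"))
print(anova(model2))

# Coefficients agree, and the residual deviance is the residual sum of squares
print(all.equal(coef(model), coef(model2)))
print(all.equal(deviance(model), sum(residuals(model2)^2)))
```

Because the reference level is A, each coefficient other than the intercept is the difference between that group's mean and the mean of group A.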