Which statistical model is appropriate when the response is continuous and the predictors are a mix of continuous and categorical? What is the disadvantage in using GLM combined with gaussian family?

Here is my dataset and model in `R`:

```r
df <- structure(list(
  as.factor.pred. = structure(
    c(1L, 1L, 5L, 3L, 2L, 8L, 3L, 5L, 2L, 2L, 3L, 2L, 4L, 1L, 1L, 1L, 1L, 2L, 2L,
      2L, 1L, 2L, 4L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 1L, 7L, 3L, 3L, 6L, 2L,
      1L, 1L, 1L, 1L, 1L, 1L, 5L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L),
    .Label = c("A", "B", "C", "D", "E", "F", "G", "H"), class = "factor"),
  res = c(33, 33, 37, 32, 32, 26, 33, 28, 25, 34, 29, 35, 26, 20, 27, 19, 30, 33, 27,
          24, 26, 28, 27, 23, 26, 25, 24, 26, 24, 25, 21, 21, 23, 24, 23, 27, 23, 20,
          21, 22, 22, 22, 22, 23, 23, 21, 22, 21, 21, 23, 23, 18, 20, 18, 18, 18, 19)),
  .Names = c("as.factor.pred.", "res"), row.names = c(NA, -57L), class = "data.frame")

names(df)[1] <- "pred"  ## fix up the names to match formula
model <- glm(res ~ pred, data = df)
summary(model)
```

```
Call:
glm(formula = res ~ pred, data = df)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-6.625  -3.000  -0.625   2.000  10.000

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.300e+01  9.080e-01  25.331   <2e-16 ***
predB        2.625e+00  1.471e+00   1.784   0.0806 .
predC        4.714e+00  1.971e+00   2.391   0.0207 *
predD        3.500e+00  3.397e+00   1.030   0.3080
predE        6.333e+00  2.823e+00   2.243   0.0294 *
predF       -8.283e-15  4.718e+00   0.000   1.0000
predG        1.000e+00  4.718e+00   0.212   0.8330
predH        3.000e+00  4.718e+00   0.636   0.5278
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 21.43562)

    Null deviance: 1277.6  on 56  degrees of freedom
Residual deviance: 1050.3  on 49  degrees of freedom
AIC: 345.85
```


#### Best Answer

If the model you wish to fit is linear in its parameters and the errors are Gaussian with constant variance, then a linear model would be a reasonable start, for example via the `lm()` function in R.

As the linear model is a special case of the GLM, there is no real difference between the two, though minor differences may show up because the two implementations use different algorithms. Fitting that same model in R via `glm()` should give the same fit (coefficients) up to machine precision or some small differences in the last few decimal places. However, fitting via `glm(..., family = gaussian)` would be less efficient than fitting via `lm()`, because `glm()` fits by iteratively reweighted least squares whereas `lm()` solves the least-squares problem directly.
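To get a feel for the efficiency point, here is a minimal sketch on simulated data. The data frame `d`, its columns and the sample size are assumptions made up for illustration, not the OP's data, and the timings will vary by machine.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 1e5
d <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
d$y <- 1 + 2 * d$x + as.integer(d$g) + rnorm(n)

# Same model, two fitting routes
t_lm  <- system.time(fit_lm  <- lm(y ~ x + g, data = d))
t_glm <- system.time(fit_glm <- glm(y ~ x + g, data = d, family = gaussian))

# Coefficients agree to numerical tolerance
all.equal(coef(fit_lm), coef(fit_glm))

# Elapsed times, and the number of IWLS iterations glm() used
t_lm["elapsed"]
t_glm["elapsed"]
fit_glm$iter
```

The extra cost of `glm()` comes from its iterative fitting loop and the additional quantities it computes along the way; on small data sets such as the one in the question the difference is unlikely to be noticeable.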

Note that the `lm()` function in R fits the so-called General Linear Model, the fusion of "regression" and ANOVA. Hence it is fully capable of dealing with continuous and factor variables.
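As a small illustration of mixing the two kinds of predictor, the sketch below uses the built-in `iris` data (chosen here purely as an example; it is unrelated to the OP's problem), with one continuous predictor and one factor.

```r
# iris ships with R; Petal.Length is continuous, Species is a factor
fit <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# The design matrix shows how the factor is expanded into dummy columns
X <- model.matrix(fit)
colnames(X)

# One slope for the continuous predictor, one offset per non-reference level
coef(fit)

# ANOVA-style table for the same fit
anova(fit)
```

With the default treatment contrasts, the first factor level is absorbed into the intercept and each remaining level gets its own coefficient, which is the same coding that produces `predB` to `predH` in the question.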

The above is conditional upon the distribution of the errors and hence the response. You'll need to specify more about the exact problem for a more informed response.

## Update

In light of the OP posting data, we can show the equivalence. In the below, `model` is as per the OP's question, whilst `model2` is the same model fitted via `lm()` instead of `glm()` (that is, `model2 <- lm(res ~ pred, data = df)`):

```
> anova(model, test = "F")
Analysis of Deviance Table

Model: gaussian, link: identity

Response: res

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev      F Pr(>F)
NULL                    56     1277.6
pred  7   227.23        49     1050.3 1.5144 0.1846
> anova(model2)
Analysis of Variance Table

Response: res
          Df  Sum Sq Mean Sq F value Pr(>F)
pred       7  227.23  32.462  1.5144 0.1846
Residuals 49 1050.35  21.436
```

Notice that the `Deviance` of `model` is the same as the sum of squares in `model2`, and that the rest of the important numbers, `F` and its p-value, are the same. Likewise, the estimated values of the model coefficients are the same:

```
> coef(model)
  (Intercept)         predB         predC         predD         predE
 2.300000e+01  2.625000e+00  4.714286e+00  3.500000e+00  6.333333e+00
        predF         predG         predH
-8.283393e-15  1.000000e+00  3.000000e+00
> coef(model2)
  (Intercept)         predB         predC         predD         predE
 2.300000e+01  2.625000e+00  4.714286e+00  3.500000e+00  6.333333e+00
        predF         predG         predH
-8.283393e-15  1.000000e+00  3.000000e+00
> all.equal(coef(model), coef(model2))
[1] TRUE
```
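The `F` statistic shared by the two ANOVA tables above can also be reproduced by hand from the printed sums of squares. This is only a sketch using the rounded values from the output, so the last digits may differ slightly from what R reports; the variable names are made up for illustration.

```r
# Values copied from the two ANOVA tables above (rounded as printed)
ss_pred  <- 227.23   # deviance / sum of squares for pred
df_pred  <- 7
ss_resid <- 1050.35  # residual deviance / residual sum of squares
df_resid <- 49

ms_pred  <- ss_pred / df_pred     # mean square for pred
ms_resid <- ss_resid / df_resid   # residual mean square
f_stat   <- ms_pred / ms_resid
p_value  <- pf(f_stat, df_pred, df_resid, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```

The residual mean square computed here is also the dispersion parameter that `summary(model)` reports for the gaussian family, which is another way of seeing that both fits estimate the same error variance.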

There appear to be some small differences between groups A and C and between groups A and E, but overall `pred` does not explain a significant amount of variance in the response.
