Solved – Model for continuous response and a mix of continuous and categorical predictors

Which statistical model is appropriate when the response is continuous and the predictors are a mix of continuous and categorical? What is the disadvantage in using GLM combined with gaussian family?

Here is my dataset and model in R:

    df <- structure(list(as.factor.pred. = structure(c(1L, 1L, 5L, 3L,
        2L, 8L, 3L, 5L, 2L, 2L, 3L, 2L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
        1L, 2L, 4L, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 2L, 1L, 7L, 3L, 3L,
        6L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 2L, 2L, 1L, 1L, 2L, 1L, 1L,
        1L, 1L, 1L, 1L, 2L), .Label = c("A", "B", "C", "D", "E", "F",
        "G", "H"), class = "factor"), res = c(33, 33, 37, 32, 32, 26,
        33, 28, 25, 34, 29, 35, 26, 20, 27, 19, 30, 33, 27, 24, 26, 28,
        27, 23, 26, 25, 24, 26, 24, 25, 21, 21, 23, 24, 23, 27, 23, 20,
        21, 22, 22, 22, 22, 23, 23, 21, 22, 21, 21, 23, 23, 18, 20, 18,
        18, 18, 19)), .Names = c("as.factor.pred.", "res"), row.names = c(NA,
        -57L), class = "data.frame")

    names(df)[1] <- "pred" ## fix up the names to match formula

    model <- glm(res ~ pred, data = df)

    summary(model)

    Call:
    glm(formula = res ~ pred, data = df)

    Deviance Residuals:
       Min      1Q  Median      3Q     Max
    -6.625  -3.000  -0.625   2.000  10.000

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  2.300e+01  9.080e-01  25.331   <2e-16 ***
    predB        2.625e+00  1.471e+00   1.784   0.0806 .
    predC        4.714e+00  1.971e+00   2.391   0.0207 *
    predD        3.500e+00  3.397e+00   1.030   0.3080
    predE        6.333e+00  2.823e+00   2.243   0.0294 *
    predF       -8.283e-15  4.718e+00   0.000   1.0000
    predG        1.000e+00  4.718e+00   0.212   0.8330
    predH        3.000e+00  4.718e+00   0.636   0.5278
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for gaussian family taken to be 21.43562)

        Null deviance: 1277.6  on 56  degrees of freedom
    Residual deviance: 1050.3  on 49  degrees of freedom
    AIC: 345.85

Best Answer

If the model you wish to fit is linear in its parameters and the errors are Gaussian with constant variance, then a linear model is a reasonable place to start, for example via the lm() function in R.

As the linear model is a special case of the GLM, there is no real difference between the two, although minor discrepancies can show up because the two implementations use different algorithms. Fitting the same model in R via glm() should give the same fit (coefficients) up to machine precision, or with small differences in the last few decimal places. However, fitting via glm(..., family = gaussian) is less efficient than fitting via lm(), because glm() goes through iteratively reweighted least squares whereas lm() solves the least-squares problem directly.
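The equivalence and the extra work done by glm() can be checked on a small example. The sketch below uses simulated data; the names dat, fit_lm and fit_glm are made up for illustration and are not from the original post, and the timings will vary by machine.

```r
# Simulated data (an assumption for illustration, not the OP's data)
set.seed(1)
n <- 5000
dat <- data.frame(
  x = rnorm(n),                                        # continuous predictor
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))  # categorical predictor
)
dat$y <- 1 + 2 * dat$x + rnorm(n)

# Same model fitted two ways
fit_lm  <- lm(y ~ x + g, data = dat)
fit_glm <- glm(y ~ x + g, family = gaussian, data = dat)

# Coefficients should agree up to numerical tolerance
print(all.equal(coef(fit_lm), coef(fit_glm)))

# glm() reports how many IRLS iterations it used; lm() does not iterate
print(fit_glm$iter)

# Rough timing comparison over repeated fits
print(system.time(for (i in 1:20) lm(y ~ x + g, data = dat)))
print(system.time(for (i in 1:20) glm(y ~ x + g, family = gaussian, data = dat)))
```

For a model this small the timing gap is unlikely to matter in practice; it becomes more relevant when many models are fitted in a loop.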

Note the lm() function in R fits the so-called General Linear Model, the fusion of "regression" and ANOVA. Hence it is fully capable of dealing with continuous and factor variables.
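As a minimal sketch of a model that mixes both kinds of predictor, the example below uses R's built-in iris data, which is not part of the original question, with one continuous predictor (Petal.Length) and one factor (Species). The name fit is illustrative.

```r
# Continuous response, one continuous and one categorical predictor
fit <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# The factor is expanded into indicator (dummy) columns automatically;
# with the default treatment contrasts the first level is the baseline
print(head(model.matrix(fit), 3))

# One slope for the continuous predictor, one offset per non-baseline level
print(coef(fit))

# Sequential ANOVA table covering both terms
print(anova(fit))
```

The same formula interface works with glm(), so nothing about the mix of predictor types forces a choice between the two functions.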

The above is conditional upon the distribution of the errors and hence the response. You'll need to specify more about the exact problem for a more informed response.

Update

In light of the OP posting data, we can show the equivalence. In what follows, model is as per the OP's question, whilst model2 is the same model fitted via lm() instead of glm().
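The call that creates model2 is not shown in the answer. A minimal sketch, assuming it simply reuses the OP's formula and data, would be the following; the data frame is rebuilt here from the values in the question so that the block runs on its own.

```r
# Rebuild the OP's data (values copied from the question)
pred_codes <- c(1, 1, 5, 3, 2, 8, 3, 5, 2, 2, 3, 2, 4, 1, 1, 1, 1, 2, 2, 2,
                1, 2, 4, 1, 1, 1, 1, 3, 3, 2, 2, 2, 1, 7, 3, 3, 6, 2, 1, 1,
                1, 1, 1, 1, 5, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2)
df <- data.frame(
  pred = factor(pred_codes, levels = 1:8, labels = LETTERS[1:8]),
  res  = c(33, 33, 37, 32, 32, 26, 33, 28, 25, 34, 29, 35, 26, 20, 27, 19,
           30, 33, 27, 24, 26, 28, 27, 23, 26, 25, 24, 26, 24, 25, 21, 21,
           23, 24, 23, 27, 23, 20, 21, 22, 22, 22, 22, 23, 23, 21, 22, 21,
           21, 23, 23, 18, 20, 18, 18, 18, 19)
)

# The OP's model: glm() defaults to the gaussian family with identity link
model <- glm(res ~ pred, data = df)

# Assumed definition of model2: the same formula fitted with lm()
model2 <- lm(res ~ pred, data = df)
```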

    > anova(model, test = "F")
    Analysis of Deviance Table

    Model: gaussian, link: identity

    Response: res

    Terms added sequentially (first to last)

         Df Deviance Resid. Df Resid. Dev      F Pr(>F)
    NULL                    56     1277.6
    pred  7   227.23        49     1050.3 1.5144 0.1846
    > anova(model2)
    Analysis of Variance Table

    Response: res
              Df  Sum Sq Mean Sq F value Pr(>F)
    pred       7  227.23  32.462  1.5144 0.1846
    Residuals 49 1050.35  21.436

Notice that the deviances reported for model match the sums of squares reported for model2, and that the rest of the important numbers, F and its p-value, are the same. Likewise, the estimated values of the model coefficients are the same:

    > coef(model)
      (Intercept)         predB         predC         predD         predE
     2.300000e+01  2.625000e+00  4.714286e+00  3.500000e+00  6.333333e+00
            predF         predG         predH
    -8.283393e-15  1.000000e+00  3.000000e+00
    > coef(model2)
      (Intercept)         predB         predC         predD         predE
     2.300000e+01  2.625000e+00  4.714286e+00  3.500000e+00  6.333333e+00
            predF         predG         predH
    -8.283393e-15  1.000000e+00  3.000000e+00
    > all.equal(coef(model), coef(model2))
    [1] TRUE
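The agreement between the deviance table and the ANOVA table is not a coincidence. As a short worked derivation (the symbols D, RSS and the subscripts 0 and 1 are notation introduced here, not taken from the original answer): for the gaussian family with identity link, the deviance reduces to the residual sum of squares, so the F statistic built from deviances is the usual ANOVA F statistic.

```latex
% Gaussian deviance equals the residual sum of squares
D = \sum_{i=1}^{n} (y_i - \hat{\mu}_i)^2 = \mathrm{RSS}

% F statistic comparing the null model (0) with the model containing pred (1),
% using the figures from the two tables above
F = \frac{(D_0 - D_1) / (\mathrm{df}_0 - \mathrm{df}_1)}{D_1 / \mathrm{df}_1}
  = \frac{227.23 / 7}{1050.35 / 49}
  \approx \frac{32.462}{21.436}
  \approx 1.514
```

The denominator, 21.436, is also the dispersion parameter that summary(model) reports for the gaussian family.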

There appear to be some small differences between groups A and C and between groups A and E, but overall pred does not explain a significant amount of the variance in the response.
