I have the following dataset, which is a modified version of the `birthwt` dataset from MASS.

```
> str(bwdf)
'data.frame':	189 obs. of  9 variables:
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
```

I get the following model with `bwt` as the dependent variable and all other variables as predictors:

```
> mod = lm(bwt ~ ., bwdf)
> summary(mod)

Call:
lm(formula = bwt ~ ., data = bwdf)

Residuals:
     Min       1Q   Median       3Q      Max
-1825.26  -435.21    55.91   473.46  1701.20

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)
(Intercept) 2927.962    312.904   9.357 < 0.0000000000000002 ***
age           -3.570      9.620  -0.371             0.711012
lwt            4.354      1.736   2.509             0.013007 *
race2       -488.428    149.985  -3.257             0.001349 **
race3       -355.077    114.753  -3.094             0.002290 **
smoke1      -352.045    106.476  -3.306             0.001142 **
ptl          -48.402    101.972  -0.475             0.635607
ht1         -592.827    202.321  -2.930             0.003830 **
ui1         -516.081    138.885  -3.716             0.000271 ***
ftv          -14.058     46.468  -0.303             0.762598
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 650.3 on 179 degrees of freedom
Multiple R-squared:  0.2427,	Adjusted R-squared:  0.2047
F-statistic: 6.376 on 9 and 179 DF,  p-value: 0.00000007891
```

To see the relative importance of the predictors, I convert all factor predictors to numeric and then standardize all variables (including the dependent variable `bwt`) with R's `scale()` function, so that each has mean 0 and SD 1. I then get the following model, which is very similar to the one above:
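For reference, the procedure described above can be sketched as follows. The reconstruction of `bwdf` from `MASS::birthwt` is an assumption; only the column selection and factor coding are inferred from the `str()` output shown earlier.

```r
library(MASS)

## Rebuild a data frame like bwdf (assumed): drop 'low', make the
## categorical columns factors, as in the str() output above.
bwdf <- birthwt[, c("age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv", "bwt")]
bwdf[c("race", "smoke", "ht", "ui")] <-
  lapply(bwdf[c("race", "smoke", "ht", "ui")], factor)

## The questioned approach: coerce every factor back to numeric,
## then standardize all columns (mean 0, SD 1) and refit.
num <- as.data.frame(lapply(bwdf, function(x) as.numeric(as.character(x))))
std <- as.data.frame(scale(num))
mod2 <- lm(bwt ~ ., data = std)
```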

```
> summary(mod2)

Residuals:
     Min       1Q   Median       3Q      Max
-2.49104 -0.58528  0.02234  0.67479  2.26820

Coefficients:
                          Estimate             Std. Error t value Pr(>|t|)
(Intercept) -0.0000000000000001826  0.0655292090228068308   0.000  1.00000
age         -0.0019314510221109923  0.0697181015050541281  -0.028  0.97793
lwt          0.1440511878086048747  0.0712847440204966709   2.021  0.04478 *
race        -0.2373758768039808398  0.0727076693963745607  -3.265  0.00131 **
smoke       -0.2405662228364832678  0.0721568953892822995  -3.334  0.00104 **
ptl         -0.0346067011967166188  0.0696837035395908994  -0.497  0.62006
ht          -0.2013869038725481508  0.0685136586364621242  -2.939  0.00372 **
ui          -0.2497246074479218259  0.0685204480986914971  -3.645  0.00035 ***
ftv         -0.0225679283419899721  0.0681802240163703471  -0.331  0.74103
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9009 on 180 degrees of freedom
Multiple R-squared:  0.223,	Adjusted R-squared:  0.1884
F-statistic: 6.456 on 8 and 180 DF,  p-value: 0.0000002232
```

I plot its coefficients (the `Estimate` column) to see the relative importance of the different predictor variables.

What are the drawbacks of this approach of converting factor variables to standardized numerics? Thanks in advance.


#### Best Answer

You almost certainly should not try to standardize factor variables for this purpose.

Most of your factor variables have only 2 levels, so the regression coefficients in your first model directly convey the contribution of each factor to the dependent variable, `bwt`. That is what most people would expect to learn from a regression, and it is the most natural comparison among factors. Regression coefficients for standardized versions of those variables would instead have units of (change in birth weight)/(standard deviation of factor levels), which is much harder to think about.
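To make the units issue concrete, here is a small sketch (using `MASS::birthwt` directly) showing that for a 0/1 predictor, standardizing both variables simply rescales the raw coefficient by sd(x)/sd(y), so the result is no longer in grams:

```r
library(MASS)

x <- birthwt$smoke   # 0/1 smoking indicator
y <- birthwt$bwt     # birth weight in grams

## Raw slope: mean difference in grams, smokers vs non-smokers.
b_raw <- coef(lm(y ~ x))["x"]

## Standardized slope: SDs of bwt per SD of the 0/1 indicator.
b_std <- coef(lm(scale(y) ~ scale(x)))[2]

## The standardized slope is just the raw slope rescaled.
all.equal(unname(b_std), unname(b_raw) * sd(x) / sd(y))  # TRUE
```

The rescaled coefficient answers "how many SDs of `bwt` per SD of smoking status", which has no direct clinical meaning, whereas `b_raw` is the familiar between-group difference in grams.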

If a variable such as `race` has more than 2 levels, the results of "normalization" will differ depending on the ordering of the races in the list of factor levels. You certainly don't want that.
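This ordering dependence is easy to demonstrate: recoding the same 3-level factor with its levels in a different order changes the numeric scores that `as.numeric()` produces, and hence the "standardized" coefficient (again using `MASS::birthwt` as a stand-in for the question's data):

```r
library(MASS)

race_a <- factor(birthwt$race)                      # levels "1","2","3"
race_b <- factor(race_a, levels = c("2", "1", "3")) # same data, reordered levels

## as.numeric() on a factor returns the level index, so the scores differ.
x_a <- scale(as.numeric(race_a))
x_b <- scale(as.numeric(race_b))

## Same underlying variable, different "standardized" coefficients.
coef(lm(scale(birthwt$bwt) ~ x_a))[2]
coef(lm(scale(birthwt$bwt) ~ x_b))[2]
```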

Normalizing a factor variable might make sense if its multiple levels are a reasonable approximation to an ordered continuous variable. See the extensive discussion on this Cross Validated page, which has links to further discussion.
