I have the following dataset, which is a modified version of the `birthwt` dataset from MASS.

```
> str(bwdf)
'data.frame':	189 obs. of  9 variables:
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
```

I get the following model with `bwt` as the dependent variable and all other variables as predictors:

```
> mod = lm(bwt ~ ., bwdf)
> summary(mod)

Call:
lm(formula = bwt ~ ., data = bwdf)

Residuals:
     Min       1Q   Median       3Q      Max
-1825.26  -435.21    55.91   473.46  1701.20

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)
(Intercept) 2927.962    312.904   9.357 < 0.0000000000000002 ***
age           -3.570      9.620  -0.371             0.711012
lwt            4.354      1.736   2.509             0.013007 *
race2       -488.428    149.985  -3.257             0.001349 **
race3       -355.077    114.753  -3.094             0.002290 **
smoke1      -352.045    106.476  -3.306             0.001142 **
ptl          -48.402    101.972  -0.475             0.635607
ht1         -592.827    202.321  -2.930             0.003830 **
ui1         -516.081    138.885  -3.716             0.000271 ***
ftv          -14.058     46.468  -0.303             0.762598
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 650.3 on 179 degrees of freedom
Multiple R-squared:  0.2427,	Adjusted R-squared:  0.2047
F-statistic: 6.376 on 9 and 179 DF,  p-value: 0.00000007891
```

To see the relative importance of the predictors, I convert all factor predictors to numeric and then standardize all variables (including the dependent variable `bwt`) with R's `scale()` function, so that each has mean 0 and SD 1. I then get the following model, which is very similar to the one above:
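For reference, the procedure described above can be sketched as follows. The reconstruction of `bwdf` from `MASS::birthwt` is an assumption; only the column selection and factor coding are inferred from the `str()` output shown earlier.

```r
library(MASS)

## Rebuild a data frame like bwdf (assumed): drop 'low', make the
## categorical columns factors, as in the str() output above.
bwdf <- birthwt[, c("age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv", "bwt")]
bwdf[c("race", "smoke", "ht", "ui")] <-
  lapply(bwdf[c("race", "smoke", "ht", "ui")], factor)

## The questioned approach: coerce every factor back to numeric,
## then standardize all columns (mean 0, SD 1) and refit.
num <- as.data.frame(lapply(bwdf, function(x) as.numeric(as.character(x))))
std <- as.data.frame(scale(num))
mod2 <- lm(bwt ~ ., data = std)
```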

```
> summary(mod2)

Residuals:
     Min       1Q   Median       3Q      Max
-2.49104 -0.58528  0.02234  0.67479  2.26820

Coefficients:
                          Estimate             Std. Error t value Pr(>|t|)
(Intercept) -0.0000000000000001826  0.0655292090228068308   0.000  1.00000
age         -0.0019314510221109923  0.0697181015050541281  -0.028  0.97793
lwt          0.1440511878086048747  0.0712847440204966709   2.021  0.04478 *
race        -0.2373758768039808398  0.0727076693963745607  -3.265  0.00131 **
smoke       -0.2405662228364832678  0.0721568953892822995  -3.334  0.00104 **
ptl         -0.0346067011967166188  0.0696837035395908994  -0.497  0.62006
ht          -0.2013869038725481508  0.0685136586364621242  -2.939  0.00372 **
ui          -0.2497246074479218259  0.0685204480986914971  -3.645  0.00035 ***
ftv         -0.0225679283419899721  0.0681802240163703471  -0.331  0.74103
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9009 on 180 degrees of freedom
Multiple R-squared:  0.223,	Adjusted R-squared:  0.1884
F-statistic: 6.456 on 8 and 180 DF,  p-value: 0.0000002232
```

I plot its coefficients (the `Estimate` column) to see the relative importance of the different predictor variables.

What are the drawbacks of this approach of converting factor variables to standardized numerics? Thanks in advance.


#### Best Answer

You almost certainly should not try to standardize factor variables for this purpose.

Most of your factor variables have only 2 levels, so the regression coefficients in your first model directly convey the contribution of each factor to the dependent variable, `bwt`. That is what most people would expect to learn from a regression, and it is the most natural comparison among factors. Regression coefficients for standardized versions of those variables would instead have units of (change in birth weight)/(standard deviation of factor levels), which is much harder to think about.
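To make the units issue concrete, here is a small sketch (using `MASS::birthwt` directly) showing that for a 0/1 predictor, standardizing both variables simply rescales the raw coefficient by sd(x)/sd(y), so the result is no longer in grams:

```r
library(MASS)

x <- birthwt$smoke   # 0/1 smoking indicator
y <- birthwt$bwt     # birth weight in grams

## Raw slope: mean difference in grams, smokers vs non-smokers.
b_raw <- coef(lm(y ~ x))["x"]

## Standardized slope: SDs of bwt per SD of the 0/1 indicator.
b_std <- coef(lm(scale(y) ~ scale(x)))[2]

## The standardized slope is just the raw slope rescaled.
all.equal(unname(b_std), unname(b_raw) * sd(x) / sd(y))  # TRUE
```

The rescaled coefficient answers "how many SDs of `bwt` per SD of smoking status", which has no direct clinical meaning, whereas `b_raw` is the familiar between-group difference in grams.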

If a variable such as `race` has more than 2 levels, the results of "normalization" will differ depending on the ordering of the races in the list of factor levels. You certainly don't want that.
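This ordering dependence is easy to demonstrate: recoding the same 3-level factor with its levels in a different order changes the numeric scores that `as.numeric()` produces, and hence the "standardized" coefficient (again using `MASS::birthwt` as a stand-in for the question's data):

```r
library(MASS)

race_a <- factor(birthwt$race)                      # levels "1","2","3"
race_b <- factor(race_a, levels = c("2", "1", "3")) # same data, reordered levels

## as.numeric() on a factor returns the level index, so the scores differ.
x_a <- scale(as.numeric(race_a))
x_b <- scale(as.numeric(race_b))

## Same underlying variable, different "standardized" coefficients.
coef(lm(scale(birthwt$bwt) ~ x_a))[2]
coef(lm(scale(birthwt$bwt) ~ x_b))[2]
```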

Normalizing a factor variable might make sense if its multiple levels are a reasonable approximation to an ordered continuous variable. See the extensive discussion on this Cross Validated page, which has links to further discussion.
