# Solved – Using standardized coefficients for relative importance with factor predictors

I have the following dataset, which is modified from the birthwt dataset in MASS.

```
> str(bwdf)
'data.frame':   189 obs. of  9 variables:
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
```

I get the following model with bwt as the dependent variable and all other variables as predictors:

```
> mod = lm(bwt ~ ., bwdf)
> summary(mod)

Call:
lm(formula = bwt ~ ., data = bwdf)

Residuals:
     Min       1Q   Median       3Q      Max
-1825.26  -435.21    55.91   473.46  1701.20

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)
(Intercept) 2927.962    312.904   9.357 < 0.0000000000000002 ***
age           -3.570      9.620  -0.371             0.711012
lwt            4.354      1.736   2.509             0.013007 *
race2       -488.428    149.985  -3.257             0.001349 **
race3       -355.077    114.753  -3.094             0.002290 **
smoke1      -352.045    106.476  -3.306             0.001142 **
ptl          -48.402    101.972  -0.475             0.635607
ht1         -592.827    202.321  -2.930             0.003830 **
ui1         -516.081    138.885  -3.716             0.000271 ***
ftv          -14.058     46.468  -0.303             0.762598
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 650.3 on 179 degrees of freedom
Multiple R-squared:  0.2427,    Adjusted R-squared:  0.2047
F-statistic: 6.376 on 9 and 179 DF,  p-value: 0.00000007891
```

To see the relative importance of the predictors, I convert all factor predictors to numeric and then standardize all variables (including the dependent variable bwt) with R's scale() function, so that each has mean 0 and SD 1. I then get the following model, which is very similar to the one above:

```
> summary(mod2)

Residuals:
     Min       1Q   Median       3Q      Max
-2.49104 -0.58528  0.02234  0.67479  2.26820

Coefficients:
                           Estimate             Std. Error t value Pr(>|t|)
(Intercept) -0.0000000000000001826  0.0655292090228068308   0.000  1.00000
age         -0.0019314510221109923  0.0697181015050541281  -0.028  0.97793
lwt          0.1440511878086048747  0.0712847440204966709   2.021  0.04478 *
race        -0.2373758768039808398  0.0727076693963745607  -3.265  0.00131 **
smoke       -0.2405662228364832678  0.0721568953892822995  -3.334  0.00104 **
ptl         -0.0346067011967166188  0.0696837035395908994  -0.497  0.62006
ht          -0.2013869038725481508  0.0685136586364621242  -2.939  0.00372 **
ui          -0.2497246074479218259  0.0685204480986914971  -3.645  0.00035 ***
ftv         -0.0225679283419899721  0.0681802240163703471  -0.331  0.74103
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9009 on 180 degrees of freedom
Multiple R-squared:  0.223,     Adjusted R-squared:  0.1884
F-statistic: 6.456 on 8 and 180 DF,  p-value: 0.0000002232
```
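For reproducibility, the transformation I describe can be sketched as follows. The construction of bwdf from MASS::birthwt shown here is an assumption, since I did not show it above:

```r
library(MASS)  # for the birthwt data

# Rebuild a data frame like bwdf (assumed construction; the exact
# modification of birthwt used above is not shown)
bwdf <- with(birthwt, data.frame(
  age, lwt,
  race = factor(race), smoke = factor(smoke),
  ptl, ht = factor(ht), ui = factor(ui), ftv, bwt
))

# Convert factors to numeric codes, then standardize every column
num <- data.frame(lapply(bwdf, function(x) as.numeric(as.character(x))))
sdf <- as.data.frame(scale(num))  # each column now has mean 0, SD 1

mod2 <- lm(bwt ~ ., data = sdf)
summary(mod2)
```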

I plot its coefficients (the Estimate column) to see the relative importance of the different predictors:
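The plot itself is not shown here; a minimal sketch of such a coefficient plot, using the rounded estimates from the summary above, would be:

```r
# Standardized coefficient estimates from summary(mod2), rounded
est <- c(age = -0.0019, lwt = 0.1441, race = -0.2374, smoke = -0.2406,
         ptl = -0.0346, ht = -0.2014, ui = -0.2497, ftv = -0.0226)

# Horizontal bar plot of absolute standardized coefficients,
# sorted so the "most important" predictor appears at the top
barplot(sort(abs(est)), horiz = TRUE, las = 1,
        xlab = "|standardized coefficient|")
```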

What are the drawbacks of this approach which includes converting factor variables to standardized numerics? Thanks in advance.


#### Best Answer

You almost certainly should not try to standardize factor variables for this purpose.

Most of your factor variables have only 2 levels, so the regression coefficients in your first model simply and directly convey the contribution of each factor to the dependent variable, `bwt`, in its own units (grams). That is what most people would expect to learn from a regression, and the most natural comparison among factors. Regression coefficients for standardized versions of those variables would instead have units of (change in birth weight) per (standard deviation of the 0/1 indicator), and that standard deviation depends on how the cases happen to split between the two levels. That's much harder to think about.
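A toy illustration of that rescaling, with hypothetical data rather than your bwdf:

```r
# For a 2-level factor, the raw coefficient is the adjusted group
# difference in the response's own units
set.seed(1)
g <- factor(rep(0:1, each = 50))
y <- 3000 + 200 * (g == "1") + rnorm(100, sd = 50)
b_raw <- coef(lm(y ~ g))["g1"]  # about 200 grams: directly interpretable

# After converting the factor to 0/1 and standardizing both sides,
# the same effect is rescaled by sd(indicator) / sd(y)
gz <- scale(as.numeric(g == "1"))
yz <- scale(y)
b_std <- coef(lm(yz ~ gz))[2]

b_std                                        # no longer in grams
b_raw * sd(as.numeric(g == "1")) / sd(y)     # identical value
```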

If a variable like `race` has more than 2 levels, converting it to a single number assigns the codes 1, 2, 3, … in the order of the factor levels, so the results of "normalization" will differ depending on the ordering of races in the list of factor levels, and implicitly assume the levels are equally spaced. You certainly don't want that.
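A short sketch of that dependence on level order, again with hypothetical data:

```r
# With 3 levels, the numeric code for each observation depends on
# where its level sits in the factor's level order
set.seed(2)
race <- factor(sample(c("1", "2", "3"), 120, replace = TRUE))
y <- 3000 - 400 * (race == "2") - 300 * (race == "3") + rnorm(120, sd = 100)

b1 <- coef(lm(y ~ as.numeric(race)))[2]                     # levels 1, 2, 3
b2 <- coef(lm(y ~ as.numeric(factor(race,
              levels = c("1", "3", "2")))))[2]              # levels reordered

c(b1, b2)  # different slopes from the same data and the same variable
```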

Normalizing a factor variable might make sense if the multiple levels of the factor are a reasonable approximation to an ordered, roughly evenly spaced continuous scale. See the extensive discussion on this Cross Validated page, which has links to further discussion.
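For that special case, a hypothetical sketch where numeric coding is defensible:

```r
# Levels that approximate an ordered, equally spaced scale
# (hypothetical dose example, not from the birthwt data)
set.seed(3)
dose <- factor(rep(c("low", "medium", "high"), each = 40),
               levels = c("low", "medium", "high"))
y <- 10 + 2 * as.numeric(dose) + rnorm(120)

fit <- lm(y ~ as.numeric(dose))
coef(fit)[2]  # roughly the change in y per step up the dose scale
```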
