Using standardized coefficients for relative importance with factor predictors

I have the following dataset, which is modified from the birthwt dataset in MASS.

> str(bwdf)
'data.frame':   189 obs. of  9 variables:
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

I fit the following model with bwt as the dependent variable and all other variables as predictors:

> mod = lm(bwt ~ ., bwdf)
> summary(mod)

Call:
lm(formula = bwt ~ ., data = bwdf)

Residuals:
     Min       1Q   Median       3Q      Max
-1825.26  -435.21    55.91   473.46  1701.20

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)
(Intercept) 2927.962    312.904   9.357 < 0.0000000000000002 ***
age           -3.570      9.620  -0.371             0.711012
lwt            4.354      1.736   2.509             0.013007 *
race2       -488.428    149.985  -3.257             0.001349 **
race3       -355.077    114.753  -3.094             0.002290 **
smoke1      -352.045    106.476  -3.306             0.001142 **
ptl          -48.402    101.972  -0.475             0.635607
ht1         -592.827    202.321  -2.930             0.003830 **
ui1         -516.081    138.885  -3.716             0.000271 ***
ftv          -14.058     46.468  -0.303             0.762598
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 650.3 on 179 degrees of freedom
Multiple R-squared:  0.2427,    Adjusted R-squared:  0.2047
F-statistic: 6.376 on 9 and 179 DF,  p-value: 0.00000007891

To see the relative importance of the predictors, I convert all factor predictors to numeric and then standardize all variables (including the dependent variable bwt) using the scale() function in R, so that each has mean 0 and SD 1. Then I get the following model, which is very similar to the model above:

> summary(mod2)

Residuals:
     Min       1Q   Median       3Q      Max
-2.49104 -0.58528  0.02234  0.67479  2.26820

Coefficients:
                           Estimate             Std. Error t value Pr(>|t|)
(Intercept) -0.0000000000000001826  0.0655292090228068308   0.000  1.00000
age         -0.0019314510221109923  0.0697181015050541281  -0.028  0.97793
lwt          0.1440511878086048747  0.0712847440204966709   2.021  0.04478 *
race        -0.2373758768039808398  0.0727076693963745607  -3.265  0.00131 **
smoke       -0.2405662228364832678  0.0721568953892822995  -3.334  0.00104 **
ptl         -0.0346067011967166188  0.0696837035395908994  -0.497  0.62006
ht          -0.2013869038725481508  0.0685136586364621242  -2.939  0.00372 **
ui          -0.2497246074479218259  0.0685204480986914971  -3.645  0.00035 ***
ftv         -0.0225679283419899721  0.0681802240163703471  -0.331  0.74103
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9009 on 180 degrees of freedom
Multiple R-squared:  0.223,     Adjusted R-squared:  0.1884
F-statistic: 6.456 on 8 and 180 DF,  p-value: 0.0000002232
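For reference, the standardization step described above can be reproduced along the following lines (the exact code used to build mod2 is not shown in the question, so this is an assumed reconstruction):

```r
# Convert every factor column to its numeric level codes, then standardize
# all columns (including the response bwt) to mean 0 and SD 1.
bwdf_num <- as.data.frame(lapply(bwdf, as.numeric))  # factors -> level codes 1, 2, ...
bwdf_std <- as.data.frame(scale(bwdf_num))           # center and scale each column
mod2 <- lm(bwt ~ ., bwdf_std)
summary(mod2)
```

Note that as.numeric() on a factor returns its internal level codes, which is exactly the step the answer below takes issue with.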

I plot its coefficients (Estimate) to see the relative importance of the different predictor variables:

[bar plot of the standardized coefficients, omitted]

What are the drawbacks of this approach, which involves converting factor variables to standardized numerics? Thanks in advance.

You almost certainly should not try to standardize factor variables for this purpose.

Most of your factor variables have only 2 levels, so the regression coefficients in your first model simply and directly convey the contribution of each factor to the dependent variable, bwt. That is what most people would expect to learn from a regression, and it is the most natural comparison among factors. Regression coefficients for standardized versions of those variables would instead have units of (change in body weight)/(standard deviation of factor levels), which is much harder to think about.
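To make this concrete, the SD of a 0/1 indicator depends only on how common the level is, so a standardized coefficient mixes the effect size with the prevalence of the level. A small illustrative sketch (made-up data, not from bwdf):

```r
# The sample SD of a binary indicator is sqrt(p * (1 - p) * n / (n - 1)),
# so it is driven entirely by the split between the two groups.
x <- c(rep(0, 90), rep(1, 10))   # level present in 10% of cases
sd(x)                            # ~0.30
y <- c(rep(0, 50), rep(1, 50))   # level present in 50% of cases
sd(y)                            # ~0.50
# Dividing by these SDs means the same raw group difference produces
# different "standardized" slopes depending only on prevalence.
```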

If a variable like race has more than 2 levels, the results of "normalization" will differ depending on the ordering of the levels in the factor. You certainly don't want that.
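You can see this ordering dependence directly: the numeric codes of a factor follow the order of its levels, so releveling changes the coding and hence any "standardized" coefficient built from it. A small sketch with a toy race variable:

```r
# Numeric codes follow the level ordering of the factor.
race  <- factor(c("1", "2", "3", "1", "3"))
as.numeric(race)                  # 1 2 3 1 3
race2 <- relevel(race, ref = "3") # levels reordered to "3", "1", "2"
as.numeric(race2)                 # 2 3 1 2 1
# scale() applied to these two codings gives different z-scores, so the
# resulting coefficient depends on an arbitrary ordering of the races.
```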

Normalizing a factor variable might make sense if the multiple levels of the factor are a reasonable approximation to an ordered continuous variable. See the extensive discussion on this Cross Validated page, which has links to further discussion.
