I'm looking at the Lending Club data from Kaggle and I'm just building a pretty simple model to predict defaults.

The data contains a large number of both continuous and categorical variables (I have converted the categorical ones to binary dummy variables), and now I want to look at doing some dimensionality reduction on it.

I'm unsure whether it will add any value, but the sheer number of continuous variables makes me feel that an approach like this could be useful.

**Background:** I'm familiar with the PCA approach but not any other approaches that deal with categorical data.

So in terms of sticking with my strengths, **could I use the PCA approach to reduce the dimensionality of the continuous data**, combine the dimensionality reduced data with the dummy data and then run my model over the top?

I know we might be getting into some scaling issues (as the PCA will require scaling but the dummy variables won't), but **are there any theoretical issues you can see cropping up?**
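For concreteness, the pipeline I have in mind looks roughly like this (a minimal sketch with made-up data standing in for the Lending Club columns; the column counts and component count are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X_cont = rng.normal(size=(n, 20))                      # continuous predictors
X_dummy = rng.integers(0, 2, size=(n, 5)).astype(float)  # pre-coded 0/1 dummies
y = rng.integers(0, 2, size=n)                         # binary default indicator

# Standardize, then reduce only the continuous block
Z = StandardScaler().fit_transform(X_cont)
pcs = PCA(n_components=5).fit_transform(Z)

# Recombine the components with the untouched dummies and fit on top
X_final = np.hstack([pcs, X_dummy])
model = LogisticRegression().fit(X_final, y)
print(X_final.shape)  # (500, 10)
```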


#### Best Answer

If your interest is in data reduction, PCA, LASSO, and ridge regression can all handle categorical predictors in principle. The default is typically to code the dummies as 0/1 numeric variables and standardize them like the continuous variables. Principal-components regression and ridge regression are fundamentally similar: the former makes an all-or-none choice of components, while the latter produces a graded combination.
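As a sketch of that "default" (my own illustration, not code from the answer), one can treat the 0/1 dummies as ordinary numeric columns, standardize everything together, and fit a ridge-penalized model:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(300, 10)),                       # continuous
               rng.integers(0, 2, size=(300, 4)).astype(float)])  # 0/1 dummies
y = rng.integers(0, 2, size=300)

# Dummies are standardized exactly like the continuous columns
Xs = StandardScaler().fit_transform(X)
clf = RidgeClassifier(alpha=1.0).fit(Xs, y)
print(clf.coef_.shape)  # (1, 14)
```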

Your idea to do PCA on the continuous variables and then combine the reduced data with the dummies is also reasonable, as I said in a recent answer. As @amoeba noted in a comment on the present page, though, in either case: "Whether it's going to end up being useful, nobody can say in advance."

For example, regarding the scaling of categorical variables coded 0/1, Frank Harrell notes in *Regression Modeling Strategies*, second edition, page 209, that "high prevalence cells [get] more shrinkage than low prevalence ones because the high prevalence cells will dominate the penalty function." That may or may not pose a problem for you.

And as he says on the following page:

> For a categorical predictor having c levels, users of ridge regression often do not recognize that the amount of shrinkage and the predicted values from the fitted model depend on how the design matrix is coded. For example, one will get different predictions depending on which cell is chosen as the reference cell when constructing dummy variables.

So if you've already pre-coded a multi-level categorical variable as multiple binaries, you have to consider that issue. Penalizing the squared difference of all *k* regression coefficients for a *k*-level categorical variable from the mean of those coefficients (including a coefficient of 0 for the reference level) can help with that, as he points out.
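A toy illustration of the dependence Harrell describes (my own made-up example, not his): fitting the same ridge model to the same three-level factor under two different reference-cell codings yields different predictions.

```python
import numpy as np
from sklearn.linear_model import Ridge

levels = np.array(["A", "A", "B", "B", "C", "C"])
y = np.array([0.0, 0.4, 1.0, 1.2, 2.0, 2.2])

def dummies(ref):
    # 0/1 columns for every level except the chosen reference cell
    others = [lv for lv in ["A", "B", "C"] if lv != ref]
    return np.column_stack([(levels == lv).astype(float) for lv in others])

pred_ref_A = Ridge(alpha=1.0).fit(dummies("A"), y).predict(dummies("A"))
pred_ref_C = Ridge(alpha=1.0).fit(dummies("C"), y).predict(dummies("C"))

# Same data, same penalty -- yet the predictions differ by coding:
print(np.round(pred_ref_A, 3))
print(np.round(pred_ref_C, 3))
```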

An alternative you might consider is penalized maximum likelihood, where instead of penalizing the regression coefficients of pre-standardized predictor variables, you maximize $$\log L - \frac{1}{2}\lambda\sum_{i=1}^{p}\left(s_i\beta_i\right)^2$$ where each $s_i$ is chosen to make $s_i\beta_i$ unitless and $\lambda$ is the penalty. That allows you to work in the original scale of the variables, and if there are variables you don't want to penalize, you can simply set their scale factors to 0 in the penalized likelihood. Harrell's `rms` package in R provides for this.
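That objective can be sketched directly for a logistic model (an illustration with made-up data, not the `rms` implementation; the scale factors and penalty value are arbitrary, and setting $s_i = 0$ leaves that coefficient unpenalized, here the intercept):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = (rng.random(n) < 0.4).astype(float)

lam = 2.0
s = np.array([0.0, 1.0, 1.0, 1.0])  # s_0 = 0: intercept left unpenalized

def neg_penalized_loglik(beta):
    eta = X @ beta
    # logistic log-likelihood, written stably via logaddexp
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    # negate: maximize log L - (lam/2) * sum((s_i * beta_i)^2)
    return -(loglik - 0.5 * lam * np.sum((s * beta) ** 2))

beta_hat = minimize(neg_penalized_loglik, np.zeros(p), method="BFGS").x
print(np.round(beta_hat, 3))
```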

A few final notes. First, I've tried to provide a generally useful answer for future reference, but I don't have experience with datasets of the scale you are considering, so I can't say how efficient these approaches will be. Second, if you are going to use cross-validation or bootstrapping to compare approaches, be sure, as always, to validate all the steps of the model-building process, including the dimensionality reduction.

### Similar Posts:

- Solved – Multiple Linear Regression – categorical variables
- Solved – How to use optimal scaling to scale an ordinal categorical variable
- Solved – Why is the categorical variable split up into separate variables in the regression model in r
- Solved – Relative importance of categorical variable in logistic regression