I have a large dataset (around 2 million records and 300 features) with a lot of missing data. Most of the independent variables are categorical (some of these variables have more than 40 valid values). The outcome is either Y or N. The Y outcome is a rare event: around 98% of outcomes are N.
I'm supposed to fit a logistic regression model to these data. I took a random sample of them, keeping the same distribution. I am working in R, but I'm new to both R and logistic regression modeling, and I have some questions:
I applied `factor` to the outcome. Do I need to apply it to every categorical independent variable? I have more than 200 variables; some of them have only 2 valid values while others have 40! Will it affect the size of the data?
Is there any advice about attribute selection? Should it be done before fitting the logistic regression model or after, depending on the results?
Is it recommended to take a biased sample in which the outcome Y is more frequent than in the original distribution of the large dataset?
There are fields like `groupId`, etc. What type of data do we consider these to be? How should we deal with them?
What other predictive models are suitable for this kind of data?
Q1: For Y/N variables you can, but it won't make any difference except to give you control over whether Y or N is the base category in the default model fitting. For 40-category variables, it's true that your model matrix will end up pretty big. More importantly, it will require a lot of data to fit. Combinatorially speaking, you need information about all combinations of independent variables, and even with the data you have, there'll be a lot of interpolation and model assumption.
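To make the Q1 points concrete, here is a minimal base R sketch on made-up data. The column names `outcome` and `region` are hypothetical, not from your dataset. It shows how the level order sets the base category and how a 40-level factor expands in the model matrix.

```r
# Toy data (hypothetical names): a Y/N outcome and a 40-level categorical predictor.
set.seed(1)
n <- 1000
region_levels <- sprintf("r%02d", 1:40)
d <- data.frame(
  outcome = factor(sample(c("N", "Y"), n, replace = TRUE, prob = c(0.98, 0.02)),
                   levels = c("N", "Y")),
  region  = factor(sample(region_levels, n, replace = TRUE), levels = region_levels)
)

# The first level is the base category; with this order glm() would model P(outcome == "Y").
levels(d$outcome)                      # "N" "Y"

# relevel() makes a different level the base category.
outcome_flipped <- relevel(d$outcome, ref = "Y")
levels(outcome_flipped)                # "Y" "N"

# A k-level factor becomes k - 1 dummy columns, plus one column for the intercept.
X <- model.matrix(~ region, data = d)
dim(X)                                 # 1000 rows, 40 columns
```

If memory rather than estimation is the immediate worry, `Matrix::sparse.model.matrix` builds the same kind of design matrix in sparse form, which may help at 2 million rows.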
Q2: The machine learning folk may have some ideas here. I dimly remember something about chi-squared and mutual information measures for selecting variables. It's also possible you could get the model fitting process to do it by using a Lasso (L1 regularization, a.k.a. a Laplace prior) on the coefficients, although I'm not sure how well current implementations scale.
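As a rough illustration of the chi-squared idea in Q2, the sketch below screens two simulated categorical predictors against a rare outcome using base R's `chisq.test`. The variables `signal` and `noise` and the outcome rates are invented for the example. This only measures marginal association, so treat it as a first filter rather than a full selection procedure.

```r
set.seed(42)
n <- 5000
# Hypothetical predictors: 'signal' is related to the outcome, 'noise' is not.
signal <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
noise  <- factor(sample(c("u", "v", "w", "x"), n, replace = TRUE))

# Outcome rate depends on 'signal' only (rates chosen for illustration).
p_y <- c(a = 0.01, b = 0.02, c = 0.10)[as.character(signal)]
y   <- factor(ifelse(runif(n) < p_y, "Y", "N"), levels = c("N", "Y"))
d   <- data.frame(y, signal, noise)

# Chi-squared test of independence between each predictor and the outcome.
# Warnings about small expected counts are suppressed in this sketch.
screen <- sapply(c("signal", "noise"), function(v) {
  suppressWarnings(chisq.test(table(d[[v]], d$y))$p.value)
})

# Smaller p-value means stronger marginal association with the outcome.
sort(screen)
```

For the Lasso route, one option is the `glmnet` package, e.g. `cv.glmnet(x, y, family = "binomial", alpha = 1)`. Whether it copes with your full data is something you would have to try.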
Q3: If you take a biased sample then you can do a classic rare events design analysis. King and Zeng (2001) is a good resource for how to do so: it's very simple and amounts to an intercept correction. So yes, this is a good idea; just don't forget to correct for the sampling scheme.
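Here is a small simulation sketch of that intercept correction (the "prior correction" from King and Zeng). Everything in it is assumed for illustration: the single predictor `x`, the true coefficients, and the sampling scheme of keeping every Y plus an equal number of N. With tau the population fraction of Y and ybar the sample fraction, the corrected intercept is the fitted one minus log(((1 - tau) / tau) * (ybar / (1 - ybar))).

```r
set.seed(7)
N  <- 200000
x  <- rnorm(N)
b0 <- -4.5; b1 <- 1                       # true coefficients (known only because we simulate)
y  <- rbinom(N, 1, plogis(b0 + b1 * x))   # rare outcome in the full "population"
tau <- mean(y)                            # population fraction of Y

# Biased sample: keep every Y and an equal number of randomly chosen N.
idx_y <- which(y == 1)
idx_n <- sample(which(y == 0), length(idx_y))
s     <- data.frame(y = y[c(idx_y, idx_n)], x = x[c(idx_y, idx_n)])

fit  <- glm(y ~ x, family = binomial, data = s)
ybar <- mean(s$y)                         # sample fraction of Y (0.5 by construction)

# Prior correction: only the intercept changes, the slope is left as estimated.
b0_corrected <- coef(fit)[1] - log(((1 - tau) / tau) * (ybar / (1 - ybar)))

c(uncorrected = unname(coef(fit)[1]),
  corrected   = unname(b0_corrected),
  truth       = b0)
```

In practice tau has to come from knowledge of the full data (here, roughly 2% Y), not from the biased sample.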
Q4: User ids are potential grouping variables, so you could, if you wanted, aggregate the data by user (or other group). That could also make the estimation problem easier by moving from Bernoulli to Binomial assumptions about the dependent variable.
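A sketch of what the aggregation in Q4 could look like in base R, assuming the predictors are constant within a group. The covariate `segment` and the outcome rates are made up. Rows are collapsed to one per `groupId` with counts of Y and N, and `glm` is then fit on the counts.

```r
set.seed(3)
n <- 20000
d <- data.frame(groupId = sample(1:50, n, replace = TRUE))
d$segment <- ifelse(d$groupId <= 25, "A", "B")   # hypothetical group-level covariate
d$y   <- rbinom(n, 1, ifelse(d$segment == "A", 0.02, 0.05))
d$yes <- d$y                                      # 1 if outcome is Y
d$no  <- 1 - d$y                                  # 1 if outcome is N

# One row per group: number of Y outcomes and number of N outcomes.
agg <- aggregate(cbind(yes, no) ~ groupId + segment, data = d, FUN = sum)

# Bernoulli fit on individual rows vs. Binomial fit on the aggregated counts.
fit_rows   <- glm(y ~ segment, family = binomial, data = d)
fit_groups <- glm(cbind(yes, no) ~ segment, family = binomial, data = agg)

# The two fits should agree up to numerical tolerance.
all.equal(coef(fit_rows), coef(fit_groups), tolerance = 1e-4)
nrow(d); nrow(agg)
```

The aggregated fit works on far fewer rows, which is where the computational saving comes from. It only reproduces the row-level model when no predictor varies within a group.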
Q5: Any classification model will do, frankly: support vector machines, decision trees, or anything else should work, provided they scale to the size of your data and/or you can apply the rare events correction to them. Regularized logistic regression is a good start though. You might find the literature on text classification a good place to start looking.