I have a dataset of about 75K samples with about 20 features per sample (12 of which are probably important) describing various credit profiles – credit score, late payments, income, etc. Some of the input variables are continuous and some are categorical. The output variable is continuous – the ROI associated with each loan. What is the best supervised learning algorithm to predict the ROI? How do I select the features and weight them?
I was looking at some examples of SVM, but I need something that is not binary – something that can actually predict a continuous ROI.
If your categorical data are ordered, and therefore can be transformed to continuous (but integer) variables, then a Random Forest is likely to be a competitive solution that requires very little tuning compared with other methods (such as neural networks). RF effectively does feature selection (of sorts) internally, so you shouldn't need to do any pre-selection before feeding the data to the algorithm.
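As a minimal sketch of the above, assuming scikit-learn is available (the synthetic data below stands in for the loan dataset, and the feature count mirrors the question):

```python
# Sketch: Random Forest regression with built-in feature importances.
# The data here is synthetic; only features 0-2 actually drive the target.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # 20 features, as in the question
# ROI-like continuous target depending on a few features plus noise
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=1000)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, y)

# feature_importances_ is the internal "feature selection (of sorts)"
# mentioned above: informative features score high, noise features near zero.
top = np.argsort(rf.feature_importances_)[::-1][:3]
print("most important features:", top)
```

Note that no scaling or manual feature weighting is needed; the trees split on raw values, and `feature_importances_` can be inspected afterwards if you still want an explicit ranking.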
In the case that you have unordered categorical variables, you can create a set of new binary variables representing each categorical one. For instance, the variable Animal, which could be one of Cat, Dog or Horse, would become three variables (Cat, Dog and Horse), with the one matching the original value set to 1 and the rest set to 0. This approach is commonly called one-hot encoding (or dummy coding). It doesn't matter that this might produce many new variables, as RF scales well with the number of inputs.
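The binary expansion described above can be sketched with `pandas.get_dummies` (the column and category names here just follow the Animal example):

```python
# Sketch: expanding an unordered categorical variable into binary columns
# (one-hot / dummy coding), using the Animal example from the text.
import pandas as pd

df = pd.DataFrame({"Animal": ["Cat", "Dog", "Horse", "Cat"]})
dummies = pd.get_dummies(df["Animal"])

# One column per category; in each row, exactly one column is set.
print(dummies)
```

The resulting columns can be concatenated with the continuous features before fitting the RF.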