Solved – Algorithm to find subsets with high correlation

I have a reasonably large dataset (d) with predictor variables x1…xn and a target variable y. I can use recursive partitioning (such as CART, or rpart in R) to find subsets of d with a high (or low) average y. However, I am interested in subsets with a high correlation between x1 and y. For example, suppose that in the subset defined by d[x2>5 & x3<7], the linear model y = a*x1 + b has an r^2 of 90%, which I will call 'high'. I am looking for an algorithm that takes the dataset d as input and returns the subset d[x2>5 & x3<7] (along with any other subsets that produce a high r^2 under the linear model), together with the r^2 of each. As in recursive partitioning, this algorithm should look for subsets that are as large as possible, and it should try to reach them using as few 'steps' or 'cuts' as possible (e.g. d[x2>5 & x3<7] is two 'cuts').
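To make the idea concrete, here is a minimal brute-force sketch of a one-cut version of what I have in mind (my own illustration, not an existing package): for every candidate splitting variable and every candidate cut point, fit y ~ x1 within the resulting subset and record its r^2. The function name `subset_r2` and the choice of decile cut points are my assumptions, not part of any standard algorithm.

```r
# Hypothetical sketch: score every single-cut subset by the r^2 of
# a linear model y ~ x1 fitted inside that subset.
subset_r2 <- function(d, target = "y", regressor = "x1", min_n = 30) {
  splitters <- setdiff(names(d), c(target, regressor))
  out <- data.frame()
  for (v in splitters) {
    # candidate cut points: deciles of the splitting variable (an arbitrary choice)
    for (cut in quantile(d[[v]], probs = seq(0.1, 0.9, 0.1), names = FALSE)) {
      for (side in c("<=", ">")) {
        keep <- if (side == "<=") d[[v]] <= cut else d[[v]] > cut
        if (sum(keep) < min_n) next  # skip subsets that are too small
        fit <- lm(reformulate(regressor, target), data = d[keep, ])
        out <- rbind(out, data.frame(rule = paste(v, side, signif(cut, 3)),
                                     n = sum(keep),
                                     r2 = summary(fit)$r.squared))
      }
    }
  }
  out[order(-out$r2), ]  # best subsets first
}
```

A recursive version would then re-apply the search within the best subsets to produce multi-cut rules, penalizing depth the way rpart penalizes tree size.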

In an ideal world, I would even get to specify the model: instead of using a linear model y = a*x1 + b, I would like to use a logistic model, since y is binary in my particular dataset.
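For the binary case, the subset-scoring step could be swapped for a logistic fit. Since glm() does not report r^2, one common substitute is McFadden's pseudo-R^2 (1 minus the ratio of residual to null deviance); the function below is my own sketch of that swap, with "x1" and "y" following the question's naming.

```r
# Hypothetical sketch: score a subset with a logistic model instead of lm,
# using McFadden's pseudo-R^2 in place of r.squared.
logistic_score <- function(d_sub, target = "y", regressor = "x1") {
  fit <- glm(reformulate(regressor, target), data = d_sub, family = binomial)
  1 - fit$deviance / fit$null.deviance  # McFadden pseudo-R^2
}
```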

Is there an algorithm that can find those subsets for me automatically? Is this algorithm perchance implemented in R?

Thank you!

I think you might find something of interest in the caret package: the findCorrelation function does what you want. Its second argument is the correlation cutoff, which you can change to the desired threshold.

```r
library(caret)

tooHigh <- findCorrelation(cor(rbind(Xtrain, Xtest)), .95)
Xtrainfiltered <- Xtrain[, -tooHigh]
Xtestfiltered  <- Xtest[, -tooHigh]
```
