I have a reasonably large dataset (d) with predictor variables x1…xn and a target variable y. I can use recursive partitioning (such as CART, via rpart in R) to find subsets of d with a high (or low) average y. However, I am interested in subsets with a high **correlation** between x1 and y. For example, suppose that in the subset defined by d[x2>5 & x3<7], the linear model y = a*x1 + b has an r^2 of 90%, which I will call 'high.' I am looking for an algorithm that takes in dataset d, finds the subset d[x2>5 & x3<7] (as well as any others that produce a high r^2 under the linear model), and gives me, as output, the list of subsets found and the r^2 of each. Just like recursive partitioning, this algorithm would look for subsets that are as large as possible and try to reach them in as few 'steps' or 'cuts' as possible (e.g. d[x2>5 & x3<7] would be two 'cuts').

In an ideal world, I would even get to specify the model – i.e. instead of using a linear model y = a*x1 + b, I would like to use a logistic model, since y is binary in my particular dataset.

Is there an algorithm that can find those subsets for me automatically? Is this algorithm perchance implemented in R?

Thank you!


#### Best Answer

I think you might find something of interest using the caret package:

The `findCorrelation` function does what you want; you can change the cutoff value (0.95 below) as desired.

```r
library(caret)
tooHigh <- findCorrelation(cor(rbind(Xtrain, Xtest)), .95)
Xtrainfiltered <- Xtrain[, -tooHigh]
Xtestfiltered  <- Xtest[, -tooHigh]
```
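For the subset-finding part of the question, I am not aware of a packaged implementation, but here is a rough sketch, not a polished algorithm: try single threshold cuts on each of the other predictors (at a grid of quantiles) and keep any cut whose subset yields a high r^2 for y ~ x1. The column names, the quantile grid, the minimum subset size, and the 0.9 threshold are all illustrative placeholders. Recursing on a kept subset would produce multi-cut rules like d[x2>5 & x3<7].

```r
# Sketch: scan single cuts v > q / v <= q over all predictors except the
# driver, fit y ~ x1 within each resulting subset, and report high-r^2 cuts.
find_high_r2_cuts <- function(d, target = "y", driver = "x1",
                              min_n = 30, r2_min = 0.9) {
  cuts <- list()
  others <- setdiff(names(d), c(target, driver))
  for (v in others) {
    for (q in quantile(d[[v]], probs = seq(0.1, 0.9, 0.1), na.rm = TRUE)) {
      for (op in c(">", "<=")) {
        idx <- if (op == ">") d[[v]] > q else d[[v]] <= q
        sub <- d[idx & !is.na(idx), ]
        if (nrow(sub) < min_n) next            # skip subsets that are too small
        r2 <- summary(lm(reformulate(driver, target), data = sub))$r.squared
        if (r2 >= r2_min)
          cuts[[length(cuts) + 1]] <- list(rule = paste(v, op, round(q, 3)),
                                           n = nrow(sub), r2 = r2)
      }
    }
  }
  cuts
}
```

For the logistic case mentioned in the question, the `lm` fit could be swapped for `glm(..., family = binomial)` with a pseudo-r^2 (or deviance-based) criterion in place of `r.squared`. Note this exhaustive scan is greedy and one-cut-at-a-time, so like recursive partitioning it can miss subsets that only look good after two simultaneous cuts.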
