Solved – How to identify the variable (among many) that best discriminates between groups

I currently have a data frame with 98 observations and 107 variables. All of the variables are numeric except one, which is binary (yes or no). My goal is to determine which variable, or which correlation between variables, gives the greatest separation between the yes and no samples. I have been using the pairs() function to do this, but I can only do a few variables at a time. Is there a way to determine which one discriminates best between yes and no?

To clarify: my table has 98 observations and 107 variables, but a scatterplot matrix drawn with pairs() cannot fit all of the variables at once.

I have used this function:

pairs(x[70:80], pch=21, bg=c("red","green")[unclass(x$outcome)])

When you have multiple variables and want to find the one(s) that best discriminate between groups (the "yes" and "no" samples in this case), one tool for this is MANOVA.

# Suppose we have a data.frame with 7 variables and one group.
# response is stored as a factor so that unclass() below returns integer codes.
my.data <- data.frame(v1=rnorm(100), v2=rnorm(100), v3=rnorm(100),
                      v4=rnorm(100), v5=rnorm(100), v6=rnorm(100),
                      v7=c(rnorm(50), rnorm(50)+20),
                      response=factor(rep(c("yes","no"), each=50)))

# run MANOVA
my.mnv <- manova(cbind(v1,v2,v3,v4,v5,v6,v7) ~ response, data=my.data)

# and look at the p-values (if a variable's p-value < 0.05 then it
# significantly discriminates between "yes" and "no")
summary.aov(my.mnv)

# plot
pairs(my.data[c("v1","v2","v3","v4","v5","v6","v7")], pch=22,
      bg=c("red", "yellow")[unclass(my.data$response)])

It's not good to draw conclusions about statistical significance just from looking at the plot (although it is still necessary to look at it). In your case, with 107 variables, the pairs() plot will be very chaotic.
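
With that many variables, one way to keep the plot readable is to rank the variables by a per-variable test first and only pass the top few to pairs(). The following is a rough sketch, not a definitive recipe: the data are simulated, the names x, outcome and v1..v10 are hypothetical stand-ins for your data frame, and the Welch t-test plus Holm correction are my own choices here (some correction for multiple testing is advisable when screening around a hundred variables at p < 0.05).

```r
set.seed(1)

# Hypothetical stand-in for the real data: 98 rows, 10 numeric
# variables and one binary outcome stored as a factor.
n <- 98
x <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
names(x) <- paste0("v", 1:10)
x$outcome <- factor(rep(c("yes", "no"), each = n / 2))

# Make v7 differ strongly between the groups so there is something to find.
x$v7 <- x$v7 + ifelse(x$outcome == "yes", 3, 0)

# One Welch t-test per numeric variable, keeping only the p-value.
num_vars <- setdiff(names(x), "outcome")
p_raw <- sapply(num_vars, function(v) t.test(x[[v]] ~ x$outcome)$p.value)

# Adjust for multiple testing (Holm) and sort by raw p-value.
p_adj <- p.adjust(p_raw, method = "holm")
ranking <- data.frame(variable = num_vars, p_raw = p_raw, p_adj = p_adj,
                      stringsAsFactors = FALSE)
ranking <- ranking[order(ranking$p_raw), ]
print(head(ranking, 5))

# Plot only the best-ranked variables, coloured by group.
top <- head(ranking$variable, 4)
pairs(x[top], pch = 21, bg = c("red", "green")[unclass(x$outcome)])
```

On your own data you would drop the simulation lines and keep everything from num_vars onward, replacing outcome with the name of your yes/no column.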
