Solved – Finding the best two predictor variables used conjointly, and levels of each

Twenty possible predictor variables in data set. One outcome variable.

Some of the predictor variables are not linear. So a standard linear multiple regression approach probably won't do. (And I do not want to transform the variables using some arbitrary "cubic," "quadratic" approach — leave the underlying variables as they are.)

Example: Time of day. This oscillates regularly, over 24 hours. (Like a sine wave.)

There is considerable random chaos inherent in some variables.

Example: Highway traffic frequency.

Some curve fitting may be advisable to dampen or smooth out the chaos.

I'm looking for possible statistical packages (and specific procedures) that can accomplish the goal — determining the two best predictor variables taken together of the twenty, and levels of each, that appear to best predict the outcome variable.

Considering all twenty variables, there are 190 possible combinations. Doing an analysis of each possible pair would take a great deal of effort. Best to find a statistical package that can run tests on each possible combination in one long go.

In summary: Software Package and Specific Procedure please. No need for a bunch of code.

Fitting your 190 models in one go is easy in software like R. See code below for an example. However, as @rolando2 suggests, some human judgement would go a long way. For example, for each of the 20 variables, could you look at them one at a time and work out the best way for them to be transformed or smoothed? (as you have started to do). Doing it bY brute force (eg fitting a cubic polynomial, as I do below) is not really recommended.

@rolando2's other essential point is that you should use some cross-validation. At a minimum, separate your data into a training and a testing set. There is lots of good advice on this elsewhere on the site.

Example of fitting 190 models, each with cubic relationships, below. This is meant to be more of an illustration as an answer than a recommendation. Basically, your conceptual and analytical tasks are much more complicated than mere software to fit lots of models.


# set up data expl <-, ncol=20)) names(expl) <- paste0("Var",1:20) resp <- rnorm(10000)   pair <- combn(1:20,2)  # create somehwere to hold results  res <- list()  for (i in 1:190){     x1 <- expl[ , pair[1,i]]     x2 <- expl[ , pair[2,i]]     res[[i]] <- lm(resp ~ poly(x1,3) + poly(x2,3)) }  names(res) <- paste0(names(expl)[pair[1,]], names(expl)[pair[2,]]) summary(res[[1]])$r.squared  # calculate the R squared for each report <- unlist(lapply(res, function(x){summary(x)$r.squared}))  # print to screen, with highest R-squared last report[order(report)]  

Similar Posts:

Rate this post

Leave a Comment