I am trying to do a regression analysis for some data, say 20 variables
$\left( x_1, x_2, x_3, \ldots \right)$ where the underlying probability distribution of each variable is known (e.g. $x_1 \sim \mathrm{N}(\mu_1, \sigma_1)$, $x_2 \sim U(a_2, b_2)$, and so on).
The variables are assumed to be uncorrelated. The overall behavior of $y = f(x_1, x_2, x_3, \ldots)$ is nonlinear.
Now, I would like to use a method from the scikit-learn module (e.g. Lasso, Lars, Ridge, Bayesian regression, etc.) to fit a metamodel.
To take into account the cross-influences and the nonlinear behavior of some variables in $y$, I want to use a polynomial, i.e. I don't just give the vector $\overrightarrow{x} = \left( x_1, x_2, x_3, \ldots \right)$ to the regression method; rather, I feed it $\left( x_1, x_2, x_3, \ldots, x_1 \cdot x_1, x_1 \cdot x_2, \ldots \right)$, which is a polynomial of degree two or more.
My question is: which method (Lasso, etc.) is best suited for this problem? And how can I tell the regression method that the underlying distributions (mean and standard deviation) and the correlations of the higher-order terms are known? For example, $x_1$ and $x_1 \cdot x_1$ are highly correlated, and both of their distributions are known via $\mu$ and $\sigma$. How can I add this information to the regression analysis? I guess that without this additional information it will fit a poor model.
Any ideas?
Best Answer
You wrote that you want to use sklearn anyway; did you take a look at the sklearn.preprocessing.PolynomialFeatures class? This should solve the first part of your problem.
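For illustration, a minimal sketch of the expansion (the toy distributions and sample size are placeholders, not your actual data):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy data: 100 samples of 3 uncorrelated inputs (stand-in for your 20 variables).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0.0, 1.0, 100),    # x1 ~ N(mu1, sigma1)
    rng.uniform(-1.0, 1.0, 100),  # x2 ~ U(a2, b2)
    rng.normal(2.0, 0.5, 100),
])

# Expand to degree-2 terms: x1, x2, x3, x1*x1, x1*x2, ...
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out())  # shows which column is which term
```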
For the other part, why not actually try it and measure? Run, e.g., LassoCV on the polynomial dataset and check whether holding out very correlated features changes performance.
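A sketch of that experiment; the synthetic $f$ here is an assumption just to make the snippet runnable:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for y = f(x1, x2, x3) with interaction and square terms.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)

# Scale the polynomial features: the Lasso penalty is scale-sensitive.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LassoCV(cv=5),
)
model.fit(X, y)
print(model[-1].alpha_)  # regularization strength chosen by cross-validation
print(model.score(X, y))  # R^2 on the training data
```

Repeating the fit with correlated columns dropped from the expansion lets you compare scores directly.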
Embedding this information sounds rather complicated; I'd go for the simpler approach of either removing correlated features beforehand or running a PCA on them, and see how things change.
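The PCA variant could look like this; the 0.95 variance threshold is a placeholder you would tune:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Decorrelate the polynomial features with PCA before the linear model.
# n_components=0.95 keeps enough components to explain 95% of the variance.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    PCA(n_components=0.95),
    LassoCV(cv=5),
)
model.fit(X, y)  # X, y as in the previous snippet
```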