Solved – What to do if the number of features is much larger than the number of observations

I have the following concrete example of the problem. Every day I measure 15 different parameters. At the end of the month I have one number (let's call it the "target") associated with that month. I want to determine how the target depends on the parameters. I have data for about 18 months.

What I have done so far is the following. For each parameter I calculated monthly averages; in other words, I replaced the daily data with monthly data. As a result, I needed to find how the target depends on the 15 monthly-averaged parameters.
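
As a rough sketch of this averaging step (with synthetic data and made-up column names like `param_0`), one way to collapse daily measurements into one row per month:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical daily measurements: 18 months' worth of days, 15 parameters.
dates = pd.date_range("2022-01-01", periods=18 * 30, freq="D")
daily = pd.DataFrame(
    rng.normal(size=(len(dates), 15)),
    index=dates,
    columns=[f"param_{i}" for i in range(15)],
)

# Replace daily data with monthly averages: one row per (year, month).
monthly = daily.groupby([daily.index.year, daily.index.month]).mean()
print(monthly.shape)  # 18 months x 15 parameters
```

This yields an 18 × 15 design matrix, which is the setting described below.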

I found that some of the 15 monthly-averaged parameters correlate strongly with the target. So my current model is: the target is a linear function of a few (in my case 2 out of 15) of the monthly-averaged parameters.

Now I want to try to improve the model. My problem is that even for a simple generalization, in which the target is a linear function of all 15 parameters, I get over-fitting (and this is expected, since the number of observations (18) is close to the number of features (15), and I also want to split the data into training and evaluation sets).
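
The over-fit is easy to demonstrate on synthetic data of the same shape (18 observations, 15 features, target truly depending on only 2 of them): the in-sample fit looks excellent while leave-one-out cross-validation, which spends only one month at a time on evaluation, reveals a much larger out-of-sample error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)

# Hypothetical data: 18 months, 15 monthly-averaged parameters;
# the target depends on only 2 of them plus noise.
X = rng.normal(size=(18, 15))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=18)

model = LinearRegression()
train_r2 = model.fit(X, y).score(X, y)

# Leave-one-out CV: each of the 18 months serves once as the evaluation set.
loo_mse = -cross_val_score(
    model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
).mean()
print(f"in-sample R^2: {train_r2:.3f}, LOO mean squared error: {loo_mse:.3f}")
```

With 15 features plus an intercept fitted to 18 points, the in-sample R² is close to 1 almost regardless of the data, which is exactly the over-fitting described above.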

I do not like the fact that I replace the daily parameters with monthly ones (by averaging them): I lose a lot of information. But without the averaging I would have 30 × 15 = 450 features and only 18 observations, so over-fitting is unavoidable. What can one do in this case?

The first thing that comes to mind is dimensionality reduction, but that would work if I needed to replace, say, a 200-dimensional vector with a 3-dimensional one. In my case I do not have a "vector": the data for each month is a "matrix" (15 parameters × 30 days).
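
One common workaround (a sketch, not necessarily the right model for this data) is to flatten each month's 15 × 30 matrix into a single 450-dimensional vector and then apply PCA; note that with only 18 observations, PCA can return at most 18 components no matter how large the flattened dimension is.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical raw data: 18 months, each a 15 x 30 matrix (parameters x days).
raw = rng.normal(size=(18, 15, 30))

# Flatten each month's matrix into one 450-dimensional feature vector...
X = raw.reshape(18, -1)

# ...then reduce to a handful of principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```

Flattening does discard the parameter-vs-day structure of the matrix, so it is a pragmatic baseline rather than a principled treatment of matrix-valued observations.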

The obvious thing to try is regularization if you use linear regression. Its cost function contains a regularization term, which controls the trade-off between bias and variance; you only need to specify a single parameter $\lambda$. More details here – regularization
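
A minimal sketch of this suggestion with ridge (L2-regularized) regression in scikit-learn, on synthetic data of the shape described in the question; note that scikit-learn calls $\lambda$ `alpha`, and the data here is made up:

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(3)

# Hypothetical data: 18 months, 15 monthly-averaged parameters;
# the target really depends on only two of them.
X = rng.normal(size=(18, 15))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=18)

# RidgeCV chooses lambda (called alpha here) by cross-validation;
# leave-one-out is a natural choice with only 18 observations.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=LeaveOneOut()).fit(X, y)
print(f"lambda chosen by CV: {ridge.alpha_:.4f}")

# A larger lambda shrinks all 15 coefficients toward zero
# (more bias, less variance).
strong = Ridge(alpha=100.0).fit(X, y)
weak = Ridge(alpha=0.001).fit(X, y)
print(np.abs(strong.coef_).sum(), "<", np.abs(weak.coef_).sum())
```

If you also want the model to select a small subset of parameters automatically (as the questioner did by hand with 2 of 15), L1 regularization (`Lasso`/`LassoCV`) drives some coefficients exactly to zero.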
