I have the following concrete example of the problem. Every day I measure 15 different parameters. At the end of each month I have one number (let's call it the "target") associated with that month. I want to determine how the target depends on the parameters. I have data for about 18 months.
What I have done so far is the following. For each parameter I calculated monthly averages. In other words, I replaced the daily data with monthly data. As a result, I need to find how the target depends on the 15 "monthly-averaged" parameters.
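The averaging step described above can be sketched as follows. This is only an illustration on synthetic stand-in data, assuming every month has exactly 30 days; the function name `monthly_averages` and the nested-list layout are assumptions, not part of the original setup.

```python
import random

def monthly_averages(daily):
    """Average each parameter over the days of one month.

    daily: list of days, each day a list of parameter values.
    Returns one averaged value per parameter.
    """
    n_days = len(daily)
    n_params = len(daily[0])
    return [sum(day[p] for day in daily) / n_days for p in range(n_params)]

# Synthetic stand-in data (assumption): 18 months x 30 days x 15 parameters.
random.seed(0)
months = [[[random.gauss(0, 1) for _ in range(15)] for _ in range(30)]
          for _ in range(18)]

# Design matrix of monthly averages: 18 rows (months), 15 columns (parameters).
X = [monthly_averages(m) for m in months]
print(len(X), len(X[0]))
```

Each month's 30 x 15 block collapses to a single row of 15 numbers, which is exactly where the within-month information is lost.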
I found that some of the 15 monthly-averaged parameters correlate strongly with the target. So my current model is: the target is a linear function of a few (in my case 2 out of 15) monthly-averaged parameters.
Now I want to improve the model. My problem is that even for a simple generalization, in which the target is a linear function of all 15 parameters, I get over-fitting. This is to be expected, since the number of observations (18) is close to the number of arguments (15), and I also want to have training and evaluation sets.
I do not like the fact that I replace the daily parameters with monthly ones (by averaging them). I lose a lot of information. But without that I would have 30*15 = 450 arguments per month and only 18 observations, so over-fitting is unavoidable. What can one do in this case?
The first thing that comes to mind is dimensionality reduction, but that would work if I needed to replace a 200-dimensional vector with, let's say, a 3-dimensional one. In my case I do not have a "vector". The data for each month is a "matrix" (15 parameters x 30 days).
Best Answer
If you use linear regression, the obvious thing to try is regularization. It adds a penalty term to the cost function, which controls the trade-off between bias and variance. You only need to specify a single parameter, $\lambda$. More details here – regularization
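As a rough illustration, here is a minimal sketch of one common form of regularized linear regression, ridge (L2) regression, solved through the normal equations $(X^T X + \lambda I)\,w = X^T y$ on centered data. The answer does not say which penalty it has in mind, so the L2 choice is an assumption, as are the names `solve` and `ridge_fit`, the synthetic data, and the value `lam=5.0`.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge_fit(X, y, lam):
    """Return (weights, intercept) minimizing
    sum((y - X w - b)^2) + lam * sum(w^2); the intercept b is not penalized."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [yi - y_mean for yi in y]
    # Normal equations on centered data: (Xc^T Xc + lam * I) w = Xc^T yc
    A = [[sum(Xc[i][j] * Xc[i][k] for i in range(n)) + (lam if j == k else 0.0)
          for k in range(d)] for j in range(d)]
    b = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(d)]
    w = solve(A, b)
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept

def norm(w):
    return sum(v * v for v in w) ** 0.5

# Synthetic stand-in data (assumption): 18 months, 15 monthly-averaged
# parameters, target driven by only two of them plus noise.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(15)] for _ in range(18)]
y = [2.0 * row[0] - 1.0 * row[3] + random.gauss(0, 0.5) for row in X]

w_ols, _ = ridge_fit(X, y, lam=0.0)    # no penalty: ordinary least squares
w_ridge, _ = ridge_fit(X, y, lam=5.0)  # penalized fit shrinks the weights
print(norm(w_ols), norm(w_ridge))
```

In practice $\lambda$ would be chosen by validation rather than fixed by hand; with only 18 observations, leave-one-out cross-validation is one plausible option.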