Suppose you have 500,000 possible factors that could affect your response variable 'profit'. What is the best way to deal with this data set, and how large should it be for the analysis to be valid? I know there are a number of methods for dimension reduction, such as PCA, but are there more 'practical' approaches? Also, if I wanted to use all 500,000 factors, how large would n need to be? I know you can fit models with p > n, but is there a threshold?
This is a very open-ended question, and in many cases domain knowledge will play a crucial role. Having said that, I think it will be well worth your time to check the Royal Society's Philosophical Transactions A theme issue on Statistical challenges of high-dimensional data.
In general, everything revolves around penalised estimators (e.g. LASSO), appropriate dimension reduction (e.g. PCA) and/or intelligent information criteria (e.g. AICc). Do not expect a silver-bullet approach (e.g. boosting) to solve all your problems. At the end of the day, all models represent our intrinsic understanding of the problem's nature. If we have nearly no understanding of what we are trying to solve/model, and we simply dump data in and hope for the best, there is a good chance we will get meaningless answers.
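To make the penalised-estimator idea concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available, and using purely synthetic data) of fitting a cross-validated LASSO when p > n. The point is that the L1 penalty can zero out most of the coefficients, so only a handful of factors survive:

```python
# Hypothetical illustration: LASSO with more predictors than observations.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 1000                          # far more predictors than observations
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]    # only 5 factors truly matter
y = X @ beta + rng.standard_normal(n)

# Cross-validation chooses the penalty strength automatically.
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)    # indices of non-zero coefficients
print(f"LASSO kept {selected.size} of {p} predictors")
```

Of course, on real data the signal is rarely this clean, and the selected set can be unstable; this is exactly where domain knowledge should guide which factors are even candidates.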