Suppose I have a data set with the following structure:
Each row of the data set indexes a town. The first column (feature variable) is the total population. Other feature variables give the count of people who own various items (one feature variable for cars, one for home appliances, etc.), and still others measure quantities such as average income.
Now, it is often necessary to 'transform' the feature vectors before running some sort of regression algorithm on the data, for example standardizing them.
Suppose the towns have very disparate populations (call this feature $X_1$ and let town $i$ have value $X_1^i$). Consider a feature, say $X_2$, that measures the count of some item in each town. My question is:
Should one, in general, first transform $X_2$ in proportion to the total population of the towns, that is, $X_2^i \mapsto \frac{X_2^i}{X_1^i}$, and then standardize the column by $X_2^i \mapsto \frac{X_2^i-\bar{X_2}}{\hat{\sigma}_{X_2}}$, or even simply rescale the values using the range $[\min(X_2), \max(X_2)]$?
The reason I am asking is that I can imagine a case where, despite the towns having very different population counts, there is an item which has roughly the same count in each town. In that case, if we were to simply standardize the columns, it would reduce the values in the feature column to zero (or nearby) and, intuitively, there would be a tremendous loss of information.
Assume that I know that $X_1$ is collinear with $X_2$ and I won't be using that feature.
Best Answer
There is nothing particularly wrong with standardizing your variables; it simply won't do anything beneficial for you. The best guide to the topic on Cross Validated is "When should you center and when should you standardize?" I would recommend you read it.
The statement that standardizing will "reduce the values in the feature column to zero (or nearby) and intuitively, there will be tremendous loss of information" if "there is an item which has roughly the same count in each town" is not correct. Standardizing only shifts and rescales a column, so it can be undone exactly: if there is any variation at all in those features, the exact same amount of information will exist in the variables after standardizing them. On the other hand, if every town has the same amount, there is no information in that feature to begin with. In some sense, there wouldn't be any information after standardizing if you could do it, but it would be impossible to standardize such a variable, because the standard deviation would be $0$ and you can't divide by zero.
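The point about information can be checked numerically. The following is only a minimal sketch on simulated data: the counts, the outcome, and the helper `standardize` are all made up for illustration and are not from the question. It builds a count that is nearly the same in every town, standardizes it, and compares the correlation with an outcome before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: an item whose count is nearly the same in every town.
counts = 1000 + rng.normal(0, 2, size=50)           # tiny spread around 1000
outcome = 3.0 * counts + rng.normal(0, 1, size=50)  # made-up response variable


def standardize(x):
    """Center to mean 0 and scale to sample standard deviation 1."""
    sd = x.std(ddof=1)
    if sd == 0:
        # A constant column carries no information and cannot be standardized.
        raise ValueError("cannot standardize a constant column (standard deviation is 0)")
    return (x - x.mean()) / sd


z = standardize(counts)

# Correlation with the outcome is unchanged by the shift and rescale.
r_raw = np.corrcoef(counts, outcome)[0, 1]
r_std = np.corrcoef(z, outcome)[0, 1]
print(r_raw, r_std)
```

Under these assumptions the two correlations agree up to floating-point error, and calling `standardize` on a truly constant column raises an error instead of dividing by zero.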
Regarding the question in your comment, I would recommend you turn each variable that indicates ownership of an item from a count into a proportion of the town's population. Of course it depends on what you want to find out, but the raw counts are almost certainly less informative for your question than proportions, and if you include the town population variable alongside raw counts, you are likely to have multicollinearity that unhelpfully increases your standard errors.
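As a rough illustration of the count-to-proportion suggestion, here is a small sketch on simulated towns. All names and numbers (`population`, `rate`, `car_count`, `car_share`) are hypothetical, and the ownership rate is generated independently of population by construction, which real data need not satisfy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical towns with very disparate populations (X_1 in the question's notation).
population = rng.integers(500, 500_000, size=200).astype(float)

# Hypothetical ownership rate per town, independent of population by construction.
rate = rng.uniform(0.2, 0.6, size=200)
car_count = np.round(population * rate)  # raw count (X_2 in the question's notation)

# Convert the raw count to a proportion of the town's population.
car_share = car_count / population

# The raw count tracks population closely; the proportion need not.
corr_count = np.corrcoef(population, car_count)[0, 1]
corr_share = np.corrcoef(population, car_share)[0, 1]
print(corr_count, corr_share)
```

In this simulated setup the raw count is strongly correlated with population, while the proportion is not, which is the multicollinearity concern described above. Whether the same holds for real towns depends on the data.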