Solved – Cluster analysis with skewed distributions

For my master's thesis I would like to use different clustering algorithms to cluster municipalities (as objects) with regard to their land-use characteristics (as variables).

While analyzing my data descriptively, I noticed that many of the distributions are extremely right-skewed (for example, many municipalities have zero values for some land-use options, while others have very high values). This is also true after standardizing my data by area or population…

Can anyone give me some advice on how this will affect my cluster analysis? I think it may be important for the choice of my distance measure (for example, whether the absence of a value should be interpreted as non-similarity).

I didn't find much information on this case in the common literature.

You may want to spend more time on data preparation.

For example, one may argue that "area" is inherently a quadratic quantity (and "volume" is inherently cubic). Thus, in order to make attributes more comparable, a transformation such as x_new = sqrt(x_old) may be sensible for some attributes.
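As a rough illustration of the square-root idea, here is a minimal Python sketch. The data are made up (60 municipalities with a zero value and 40 with lognormal values), and the names `area_values` and `skewness` are hypothetical, not taken from the question. It only shows that the transformation keeps zeros at zero and pulls in the long right tail; whether it is appropriate for a given land-use attribute is still a judgment call.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment divided by variance**1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Hypothetical attribute: many municipalities with zero, a few with large values.
rng = np.random.default_rng(0)
area_values = np.concatenate([
    np.zeros(60),
    rng.lognormal(mean=1.0, sigma=1.0, size=40),
])

# Square-root transformation; unlike log(x), it is defined for zero values.
area_sqrt = np.sqrt(area_values)

print("skewness before:", skewness(area_values))
print("skewness after: ", skewness(area_sqrt))
```

Note that the zeros are still a point mass after the transformation, so the question of how the distance measure should treat a shared absence of a land use remains open.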
