I have a bunch of users. Each user has a number of personality attributes, such as "fitness level" or "eco-consciousness", rated on a scale from 1 to 5. I want to calculate how similar two users are, so I can show each user a sorted list of "most similar users".
This seems to be a classic IR problem, and I've seen three different metrics used, but no discussion of why to choose one over the other:
Simple Arithmetic. The scores are already normalized to the same scale, so I can just add each user's scores up, and compare the sums to see who is most similar.
Cosine Similarity. Treat each user as an n-dimensional vector, where each scale is one dimension. Calculate the cosine of the angle between two users' vectors; cosines closer to 1 (smaller angles) are more similar.
Euclidean Distance. Each user is an n-dimensional vector again, but this time, calculate the distance between endpoints. Users that are close together are similar.
What are the advantages and disadvantages of each method? How does that change if the scores are not normalized to the same scale (e.g. if I add an "age" attribute)?
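To make the three candidates concrete, here is a minimal Python sketch. The function names and the sample users (`alice`, `bob`, `carol`) are made up for illustration, and "simple arithmetic" is interpreted here as comparing each user's total score.

```python
import math

def sum_difference(a, b):
    """'Simple arithmetic': absolute difference between the two totals."""
    return abs(sum(a) - sum(b))

def cosine_similarity(a, b):
    """Cosine of the angle between the two score vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the two endpoints (0.0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical users rated on three attributes, each on the 1-5 scale.
alice = [5, 1, 3]
bob = [1, 5, 3]    # opposite profile to alice, same total
carol = [4, 2, 3]  # close to alice on every attribute

for name, other in [("bob", bob), ("carol", carol)]:
    print(name,
          sum_difference(alice, other),
          round(cosine_similarity(alice, other), 3),
          round(euclidean_distance(alice, other), 3))
```

Two things this sketch makes visible: comparing totals cannot tell `alice` from `bob` even though their profiles are opposite, and cosine similarity treats `[1, 1, 1]` and `[5, 5, 5]` as identical because it ignores vector length.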
Align similarity formula with conceptual notion of similarity
- I would try to align the mathematical formula used to calculate similarity with what you intuitively or theoretically mean by similarity. A few conceptual issues include:
- Do you standardise within person and thereby focus on the profile of scores rather than the raw differences?
- Do you want to square differences on individual attributes (rather than take absolute differences) and thereby weight a few larger differences more than a larger number of small differences?
- Do you want to weight differences on each variable equally or let some count for more? If some are to count for more, are you going to decide how this operates, or will you let something like the standard deviation of the attribute determine it?
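The choices above can be sketched in a few lines of Python. This is only an illustration: `centre_within_person` and `weighted_distance` are hypothetical helper names, the sample profiles are invented, and within-person standardisation is reduced here to subtracting each user's own mean (dividing by the user's own standard deviation would be a further step).

```python
def centre_within_person(scores):
    """Subtract the user's own mean so only the shape of the profile remains."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

def weighted_distance(a, b, weights=None, power=2):
    """Weighted distance: power=1 uses absolute differences, power=2 squared ones."""
    if weights is None:
        weights = [1.0] * len(a)  # assumption: equal weights unless told otherwise
    total = sum(w * abs(x - y) ** power for w, x, y in zip(weights, a, b))
    return total ** (1 / power)

base = [2, 2, 2, 2]
one_big = [5, 2, 2, 2]      # a single difference of 3
many_small = [3, 3, 3, 3]   # four differences of 1

# Absolute differences: the four small gaps add up to more than the one big gap.
print(weighted_distance(base, one_big, power=1),
      weighted_distance(base, many_small, power=1))

# Squared differences: the single large gap now dominates.
print(weighted_distance(base, one_big, power=2),
      weighted_distance(base, many_small, power=2))

# Same profile shape at different levels: identical after centring.
print(weighted_distance(centre_within_person([1, 2, 3]),
                        centre_within_person([3, 4, 5])))
```

Which of `one_big` and `many_small` counts as closer to `base` flips with the exponent, which is exactly the kind of decision that should follow from what you mean by similarity.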
Inspect groupings produced by different similarity formulas
- It's a good idea to inspect the similarity-based groupings produced by different algorithms. See whether they converge or differ. Inspect the disagreements and see which algorithm best maps onto your conceptual definition of similarity.
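One simple way to do this kind of inspection is to rank the same candidates under two formulas and look at where the orderings differ. The sketch below assumes Python 3.8+ (for `math.dist`); `rank_neighbours`, `cosine` and the sample users are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two score vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rank_neighbours(target, others, score, higher_is_closer=False):
    """Order the names in `others` from most to least similar to `target`."""
    return sorted(others,
                  key=lambda name: score(target, others[name]),
                  reverse=higher_is_closer)

target = [4, 4, 4]
others = {"ann": [5, 5, 4], "ben": [2, 2, 2], "cat": [4, 3, 4]}

by_euclidean = rank_neighbours(target, others, math.dist)
by_cosine = rank_neighbours(target, others, cosine, higher_is_closer=True)

print("euclidean:", by_euclidean)
print("cosine:   ", by_cosine)

# Names whose rank differs between the two formulas are the ones worth inspecting.
disagreements = [n for n in others
                 if by_euclidean.index(n) != by_cosine.index(n)]
print("disagree on:", disagreements)
```

In this toy data, `ben` has the same profile shape as the target but much lower scores throughout, so cosine ranks him first and Euclidean distance ranks him last. Whether that is a feature or a bug depends on your definition of similarity.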
A starting point
- I think Euclidean or squared Euclidean distance based on standardised variables is often a good option (it would be my starting point for your problem). Of course, if all your variables have a similar standard deviation (as is often the case when comparing Likert items), then whether you standardise or not probably will not make much difference.
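As a rough sketch of that starting point, the following Python (3.8+) z-scores each attribute across users and then ranks neighbours by Euclidean distance. The helper names and the four sample users are made up, the population standard deviation is an arbitrary choice, and the code assumes no attribute is constant across users (that would give a standard deviation of zero).

```python
import math
import statistics

def standardise_columns(users):
    """Z-score each attribute across users (mean 0, SD 1 per column)."""
    names = list(users)
    columns = list(zip(*(users[n] for n in names)))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]  # assumes every SD is non-zero
    return {n: [(v - m) / s for v, m, s in zip(users[n], means, sds)]
            for n in names}

def most_similar(name, users):
    """Other users sorted from nearest to farthest by Euclidean distance."""
    others = [n for n in users if n != name]
    return sorted(others, key=lambda n: math.dist(users[name], users[n]))

# Hypothetical data: [fitness level, eco-consciousness, age in years].
users = {
    "ann": [5, 4, 25],
    "ben": [5, 4, 60],
    "cat": [1, 2, 27],
    "dan": [4, 5, 30],
}

print("raw:         ", most_similar("ann", users))
print("standardised:", most_similar("ann", standardise_columns(users)))
```

On the raw scores, age swamps the two 1-5 attributes, so `cat` looks closest to `ann` despite very different ratings. After standardising, each attribute contributes on a comparable footing and `dan` comes out nearest.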