Solved – Comparing two lists to measure similarity

The problem I am trying to solve is finding the probability that two people are the same by cross-referencing the associates of the two people. For example, if person A is associated with the following people:

  • Jeff
  • Rick
  • Jessica
  • Mary

Person B is associated with the following people:

  • Ryan
  • Mary
  • Dennis
  • Scott
  • Jeff
  • Sharon
  • Rick
  • Larry
  • James

So these two people have the following people in common:

  • Mary
  • Jeff
  • Rick

How would I go about figuring out the likelihood that Person A and Person B are the same based on the common relationship with the three people above? There are three factors I can see right now, but I don't know how to weigh any of them:

  1. Ratio of common associates (doubled because seen from both sides) over the total number of associates
  2. Ratio of common associates over the number of associates for Person A
  3. Ratio of common associates over the number of associates for Person B
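
As a rough illustration, the three ratios listed above can be computed directly with set operations. This is only a sketch using the example names from the question; the variable names are made up for illustration, and it says nothing about how the ratios should be weighted. The first ratio, as written, has the same form as the Sørensen–Dice coefficient.

```python
# Example associates from the question.
person_a = {"Jeff", "Rick", "Jessica", "Mary"}
person_b = {"Ryan", "Mary", "Dennis", "Scott", "Jeff",
            "Sharon", "Rick", "Larry", "James"}

# Associates that appear in both lists.
common = person_a & person_b

# 1. Common associates (counted from both sides) over the total number of associates.
ratio_both = 2 * len(common) / (len(person_a) + len(person_b))

# 2. Common associates over the number of associates for Person A.
ratio_a = len(common) / len(person_a)

# 3. Common associates over the number of associates for Person B.
ratio_b = len(common) / len(person_b)

print(sorted(common))
print(round(ratio_both, 3), round(ratio_a, 3), round(ratio_b, 3))
```

For the example lists this works out to 6/13 for the first ratio, 3/4 for Person A, and 3/9 for Person B.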

I'm not a statistician, so I don't know if what I've presented is the correct way to solve the problem. Can anyone provide some guidance?

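I think the Jaccard distance would be suitable for this problem. The Jaccard similarity coefficient is the number of common associates divided by the number of distinct associates across both people, and the Jaccard distance is one minus that coefficient. For large lists, the MinHash algorithm estimates the Jaccard similarity coefficient without comparing the full sets.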
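
Here is a minimal sketch of both ideas, assuming the associates are plain strings. The function names, the choice of SHA-1 as the hash, and the default of 128 hash functions are all assumptions made for illustration, not part of any particular library.

```python
import hashlib

def jaccard_similarity(a, b):
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)

def _seeded_hash(item, seed):
    # Hypothetical helper: derive one hash function per seed from SHA-1.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=128):
    # One minimum per hash function; requires a non-empty collection.
    return [min(_seeded_hash(x, seed) for x in items)
            for seed in range(num_hashes)]

def minhash_estimate(a, b, num_hashes=128):
    # The fraction of matching signature positions estimates the Jaccard similarity.
    sig_a = minhash_signature(a, num_hashes)
    sig_b = minhash_signature(b, num_hashes)
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / num_hashes

person_a = {"Jeff", "Rick", "Jessica", "Mary"}
person_b = {"Ryan", "Mary", "Dennis", "Scott", "Jeff",
            "Sharon", "Rick", "Larry", "James"}

similarity = jaccard_similarity(person_a, person_b)
print("Jaccard similarity:", similarity)
print("Jaccard distance:", 1 - similarity)
print("MinHash estimate:", minhash_estimate(person_a, person_b))
```

For the example in the question the exact similarity is 3 common associates out of 10 distinct ones, i.e. 3/10. With sets this small the exact calculation is all you need; MinHash only pays off when the sets are too large or too numerous to compare directly, and its estimate gets closer to the exact value as the number of hash functions grows.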
