Solved – Algorithms for clustering documents by similar words and phrases

I'm working on a project where I'm trying to take a pair of documents and find and group (cluster) similar words and phrases between them.

Which algorithm would solve this kind of problem? I know this is a very basic and probably subjective question, but I'm new to clustering, and I'm still finding my way around the vocabulary.

Your help would be appreciated.

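Right off the bat, you may want to look at various string distances. The only one I'm familiar with is the Levenshtein distance, which is pretty rudimentary: it counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. You could apply it to sentences or phrases as well, treating each word as a single unit.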
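To make the idea concrete, here is a minimal sketch of the Levenshtein distance in plain Python. The function name `levenshtein` is just my own choice for illustration; it accepts any two sequences, so you can pass strings to compare characters or lists of words to compare phrases.

```python
def levenshtein(a, b):
    """Minimum number of single-item insertions, deletions, or
    substitutions needed to turn sequence a into sequence b."""
    # Keep the shorter sequence in b so each row is as small as possible.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # delete x from a
                current[j - 1] + 1,      # insert y into a
                previous[j - 1] + cost,  # substitute x with y (or match)
            ))
        previous = current
    return previous[-1]


# Character-level comparison of two words.
print(levenshtein("kitten", "sitting"))
# Word-level comparison of two phrases.
print(levenshtein("the quick brown fox".split(), "the quick red fox".split()))
```

A small distance relative to the length of the inputs suggests the two words or phrases are near-duplicates, which is one rough way to decide what belongs in the same group.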

You may want to take a look at some natural language processing techniques, too, such as tokenizing and stemming your data before running any clustering algorithms on it. If you like Python, I highly recommend NLTK, which has lots of modules for natural language processing. It also includes an edit distance function (nltk.edit_distance) and a clustering module (nltk.cluster). A quick Google search gives me this package, but I've never used it.
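As a rough illustration of why tokenizing and stemming help, the sketch below groups words from two documents that reduce to the same stem. Everything here is an assumption for the sake of the example: `tokenize`, `crude_stem`, and `shared_word_groups` are hypothetical helper names, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer (such as the Porter stemmer that NLTK provides).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def crude_stem(word):
    """Naive stand-in for a real stemmer: strip one common suffix,
    as long as at least three characters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def shared_word_groups(doc_a, doc_b):
    """Map each stem that occurs in both documents to the sorted
    list of surface forms seen for it in either document."""
    # For every stem, keep one set of surface forms per document.
    groups = defaultdict(lambda: (set(), set()))
    for index, doc in enumerate((doc_a, doc_b)):
        for token in tokenize(doc):
            groups[crude_stem(token)][index].add(token)
    # Keep only the stems that appear in both documents.
    return {
        stem: sorted(forms_a | forms_b)
        for stem, (forms_a, forms_b) in groups.items()
        if forms_a and forms_b
    }


doc_a = "The cats were jumping over fences"
doc_b = "A cat jumped over the fence"
for stem, forms in shared_word_groups(doc_a, doc_b).items():
    print(stem, forms)
```

With a proper stemmer in place of `crude_stem`, the same grouping step would handle far more word forms; the naive version here only covers a few regular English suffixes.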

Edit: Upon reflection, I might have misunderstood your question – are you clustering documents, or words/phrases?
