I'm working on a project where I'm trying to take a pair of documents and find and group (cluster) similar words and phrases between them.
Which algorithm would solve this kind of problem? I know this is a fairly basic and probably subjective question, but I'm new to clustering and still getting to grips with the vocabulary.
Your help would be appreciated.
Best Answer
Right off the bat, you may want to look at various string distances. The only one I'm familiar with is the Levenshtein (edit) distance, which is pretty rudimentary. You could apply it to sentences or phrases.
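For example, here's a minimal sketch using nltk's `edit_distance`; the phrases are made up, and normalizing by the longer string's length is just one convenient convention for turning the raw count into something threshold-friendly:

```python
# A minimal sketch: compare two phrases with Levenshtein (edit) distance
# via nltk.edit_distance. The phrases are made-up examples.
from nltk import edit_distance

phrase_a = "machine learning methods"
phrase_b = "machine-learning method"

dist = edit_distance(phrase_a, phrase_b)
# Normalizing by the longer string gives 0.0 for identical strings and
# 1.0 for completely different ones, which makes thresholds easier to pick.
norm_dist = dist / max(len(phrase_a), len(phrase_b))
print(dist, round(norm_dist, 2))
```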
You may also want to look at some natural language processing techniques, such as stemming and tokenizing your data before running any clustering algorithms on it. If you like Python, I highly recommend nltk, which has lots of packages for natural language processing; it even includes distance metrics (`nltk.metrics.distance`) and some clustering utilities (`nltk.cluster`). A quick Google search gives me this package, too, but I've never used it.
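To make that concrete, here is a hedged sketch that tokenizes and stems two toy documents with nltk, then groups near-identical stems across them by edit distance. The example documents, the threshold of 1, and the naive pairwise grouping are illustrative choices, not a recommended pipeline:

```python
# Tokenize and stem two toy documents, then pair up stems from the two
# documents whose edit distance is small. Threshold and texts are illustrative.
from nltk import edit_distance
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

doc_a = "The clustering algorithm grouped similar words together."
doc_b = "We cluster words and phrases that look similar."

tokenizer = RegexpTokenizer(r"\w+")  # simple word tokenizer, no corpus download needed
stemmer = PorterStemmer()

stems_a = {stemmer.stem(t.lower()) for t in tokenizer.tokenize(doc_a)}
stems_b = {stemmer.stem(t.lower()) for t in tokenizer.tokenize(doc_b)}

# Naive grouping: keep cross-document pairs whose stems are within edit distance 1.
pairs = [(a, b) for a in stems_a for b in stems_b if edit_distance(a, b) <= 1]
print(pairs)  # e.g. ('cluster', 'cluster'), ('word', 'word'), ('similar', 'similar')
```

For a real project you would likely replace the pairwise threshold with a proper clustering step (e.g., hierarchical clustering on a precomputed distance matrix), but the preprocessing shown here is the part the answer is pointing at.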
Edit: Upon reflection, I might have misunderstood your question – are you clustering documents, or words/phrases?