Solved – The Effect of Stopword Filtering prior to Word Embedding Training

Recently I have been playing with the pretrained GloVe word embeddings for Twitter:

http://nlp.stanford.edu/projects/glove/

I noticed that common stopwords are present in the model; that is, no stopword filtering was applied before the model was trained.

I wonder if stopword filtering would improve performance in terms of:

  1. higher correlation (or cosine similarity) between semantically similar words, and
  2. a less noisy sum when aggregating a set of word vectors, since I've heard the main problem with aggregating word embeddings is that noisy words make up a large share of the set and end up weighted poorly (a sketch of this kind of aggregation follows the question below).

Or does filtering stopwords introduce problems that I am not seeing?
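
For concreteness, here is a minimal sketch of the aggregation I mean in point 2: averaging the pretrained vectors of a set of tokens, with an option to drop stopwords first. The `glove` lookup, the `load_glove_vectors` name, the stopword list, and the 200-dimensional size are illustrative assumptions, not something taken from the GloVe distribution itself.

```python
import numpy as np

# Hypothetical lookup: word -> pretrained GloVe vector, e.g. loaded from the
# glove.twitter.27B files linked above; the loader itself is not shown here.
# glove = load_glove_vectors("glove.twitter.27B.200d.txt")

STOPWORDS = {"the", "a", "an", "of", "on", "in", "is", "and", "to"}  # illustrative list

def average_embedding(tokens, glove, dim=200, drop_stopwords=True):
    """Average the pretrained vectors of the tokens found in the lookup,
    optionally skipping stopwords so that frequent function words do not
    dominate the resulting sentence/document vector."""
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    vecs = [glove[t] for t in tokens if t in glove]
    if not vecs:
        return np.zeros(dim)  # no known tokens left after filtering
    return np.mean(vecs, axis=0)
```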

One common approach is to subsample the most frequent words in the corpus. This way they have less effect on the model, but you don't have to get rid of them completely. It can also speed up training, because you spend less time on stopwords, which carry relatively little information compared with how often they appear in the corpus. This is the subsampling of frequent words described by Mikolov et al.:

https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
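
As a rough illustration, here is a minimal sketch of that subsampling heuristic: each occurrence of a word w is discarded with probability 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold. The threshold value and the toy corpus below are illustrative, not from the paper or the original post.

```python
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Drop occurrences of very frequent tokens, following the subsampling
    heuristic of Mikolov et al. (2013): each occurrence of word w is
    discarded with probability 1 - sqrt(t / f(w)), where f(w) is the word's
    relative frequency in the corpus and t is a small threshold."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        freq = counts[w] / total
        p_discard = max(0.0, 1.0 - (t / freq) ** 0.5)
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

# Toy usage: "the" is the most frequent token, so it is discarded most often.
# On a real corpus t is usually around 1e-5; a larger value is used here only
# because every word in this tiny example is relatively frequent.
corpus = "the cat sat on the mat because the cat likes the mat".split() * 1000
filtered = subsample(corpus, t=0.05)
print(len(corpus), Counter(filtered).most_common(3))
```

The appeal of this approach over hard stopword filtering is that it needs no hand-curated stopword list and keeps some occurrences of frequent words, so their vectors are still trained, just on fewer examples.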
