Recently I have been playing with the pretrained GloVe word embedding model for Twitter:
http://nlp.stanford.edu/projects/glove/
I notice that common stopwords are present in the model; that is, no stopword filtering was done before the model was trained.
I wonder if stopword filtering would improve performance in terms of:
- higher correlation (or cosine similarity) between semantically similar words
- a less noisy sum when aggregating a set of words, since I've heard that the main problem with aggregating word embeddings is the poor weighting given to the significant portion of noisy words in the set (a rough sketch of what I mean appears below)

Or does filtering stopwords cause problems that I am not seeing?
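To make the aggregation concrete, here is a rough sketch of what I mean; the GloVe file path, the toy stopword list, and the example sentences are just placeholders, not part of the pretrained model:

```python
# Sketch of "aggregation of a set of words", with and without stopword filtering.
# The file path and stopword set below are placeholders for illustration only.
import numpy as np

def load_glove(path):
    """Load GloVe text-format vectors into a {word: np.array} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

def aggregate(words, vectors, stopwords=None):
    """Sum the embeddings of the words, optionally skipping stopwords."""
    kept = [w for w in words
            if w in vectors and (stopwords is None or w not in stopwords)]
    if not kept:
        return None
    return np.sum([vectors[w] for w in kept], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# vectors = load_glove("glove.twitter.27B.100d.txt")  # placeholder path
# stop = {"the", "a", "of", "in", "is", "and", "to"}  # toy stopword list
# s1 = "the cat sat on the mat".split()
# s2 = "a dog lay on a rug".split()
# print(cosine(aggregate(s1, vectors), aggregate(s2, vectors)))
# print(cosine(aggregate(s1, vectors, stop), aggregate(s2, vectors, stop)))
```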
Best Answer
One common approach is simply to subsample the most frequent words in the corpus. That way they have less effect on the model, but you don't have to remove them completely. It can also speed up training, because you spend less time on stopwords, which carry relatively little information compared to how often they appear in the corpus.
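As a rough illustration, here is a minimal sketch of that kind of subsampling, in the spirit of word2vec's frequency-based subsampling heuristic; the threshold `t` and the placeholder corpus path are assumptions, not GloVe settings:

```python
# Minimal sketch of frequency-based subsampling: keep a word with probability
# p_keep = (sqrt(f/t) + 1) * t / f, where f is its relative frequency.
# Rare words are always kept; very frequent words are dropped most of the time.
import math
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Randomly drop very frequent tokens from a token list."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total                    # relative frequency of the word
        p_keep = (math.sqrt(f / t) + 1) * t / f  # >1 for rare words, small for frequent ones
        if rng.random() < p_keep:
            kept.append(w)
    return kept

# corpus = open("tweets.txt").read().lower().split()  # placeholder corpus
# trimmed = subsample(corpus)                          # "the", "a", ... mostly dropped
```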