I'm building a model that predicts the subreddit of a given Reddit submission. I have a question about pre-processing the text of the submissions. The order of my pre-processing steps was:
- Remove punctuation
- Remove stopwords
- Lemmatize the remaining text
I then fed the parsed text into a CountVectorizer.
While this got me decent accuracy with my Naive Bayes model, I discovered I was getting much better results (roughly a 5% gain) when I skipped step 2 altogether. What are some reasons why this could be happening?
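In code, the pipeline looks roughly like this (a simplified sketch; I'm assuming NLTK for the stop word list and lemmatizer and scikit-learn for the vectorizer, and `submissions` stands in for my list of submission texts):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# One-time downloads of the NLTK resources used below
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text, remove_stopwords=True):
    # 1. Remove punctuation (keep only word characters and whitespace)
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = text.split()
    # 2. Remove stop words -- the step I'm experimenting with skipping
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Lemmatize the remaining tokens
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

# The cleaned submissions then go into a plain CountVectorizer
vectorizer = CountVectorizer()
# X = vectorizer.fit_transform([preprocess(s) for s in submissions])
```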
Best Answer
Stop word lists typically contain words such as "a", "an", "the", and "it". Removing them is often beneficial when we are classifying by topic, since topics are well described by nouns and adjectives.
However, some text classification tasks are more abstract. Consider classifying fiction and non-fiction articles on the same topic: what would the difference between these two writing styles be? They would probably use the same nouns, but what about the frequency of "the" vs. "an", or "he" vs. "they"?
Stop words are, by definition, words that carry no information for your classification task, and there is no universal set of stop words that will improve every text classification task. If you use an off-the-shelf stop word list, you could be throwing away information that is valuable for your task, which would explain why skipping that step improves your accuracy.
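One practical way to check this is to treat the stop word list as a hyperparameter and compare the settings on held-out data. Here is a minimal sketch using scikit-learn's built-in English list and the 20 newsgroups corpus as a stand-in; swap in your own preprocessing and Reddit data:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in corpus; replace with your submission texts and subreddit labels
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

for stop_words in (None, "english"):
    model = make_pipeline(CountVectorizer(stop_words=stop_words), MultinomialNB())
    scores = cross_val_score(model, data.data, data.target, cv=5)
    print(f"stop_words={stop_words!r}: mean accuracy {scores.mean():.3f}")
```

Whichever setting scores better in cross-validation is the one to keep, and the same comparison extends to a custom, task-specific stop word list.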