I have around 40,000 text files to preprocess (for document classification). I used R (with the tm package) for prototyping and am now looking for an equivalent Java library for production.
However, for a very fundamental task, namely text preprocessing, I ran into a strange problem. With Weka I apply punctuation and stop-word removal, and I apply the same operations in R. The resulting vocabulary (term) sizes should therefore be roughly the same. Yet Weka returns a vocabulary (attributes in the ARFF file) of only 35,000 terms, while R reports more than 1 million distinct terms.
Can anyone help me understand this problem, or recommend some more reliable Java libraries for text preprocessing?
Best Answer
Did you apply the StringToWordVector filter in Weka? If so, you did more than just punctuation and stop-word removal. StringToWordVector outputs the document-term matrix of the input text files, so once the above-mentioned preprocessing is done, Weka creates one attribute (term) for each unique word. 35k terms sounds plausible for 40k texts.
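For reference, a minimal sketch of that Weka pipeline (the corpus directory name, delimiter set, and words-to-keep value are assumptions; stop-word options also differ between Weka versions and are omitted here):

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.core.tokenizers.WordTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class WekaDtmSketch {
    public static void main(String[] args) throws Exception {
        // Load raw text files; TextDirectoryLoader expects one sub-directory per class.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("corpus"));   // hypothetical path
        Instances raw = loader.getDataSet();

        // StringToWordVector builds the document-term matrix:
        // one attribute per distinct term that survives tokenisation.
        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);

        // The tokenizer delimiters double as punctuation removal here.
        WordTokenizer tokenizer = new WordTokenizer();
        tokenizer.setDelimiters(" \r\n\t.,;:'\"()?!");
        filter.setTokenizer(tokenizer);

        // setWordsToKeep caps the vocabulary; a low value shrinks the
        // attribute count well below the true number of distinct terms.
        filter.setWordsToKeep(100000);

        filter.setInputFormat(raw);
        Instances dtm = Filter.useFilter(raw, filter);

        System.out.println("Vocabulary size (attributes): " + dtm.numAttributes());
    }
}
```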
The preprocessing in R seems to have been only the punctuation and stop-word removal, so the 1 million figure is likely a count of words, not unique words, across the 40k documents. Are your text files roughly 25 words long on average (40,000 × 25 = 1,000,000)? If not, then something else is indeed going on.
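To make that distinction concrete, here is a toy Java sketch (illustrative data only, not the original corpus) contrasting the total token count with the number of distinct terms:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenCountSketch {
    public static void main(String[] args) {
        // Toy "corpus"; the real case would have ~40,000 documents.
        List<String> docs = Arrays.asList(
                "the cat sat on the mat",
                "the dog sat on the cat");

        long totalTokens = 0;
        Set<String> distinctTerms = new HashSet<>();
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                totalTokens++;            // what a raw token count reports (~1M in the question)
                distinctTerms.add(token); // what a doc-term matrix reports (~35k in the question)
            }
        }
        System.out.println("total tokens:   " + totalTokens);            // 12
        System.out.println("distinct terms: " + distinctTerms.size());   // 6
    }
}
```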