I am experimenting with classification algorithms in ML and am looking for a corpus to train a model that distinguishes among text categories such as sports, weather, technology, football, and cricket.
Where can I find a dataset with these categories?
An option would be to crawl Wikipedia for these 30+ categories. Is there a better way to do this?
Edit: I want to train the model on these categories using the bag-of-words approach, then classify new, unseen websites into these predefined categories based on the content of each webpage.
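To make the bag-of-words idea concrete, here is a minimal sketch using only the Python standard library. It counts words per category and scores new text with a multinomial Naive Bayes rule. Naive Bayes is swapped in here only as one common choice for word-count features, and the `BagOfWordsNB` class, the `tokenize` helper, and the four-sentence corpus are all made up for illustration. Real training data would need many documents per category, and text would first have to be extracted from each webpage.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class BagOfWordsNB:
    """Hypothetical helper: multinomial Naive Bayes over word counts."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            # Add-one smoothing keeps unseen (label, word) pairs from zeroing the score.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy corpus; one sentence per category is only enough for a demo.
train_texts = [
    "the striker scored a late goal in the football match",
    "the batsman hit a century in the cricket test match",
    "heavy rain and strong wind are forecast for tomorrow",
    "the new smartphone ships with a faster processor",
]
train_labels = ["football", "cricket", "weather", "technology"]

model = BagOfWordsNB().fit(train_texts, train_labels)
print(model.predict("rain and wind expected tomorrow"))
```

Word order is ignored entirely, which is what makes this a bag of words: only the counts of known words decide the category.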
Best Answer
The 20 newsgroups text dataset that ships with scikit-learn has 11,314 training and 7,532 test samples, with 10,000 or more sparse features. The newsgroup categories are:
alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc
For subsets of 2, 3, 4, and 5 of these newsgroups (the worst ones), I get 83.2%, 82.6%, 82.2%, and 80.6% correct, respectively, using the fast SGD classifier.
(The first run of fetch_20newsgroups will take a while to download and cache the data.)
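A rough sketch of how such an experiment might be set up with scikit-learn follows. The TF-IDF weighting, the default `SGDClassifier` settings, and the stripping of headers, footers, and quotes are assumptions, not the configuration behind the numbers above, so the accuracy it returns will likely differ. `build_pipeline` and `evaluate_on_newsgroups` are names invented for this example.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def build_pipeline():
    """TF-IDF weighted bag of words followed by a linear SGD classifier."""
    return make_pipeline(
        TfidfVectorizer(stop_words="english"),
        SGDClassifier(random_state=0),
    )


def evaluate_on_newsgroups(categories):
    """Train on the 'train' split and return accuracy on the 'test' split."""
    # Assumption: drop headers, footers and quoted replies so the model
    # learns from message content rather than from posting metadata.
    strip = ("headers", "footers", "quotes")
    train = fetch_20newsgroups(subset="train", categories=categories, remove=strip)
    test = fetch_20newsgroups(subset="test", categories=categories, remove=strip)
    model = build_pipeline().fit(train.data, train.target)
    return model.score(test.data, test.target)
```

Calling, for example, `evaluate_on_newsgroups(["rec.sport.baseball", "rec.sport.hockey"])` downloads the data on first use and returns the fraction of test posts classified correctly. The same fitted pipeline can then be applied to text extracted from new webpages with `model.predict([...])`.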