Solved – corpus specifically for categories like sports, entertainment, or health

I am experimenting with classification algorithms in ML and am looking for a corpus to train my model to distinguish among text categories such as sports, weather, technology, football, cricket, etc.

Where can I find a dataset with these categories?

An option would be to crawl Wikipedia for these 30+ categories. Is there a better way to do this?

Edit: I want to train the model using the bag-of-words approach for these categories, then classify new/unknown websites into these predefined categories based on the content of each webpage.

The scikit-learn 20 newsgroups text dataset has 11,314 training and 7,532 test samples, with 10,000 or more sparse features once the text is vectorized. The newsgroup categories are:

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

For subsets of 2, 3, 4, and 5 of these newsgroups (the worst ones), I get 83.2, 82.6, 82.2, and 80.6 % correct respectively, using the fast SGD classifier.

(The first run of fetch_20newsgroups will take a while to download and cache the data.)
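
For anyone who wants to try this, here is a minimal sketch of a bag-of-words pipeline on a subset of the newsgroups. It is not the exact setup behind the numbers above: the TF-IDF weighting, the stop-word list, the default SGDClassifier settings, and the helper names build_classifier and newsgroup_accuracy are all my own assumptions for illustration.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def build_classifier():
    # Hypothetical helper: bag of words (TF-IDF weighted, an assumption)
    # feeding a linear model trained with stochastic gradient descent.
    return make_pipeline(
        TfidfVectorizer(stop_words="english"),
        SGDClassifier(random_state=0),
    )


def newsgroup_accuracy(categories):
    # Hypothetical helper: train on the "train" split and score on the
    # "test" split for the chosen newsgroups.
    # The first call downloads and caches the dataset.
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)
    clf = build_classifier().fit(train.data, train.target)
    return clf.score(test.data, test.target)
```

Calling newsgroup_accuracy(["rec.sport.baseball", "rec.sport.hockey"]) returns the fraction of test posts classified correctly for that pair. The same pipeline can be fitted on any list of texts and labels, so once you have page text for your own 30+ categories you can reuse build_classifier() unchanged.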
