Solved – Why is random forest inconsistent in text mining

Earlier I used SVM (RBF kernel) in text mining with success, and after that I used random forest with success as well on similar text-mining work with long texts. However, in a recent Kaggle competition, when I used random forest on a word-vector matrix (after applying SVD to reduce the dimensionality to 200-400), the result never came close to the SVM in terms of RMSE. Any idea why random forest is inconsistent across different text-mining tasks while SVM performs more consistently?
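For concreteness, here is a minimal sketch of the kind of pipeline I mean, assuming a scikit-learn workflow (the toy corpus, targets, and component count are illustrative only; the actual data was long texts reduced to 200-400 SVD components):

```python
# Word vectors -> SVD -> regressor, compared by cross-validated RMSE.
# Toy corpus and targets are illustrative, not the original data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

texts = [
    "great product works exactly as described",
    "terrible quality broke after one day",
    "decent value for the price overall",
    "would not recommend this to anyone",
]
targets = np.array([5.0, 1.0, 3.5, 1.5])

# n_components=2 only so this toy corpus runs; the question used 200-400.
features = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
X = features.fit_transform(texts)

for name, model in [("random forest", RandomForestRegressor(random_state=0)),
                    ("svm (rbf)", SVR(kernel="rbf"))]:
    rmse = -cross_val_score(model, X, targets, cv=2,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: cross-validated RMSE = {rmse:.3f}")
```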

I have used random forest successfully in text-mining applications, although SVM with a linear kernel, for instance, reached superior classification accuracy. SVM is a good starting point when you are searching for a good classification algorithm without prior knowledge of the problem.

Although random forest runs fast and is suitable for many applications, its results depend on which parameter values you choose. The same is true of most major algorithms.

I suggest you start with a grid search over the following randomForest parameters to check whether it changes your results: number of trees, number of features considered at each split, and maximum depth of the trees. Analyze how the results vary with these parameters, and combine the experiment with cross-validation or a similar scheme; a sketch follows below.
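A sketch of that grid search, assuming scikit-learn's RandomForestRegressor (R's randomForest exposes similar controls such as ntree and mtry); the grid values and synthetic data here are illustrative starting points, not tuned settings:

```python
# Grid search over the three random-forest parameters named above,
# scored by cross-validated RMSE. Synthetic data stands in for the
# SVD feature matrix from the question.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=50, noise=0.1, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 300],      # number of trees
    "max_features": ["sqrt", 0.3, 0.7],  # features considered at each split
    "max_depth": [None, 10, 30],         # maximum depth of the trees
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV RMSE:", -search.best_score_)
```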

You can perform the same procedure for other classifiers, as in the sketch below.
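For example, the equivalent search for an RBF-kernel SVM might look like this (again a sketch; the C and gamma ranges are common starting points, not values from the original setup):

```python
# Same grid-search procedure applied to an RBF-kernel SVM.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=50, noise=0.1, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1.0]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      scoring="neg_root_mean_squared_error", cv=5, n_jobs=-1)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV RMSE:", -search.best_score_)
```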

Is your data normalized? Do you apply standardization or any other preprocessing worth mentioning? This matters here because RBF-kernel SVMs are sensitive to feature scaling, while random forests are largely invariant to it.
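If not, it is worth trying; a minimal sketch of standardizing the features before the SVM, assuming scikit-learn (synthetic data again stands in for the SVD feature matrix):

```python
# Standardize inside a pipeline so the scaler is refit on each
# training fold; the RBF kernel is distance-based, so features on
# very different scales distort it.
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=50, noise=0.1, random_state=0)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
model.fit(X, y)
print(model.predict(X[:3]))
```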
