Solved – Is Naive Bayes suitable for large datasets with thousands of features

I have a data set with 100 million rows and 15,000 categorical variables, each taking 0/1 values. My target is also a binary 0/1 variable. Is Naive Bayes suitable in terms of computational performance and predictive accuracy? My main concern is that the large number of explanatory variables may limit performance, which is why I am not using random forests or SVMs.

Naive Bayes is only as suitable as its results are useful. It's called "naive" for a reason: it assumes the features are conditionally independent given the class, which is a strong assumption. Even so, it's very popular and performs surprisingly well in a variety of situations. It's hard to say more without details about your use case.
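To make the "naive" assumption concrete, here is a tiny sketch of how a Bernoulli naive Bayes classifier scores an observation: the class-conditional likelihood is modeled as a product of per-feature probabilities, whether or not the features are actually independent. All the numbers below are made up for illustration.

```python
import numpy as np

# Illustration of the naive conditional-independence assumption:
# P(x | y) is modeled as the product of per-feature terms P(x_j | y).
# All probabilities here are invented for the example.
prior = np.array([0.7, 0.3])            # P(y=0), P(y=1)
p = np.array([[0.1, 0.5, 0.2],          # P(x_j = 1 | y=0)
              [0.8, 0.4, 0.9]])         # P(x_j = 1 | y=1)

x = np.array([1, 0, 1])                 # observed binary feature vector

# Per-class likelihood under the independence assumption
likelihood = np.prod(np.where(x == 1, p, 1 - p), axis=1)
posterior = prior * likelihood
posterior /= posterior.sum()            # normalize over the two classes
print(posterior)                        # class 1 dominates for this x
```

The whole prediction reduces to multiplying (or, in practice, summing the logs of) per-feature probabilities, which is why the model stays cheap even with thousands of features.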

As for speed, naive Bayes classifiers are fitted in $O(np)$ time, where $n$ is the number of observations and $p$ is the number of features: fitting amounts to a single pass over the data to accumulate counts. Again, it's hard to say if that's good without more details, but it's far better than a kernel support vector machine, whose training time typically scales superlinearly in $n$.
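The single-pass fit can be sketched as follows. This is a minimal Bernoulli naive Bayes estimator written with NumPy for illustration (with Laplace smoothing via an assumed `alpha` parameter), not a production implementation; for real work a library such as scikit-learn's `BernoulliNB` does the same thing with sparse-matrix support.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Fit a Bernoulli naive Bayes model in one O(n * p) pass.

    X: (n, p) matrix of 0/1 features; y: (n,) array of 0/1 labels.
    Returns log-priors and log-probabilities of each feature being
    1 or 0 per class, Laplace-smoothed with `alpha`.
    """
    n, p = X.shape
    counts = np.array([np.sum(y == 0), np.sum(y == 1)])  # class counts
    # Per-class feature counts: the only pass over the full data
    feat = np.array([X[y == 0].sum(axis=0), X[y == 1].sum(axis=0)])
    log_prior = np.log(counts / n)
    # Smoothed estimates of P(x_j = 1 | y)
    theta = (feat + alpha) / (counts[:, None] + 2 * alpha)
    return log_prior, np.log(theta), np.log(1 - theta)

# Small synthetic demo: 1,000 rows, 50 binary features
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 50))
y = rng.integers(0, 2, size=1000)
log_prior, log_theta1, log_theta0 = fit_bernoulli_nb(X, y)
print(log_prior.shape, log_theta1.shape)  # (2,) and (2, 50)
```

Because the fit only accumulates counts, the work grows linearly with both rows and features, which is what makes 15,000 features tractable.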

You might want to consider Vowpal Wabbit, an online learning system that is "able to learn from terafeature datasets with ease." It is designed to run very fast and in parallel. You can read more about it on the Vowpal Wabbit project page.
