I am trying to predict a categorical variable (type of job, with three classes) from a dataset that consists mainly of continuous variables (years of education, salary, etc.), using the Naive Bayes classifier in the package 'klaR'. When I train the classifier on the continuous variables, I get very bad predictions on an out-of-sample dataset. However, when I bin the continuous variables into categories (making them categorical), I get pretty good predictions. Binning discards information, so I would have expected worse results, not better. Is the problem that I am not correctly specifying that the variables are continuous?
My code has the following form:
    m <- NaiveBayes(Job ~ ., data = JobDataTrain)  # Training in sample
    m_predict <- predict(m, JobDataTest)
Best Answer
The difference that you are seeing is likely a result of the fact that Naïve Bayes (NB) works quite differently on categorical and numerical variables. Explaining why requires a little notation.
Assume that we are trying to predict Type, which takes on values $\{ t_1, t_2, \ldots, t_n \}$. For each variable V, NB makes an estimate of $P(Type = t_i | V = v)$. For categorical variables, there is a simple way to compute this: take all points in the training data with $V=v$ and compute the proportion belonging to each class $t_i$. For continuous variables, NB makes another naïve assumption: for each $t_i$, the values of V among the data with $Type = t_i$ are normally distributed. For each $t_i$, the mean and standard deviation of V are computed over the points with $Type = t_i$. This normal approximation is used to estimate $P(V=v | Type = t_i)$, which, together with Bayes' rule, is used to estimate (something proportional to) $P(Type = t_i | V=v)$.
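Written out, the continuous case described above looks roughly like this. The symbols $\mu_i$ and $\sigma_i$ are introduced here for illustration and stand for the mean and standard deviation of V among the training points with $Type = t_i$. Strictly speaking, the Gaussian term is a density rather than a probability.

$$
P(Type = t_i \mid V = v) \;\propto\; P(Type = t_i)\, P(V = v \mid Type = t_i),
\qquad
P(V = v \mid Type = t_i) \approx \frac{1}{\sigma_i \sqrt{2\pi}} \exp\!\left(-\frac{(v - \mu_i)^2}{2\sigma_i^2}\right)
$$

The predicted class is the $t_i$ for which the right-hand side of the first expression is largest.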
Of course not all data is normally distributed, so if your continuous variable does not match that model well, this Gaussian approximation may provide bad estimates of the needed probabilities.
Here is an (artificial) example of the behavior that you saw.
    ### Response to: https://stats.stackexchange.com/q/215146/141956
    library(klaR)
    library(sm)
    ## One dimensional data
    set.seed(2017)
    x = c(runif(200,0,1), runif(50,2,3),
          runif(50,4,5), runif(200,6,7))
    Type = factor(c(rep(1,200), rep(2,50),
                    rep(1,50), rep(2,200)))
    df = data.frame(x, Type)
    sm.density.compare(x, Type, lty=c(2,2))
For both types, the distribution is non-Gaussian. Nevertheless, NB uses a Gaussian approximation.
    NB = NaiveBayes(Type ~ x, data=df)
    table(predict(NB, df)$class, df$Type)
    #       1   2
    #   1 200  50
    #   2  50 200
    NB$tables
    # $x
    #       [,1]     [,2]
    # 1 1.278952 1.640579
    # 2 5.703749 1.628859
    mean(x[Type==1])
    # [1] 1.278952
    sd(x[Type==1])
    # [1] 1.640579
    sm.density.compare(x, Type, lty=c(2,2))
    lines(seq(-2,9,0.1), dnorm(seq(-2,9,0.1), 1.3, 1.63), col="red", lwd=2)
    lines(seq(-2,9,0.1), dnorm(seq(-2,9,0.1), 5.7, 1.63), col="green", lwd=2)
NB represents both types by Gaussians with sd of about 1.63 and means at about 1.3 and 5.7. The dashed red distribution is approximated by the bold red curve, and the dashed green distribution is approximated by the bold green curve. These curves represent the data poorly, and they lead NB to predict the wrong type for all of the points in the smaller bumps. The Gaussian distributions are just not doing a good job of representing this data.
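To see why the points in the smaller bumps end up on the wrong side, it can help to redo the Gaussian calculation by hand. The following is a minimal base-R sketch, not klaR's actual implementation. It assumes class priors equal to the class proportions and a normal density per class, and the helper name `gaussian_nb_predict` is made up for this illustration.

```r
# Same artificial data as in the example above
set.seed(2017)
x <- c(runif(200, 0, 1), runif(50, 2, 3), runif(50, 4, 5), runif(200, 6, 7))
Type <- factor(c(rep(1, 200), rep(2, 50), rep(1, 50), rep(2, 200)))

# Per-class prior, mean and standard deviation of x
groups <- split(x, Type)
prior  <- sapply(groups, length) / length(x)
mu     <- sapply(groups, mean)
sigma  <- sapply(groups, sd)

# Hypothetical helper: score each class by prior * Gaussian density
# and return the label of the class with the highest score.
gaussian_nb_predict <- function(v) {
  scores <- vapply(levels(Type),
                   function(k) prior[[k]] * dnorm(v, mu[[k]], sigma[[k]]),
                   numeric(length(v)))
  scores <- matrix(scores, nrow = length(v))
  levels(Type)[max.col(scores, ties.method = "first")]
}

# One point from each of the two smaller bumps
# (true types are 2 for x = 2.5 and 1 for x = 4.5)
gaussian_nb_predict(c(2.5, 4.5))

# Confusion table over the whole training set
pred <- gaussian_nb_predict(x)
table(pred, Type)
```

With two Gaussians of nearly equal spread, the decision boundary sits roughly halfway between the two means, so everything to the left of it is assigned to type 1 and everything to the right to type 2, regardless of which bump a point actually belongs to.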
What if we discretize the data before applying NB?
    ## Discretize ##
    DiscX = cut(x, breaks=0:7)
    Ddf = data.frame(DiscX, Type)
    NB2 = NaiveBayes(Type ~ DiscX, data=Ddf)
    table(predict(NB2, Ddf)$class, df$Type)
    #       1   2
    #   1 250   0
    #   2   0 250
Now, it correctly classifies all of the points in the training data. In this case, the discretized form of the data captures the structure much better than the Gaussians.
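The categorical calculation can also be reproduced by hand, which makes it clearer why binning helps here. This is a base-R sketch under the same assumptions as the example (unit-width bins and priors equal to the class proportions), and it is not klaR's code. Empty bins are dropped with `droplevels`, which is a choice made for this illustration.

```r
# Same artificial data as in the example above
set.seed(2017)
x <- c(runif(200, 0, 1), runif(50, 2, 3), runif(50, 4, 5), runif(200, 6, 7))
Type <- factor(c(rep(1, 200), rep(2, 50), rep(1, 50), rep(2, 200)))

# Unit-width bins; drop the bins that contain no points
DiscX <- droplevels(cut(x, breaks = 0:7))

# Counts per (bin, class), then P(bin | Type) and class priors
counts <- unclass(table(DiscX, Type))
cond   <- prop.table(counts, margin = 2)
prior  <- as.numeric(prop.table(table(Type)))

# Score each class within each bin: prior * P(bin | Type)
scores <- sweep(cond, 2, prior, `*`)

# Winning class per bin, then looked up for every observation
bin_pred <- colnames(scores)[apply(scores, 1, which.max)]
names(bin_pred) <- rownames(scores)
pred <- unname(bin_pred[as.character(DiscX)])

counts
table(pred, Type)
```

Every occupied bin contains points of only one type, so the per-bin comparison has nothing to confuse. The Gaussian fit, by contrast, has to summarise each type with a single mean and standard deviation.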
However, I want to caution that just because your data is not Gaussian does not mean that Naïve Bayes will give a bad answer. In fact, NB can do surprisingly well, even on non-normally distributed data.