In every example I see(spam, negative vs positive tweet , weather study…) there is always the assumption that the input features (or variables) are independent.
In order for me to be able to understand whether naive Bayes can be applied to my problem, I need to truly understand what dependent variables means.
I know what dependent variables mean in mathematics but I can't seem to be able to grasp the notion in this sense because I've never come across a case study where the person says that they have dependent features so they can't use a particular data mining algorithm.
My main confusion is due to the fact that in my mind, all features are connected somehow(and I picture them as being dependent) for example in the weather case, if we choose humidity, temperature and wind as inputs aren't they related to a certain degree? like if it's really windy the temperature usually drops. I feel like maybe I'm misunderstanding dependency when it comes to data mining algorithms.
Can someone provide me with an example in which the independent features assumption can't apply?
Best Answer
In Naive Bayes, they are talking about conditional dependence. First let's start with conditional independence. $X_1, ldots, X_n$ are conditionally independent given $C=c$ if for all $x_1, ldots, x_n,c$ $$ F_{X_1,ldots,X_n|C}(x_1,ldots,x_n|c) = F_{X_1|C}(x_1|c)F_{X_2|C}(x_2|c) times cdots times F_{X_n|C}(x_n|c), $$ aka the joint cumulative (conditional) distribution function factors into (conditional) marginal cumulative distribution functions.
So they are dependent if the above is not true. There only has to be one sequence $x_1', ldots, x_n',c'$ such that $$ F_{X_1,ldots,X_n|C}(x_1,ldots,x_n|c) neq F_{X_1|C}(x_1|c)F_{X_2|C}(x_2|c) times cdots times F_{X_n|C}(x_n|c). $$ Intuitively this means that knowing something about one random variable will tell you about the probabilities of the others.
One example where this is true, say $C$ denotes the category of someone who either has a disease or does not. Let's say $1$ means they do and $0$ means they do not. Then let $X_1, ldots, X_n$ be random measurements of this person. Depending on the measurements, $F_{X_1,ldots,X_n|C}(x_1,ldots,x_n|c)$ is not likely to factor, so they would be conditionally (on the disease status) dependent measurements. Intuitively this means that if you're looking at a person with the disease, the doctor is likely to know that the measurements he/she is about to take are likely to be related somehow.