I am familiar with softmax regression being written as:
$$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}$$
for the chance of the class of $Y$ being $y$,
given observations of $X$ as being $x$,
and using subscripts to denote selecting the $i$th column of a matrix or the $i$th element of a vector. That is the formulation used in this answer.
But when I look at other sources,
e.g. Wikipedia and
ufldl.stanford.edu,
they use the formula:
$$P(Y=y\mid X=x)=\frac{e^{[Wx]_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}}$$
It seems to me that the bias term $b$ is clearly needed to handle the case of the classes not being balanced.
When we split the terms up:
$$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}=\frac{e^{[Wx]_{y}}\,e^{b_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}\,e^{b_{i}}}$$
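The split above can be checked numerically; this is a small sketch with made-up dimensions (3 classes, 4 features) showing that the affine form and the factored form give the same probabilities:

```python
import numpy as np

# Hypothetical small example: 3 classes, 4 features.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Left-hand side: softmax of the affine scores Wx + b.
p_affine = softmax(W @ x + b)

# Right-hand side: the split form e^{[Wx]_y} * e^{b_y}, normalized.
e = np.exp(W @ x) * np.exp(b)
p_split = e / e.sum()

print(np.allclose(p_affine, p_split))  # the two forms agree
```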
It also would seem to correspond with the prior probability term in Bayes' theorem:
$$P(Y=y\mid X=x)=\frac{P(X=x\mid Y=y)\,P(Y=y)}{\sum_{\forall i}P(X=x\mid Y=i)\,P(Y=i)}$$
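To illustrate the prior-probability reading: if the weights are zero (so the features carry no information), the softmax of the bias alone reproduces whatever class prior we encode in $b$. A minimal sketch with an assumed prior of $(0.5, 0.3, 0.2)$:

```python
import numpy as np

# Hypothetical illustration: choose b = log(prior), so that with zero
# weights the softmax output is exactly the class prior.
b = np.log(np.array([0.5, 0.3, 0.2]))
W = np.zeros((3, 4))
x = np.zeros(4)

scores = W @ x + b
p = np.exp(scores) / np.exp(scores).sum()
print(p)  # recovers the prior [0.5, 0.3, 0.2]
```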
It seems to me that it is required, but maybe I am missing something.
Why is it being left out in so many sources?
Best Answer
If you use matrix notation, then
$$\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$$
can be defined in terms of a design matrix that already contains a column of ones for the intercept
$$\mathbf{X} = \left[ \begin{array}{cccc} 1 & x_{1,1} & \dots & x_{1,k} \\ 1 & x_{2,1} & \dots & x_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \dots & x_{n,k} \end{array} \right]$$
so writing $\beta_0 + \dots$ explicitly is redundant.
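The same trick works for softmax regression: appending a constant-1 feature to $x$ and the bias as an extra column of $W$ gives identical scores. A sketch with made-up dimensions:

```python
import numpy as np

# Hypothetical example: absorbing the bias into the weight matrix by
# adding a constant-1 feature, as the design-matrix convention does.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))  # 3 classes, 4 features
b = rng.normal(size=3)
x = rng.normal(size=4)

# Augment: prepend 1 to x, and prepend b as an extra column of W.
x_aug = np.concatenate(([1.0], x))
W_aug = np.hstack([b[:, None], W])

print(np.allclose(W @ x + b, W_aug @ x_aug))  # identical scores
```

So the bias is not actually missing in those sources; it is folded into $W$ by convention.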