Why is softmax regression often written without the bias term?

I am familiar with softmax regression being written as:

$$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}$$
for the chance of the class of $Y$ being $y$,
given an observation of $X$ being $x$,
and using subscripts to denote selecting the $i$th column of a matrix, and the $i$th element of a vector. That is the formulation used in this answer.
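
For concreteness, here is a minimal NumPy sketch of that formulation; the shapes and values of `W`, `b`, and `x` are arbitrary illustrative choices, not anything taken from the linked answer:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

# Illustrative sizes: 3 classes, 4 features (arbitrary random values)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # one row of weights per class
b = rng.normal(size=3)           # one bias per class
x = rng.normal(size=4)           # a single observation

p = softmax(W @ x + b)           # P(Y = y | X = x) for each class y
print(p, p.sum())                # probabilities sum to 1
```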

But when I look at other sources,
e.g. Wikipedia and
ufldl.stanford.edu,

they use the formula:
$$P(Y=y\mid X=x)=\frac{e^{[Wx]_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}}$$

It seems to me that the bias term $b$ is clearly needed to handle the case of the classes not being balanced.

When we split the terms up:
$$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}=\frac{e^{[Wx]_{y}}\,e^{b_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}\,e^{b_{i}}}$$
It also would seem to correspond with the prior probability term in Bayes' theorem:
$$P(Y=y\mid X=x)=\frac{P(X=x\mid Y=y)\,P(Y=y)}{\sum_{\forall i}P(X=x\mid Y=i)\,P(Y=i)}$$
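
To make that correspondence concrete: if the features were completely uninformative, say $Wx=0$ for every $x$, the model would reduce to
$$P(Y=y\mid X=x)=\frac{e^{b_{y}}}{\sum_{\forall i}e^{b_{i}}},$$
a fixed distribution over the classes, so $b_{y}$ acts like a log prior for class $y$ (up to an additive constant).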

It seems to me that it is required, but maybe I am missing something.
Why is it left out in so many sources?

If you use matrix notation, then

$$ \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k $$

can be defined in terms of a design matrix that already contains a column of ones for the intercept:

$$ \mathbf{X} = \left[ \begin{array}{cccc} 1 & x_{1,1} & \dots & x_{1,k} \\ 1 & x_{2,1} & \dots & x_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \dots & x_{n,k} \end{array} \right] $$

so writing $\beta_0 + \dots$ is redundant.
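
To spell that out for softmax regression, here is a minimal NumPy check (the `softmax` helper and the random values are purely illustrative): appending a constant $1$ to $x$ and appending $b$ as an extra column of $W$ reproduces $Wx+b$ exactly, so the two formulations give the same probabilities.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))          # weights: 3 classes, 4 features
b = rng.normal(size=3)               # explicit bias, one per class
x = rng.normal(size=4)               # a single observation

# Absorb the bias: augment x with a constant 1 and W with b as an extra column
W_aug = np.hstack([W, b[:, None]])   # shape (3, 5)
x_aug = np.append(x, 1.0)            # shape (5,)

p_with_bias = softmax(W @ x + b)
p_augmented = softmax(W_aug @ x_aug)
print(np.allclose(p_with_bias, p_augmented))   # True
```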
