# Why is softmax regression often written without the bias term?

I am familiar with softmax regression being written as:

\$\$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}\$\$

for the chance of the class of \$Y\$ being \$y\$,
given observations of \$X\$ being \$x\$,
using subscripts to denote selecting the \$i\$th column of a matrix, or the \$i\$th element of a vector. That is the formulation used in this answer.

But when I look at other sources,
e.g. Wikipedia and
ufldl.stanford.edu,

they use the formula:
\$\$P(Y=y\mid X=x)=\frac{e^{[Wx]_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}}\$\$

It seems to me that the bias term \$b\$ is clearly needed to handle the case of the classes not being balanced.

When we split the terms up:
\$\$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}=\frac{e^{[Wx]_{y}}\,e^{b_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}\,e^{b_{i}}}\$\$
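This factorisation is easy to check numerically; here is a minimal sketch (the logits and biases below are made-up values, not from any particular model):

```python
import numpy as np

# Illustrative logits z = Wx and per-class biases b (arbitrary values)
z = np.array([0.5, -1.0, 2.0])
b = np.array([1.0, 0.0, -0.5])

# Combined form: softmax applied to Wx + b
lhs = np.exp(z + b) / np.exp(z + b).sum()

# Split form: e^{[Wx]_y} * e^{b_y}, normalised over all classes
rhs = (np.exp(z) * np.exp(b)) / (np.exp(z) * np.exp(b)).sum()

print(np.allclose(lhs, rhs))  # the two forms agree
```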
It would also seem to correspond to the prior probability term in Bayes' theorem:
\$\$P(Y=y\mid X=x)=\frac{P(X=x\mid Y=y)\,P(Y=y)}{\sum_{\forall i}P(X=x\mid Y=i)\,P(Y=i)}\$\$

It seems like it is required to me, but maybe I am missing something.
Why is it being left out in so many sources?


If you use matrix notation, then

\$\$ \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k \$\$

can be defined in terms of a design matrix that already contains a column of ones for the intercept:

\$\$ \mathbf{X} = \left[ \begin{array}{cccc} 1 & x_{1,1} & \dots & x_{1,k} \\ 1 & x_{2,1} & \dots & x_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \dots & x_{n,k} \end{array} \right] \$\$

so writing \$\beta_0 + \dots\$ is redundant.
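The same trick applies to softmax regression itself: appending a constant-1 feature to \$x\$ and a matching column to \$W\$ reproduces the biased model exactly, which is why many sources drop the explicit \$b\$. A minimal NumPy sketch (the sizes and random values are purely illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
k, d = 3, 4  # 3 classes, 4 features (arbitrary sizes)
W = rng.normal(size=(k, d))
b = rng.normal(size=k)
x = rng.normal(size=d)

# Model with an explicit bias term
p_bias = softmax(W @ x + b)

# Bias absorbed into the weights via a constant-1 feature
W_aug = np.hstack([b[:, None], W])   # prepend b as an extra column of W
x_aug = np.concatenate([[1.0], x])   # prepend a 1 to the feature vector
p_aug = softmax(W_aug @ x_aug)

print(np.allclose(p_bias, p_aug))  # the two parameterisations agree
```

So the two formulations are equivalent; the bias has simply been folded into the weight matrix.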
