I am familiar with softmax regression being written as:

$$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}$$

for the chance of the class of $Y$ being $y$,

given observations of $X$ being $x$,

and using subscripts to denote selecting the $i$th column of a matrix, or the $i$th element of a vector. That is the formulation used in this answer.

But when I look at other sources, e.g. Wikipedia and ufldl.stanford.edu, they use the formula:

$$P(Y=y\mid X=x)=\frac{e^{[Wx]_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}}$$

It seems to me that the bias term $b$ is clearly needed to handle the case of the classes not being balanced.

We can see its role when we split the terms up:

$$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}=\frac{e^{[Wx]_{y}}\,e^{b_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}\,e^{b_{i}}}$$
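The factorisation above can be checked numerically. The following is a minimal sketch with made-up weights, showing that $\mathrm{softmax}(Wx+b)$ equals the normalised product $e^{[Wx]_i}\,e^{b_i}$:

```python
import numpy as np

# Hypothetical numbers: 3 classes, 4 features.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.array([0.5, -1.0, 2.0])
x = rng.normal(size=4)

def softmax(z):
    z = z - z.max()        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

with_bias = softmax(W @ x + b)

# Factored form: unnormalised scores e^{[Wx]_i} * e^{b_i}, then normalise.
scores = np.exp(W @ x) * np.exp(b)
factored = scores / scores.sum()

assert np.allclose(with_bias, factored)
```

The bias thus acts as a per-class multiplicative weight on the unnormalised probabilities, which is why it looks like a prior.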

It also would seem to correspond with the prior probability term in Bayes' theorem:

$$P(Y=y\mid X=x)=\frac{P(X=x\mid Y=y)\,P(Y=y)}{\sum_{\forall i}P(X=x\mid Y=i)\,P(Y=i)}$$

It seems like it is required to me, but maybe I am missing something.

Why is it left out in so many sources?


#### Best Answer

If you use matrix notation, then the linear predictor

$$ \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k $$

can be defined in terms of a design matrix that already contains a column of ones for the intercept:

$$ \mathbf{X} = \left[ \begin{array}{cccc} 1 & x_{1,1} & \dots & x_{1,k} \\ 1 & x_{2,1} & \dots & x_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \dots & x_{n,k} \end{array} \right] $$

so writing $\beta_0 + \dots$ explicitly is redundant.
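The same trick works for the softmax formulation in the question: append a constant-1 feature to $x$ and absorb $b$ as an extra column of $W$. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical numbers: 3 classes, 4 features.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)

# Augmented weight matrix [W | b] and augmented input [x; 1]:
# the bias column multiplies the constant-1 feature, so
# W_aug @ x_aug == W @ x + b.
W_aug = np.hstack([W, b[:, None]])
x_aug = np.append(x, 1.0)

assert np.allclose(W @ x + b, W_aug @ x_aug)
```

So sources writing $Wx$ without a $b$ are not dropping the bias; they are assuming it has been folded into $W$ via the column of ones.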
