I am implementing the Baum-Welch Algorithm for training a Hidden Markov Process, to basically better understand the training process.

I have implemented the iterative procedures described in Rabiner's classic paper. My Implementation is in `Wolfram Mathematica`

. The problem I am facing is the scaling step, if the sequence is 100 observations long then the probabilities easily cross the bounds of $10^{-30}$.

My question is can anyone explain or provide a link where the calculation of scaled forward and backward matrices is described?

For the forward procedure the code is as follows:

` ForwardProcedure[TM_, EM_, P_, N_, M_, Seq_] := Module[{DP, i, j, DPPart}, DP = ConstantArray[0, {N, M}]; DP[[1, All]] = Table[P[[i]]*EM[[i, Part[Seq, 1]]], {i, 1, N}]; DP = Table[ If[j == 1, P[[i]]*EM[[i, Part[Seq, 1]]], 0] , {i, 1, N}, {j, 1, Length@Seq}]; For[j = 2, j <= Length@Seq, j++, DPPart = DP[[All, j - 1]]; For[i = 1, i <= N, i++, DP[[i, j]] = Dot[DPPart, TM[[All, i]]]*EM[[i, Part[Seq, j]]]; ]; ]; DP ] `

Essentially `DP[[i, j]] = Dot[DPPart, TM[[All, i]]]*EM[[i, Part[Seq, j]]];`

This line is the update. Which simply takes the needed columns from Transition (TM), The alpha (DP) matrices and multiplies with the associated Emission value.

How can I have a scaling factor independent of i, when not all the t-s of the row of the DP matrix are filled?

#### Best Answer

## Background and Notation

Let $x_1, ldots, x_T$ be the observations, and $z_1, ldots, z_T$ be the hidden states. The forward-backward recursions are written in terms of $alpha(z_n)$ and $beta(z_n)$.

$alpha(z_n) = p(x_1,ldots,x_n,z_n)$ is the joint distribution of all the heretofore observed data and the most recent state. As $n$ gets large, this gets very small. This is just because of simple properties of probabilities. Probabilities of subsets get smaller. This is axiomatic.

$beta(z_n) = p(x_{n+1},ldots,x_T|z_n)$ is the probability of all the future data given the current state. This gets small as $n$ goes down, for the same reasons as above.

## Forward Algorithm

You can see that both of these things become very small, and hence cause numerical issues, whenever you have a non-tiny dataset. So instead of using the forward recursion $$ alpha(z_n) = p(x_n|z_n) sum_{z_{n-1}}alpha(z_{n-1})p(z_n|z_{n-1}) $$ use the **filtering recursion**: $$ p(z_n|x_1,ldots,x_n) = frac{p(x_n|z_n)sum_{z_{n-1}} p(z_n|z_{n-1})p(z_{n-1}|x_1,ldots,x_{n-1})}{p(x_n|x_1,ldots,x_{n-1})}. $$ You can get this algebraically by dividing both sides of the first recursion by $p(x_1,ldots,x_n)$. But I would program it by keeping track of the filtering distributions, not the $alpha$ quantities. You also don't need to worry about the normalizing constant usually–the last step you can normalize your probability vector if your state space isn't too large.

I am more familiar with the non-HMM state space model literature, and the filtering recursions are probably the most common set of recursions. I have found a few results on the internet where HMM people call this procedure "scaling." So it's nice to find this connection and figure out what all this $alpha$ stuff is.

## Backward Algorithm

Next, the traditional HMM backward algorithm is stated as $$ beta(z_n) = sum_{z_{n+1}}beta(z_{n+1})p(x_{n+1}|z_{n+1})p(z_{n+1}|z_n). $$ This one is trickier though. I have found a few websites that say to divide both sides by $p(x_1,ldots,x_n)$ again, and I have also found resources that say to divide both sides by $p(x_{t+1},ldots,x_T|x_1,ldots,x_t)$. I'm still working this out, though. Also, I suspect there's another connection between this and the backward recursions in other areas of state space model literature (for marginal and joint smoothing distributions). I'll keep you posted.

## Edit:

It turns out if you multiply both sides of the HMM backward equation above by $p(x_1, ldots, x_n,z_n)/p(x_1,ldots,x_T)$ you get the backward equation for the marginal smoothing distribution, which are more common in non-HMM state space models.

$$ p(z_n|x_1,ldots,x_T) = p(z_n|x_1,ldots,x_n) sum_{z_{n+1}}frac{p(z_{n+1}|z_n)}{p(z_{n+1}|x_1,ldots,x_n)}p(z_{n+1}|x_1,ldots,x_T). $$

I don't know if this is standard for using Baum-Welch, but it's pretty standard with other state space models. Cool.

## Edit 2:

https://www.springer.com/us/book/9780387402642 define on page 63 the normalized backward function as

$$ hat{beta}(z_n) = beta(z_n)frac{p(x_{1:n})}{p(x_{1:T})} = frac{p(z_n mid x_{1:T}) }{p(z_n mid x_{1:n})}, $$ which, as they mention on page 65, has the interpretation of the ratio of two marginal smoothing distributions.