# Solved – Scaling step in Baum-Welch algorithm

I am implementing the Baum-Welch algorithm for training a hidden Markov model, mainly to better understand the training process.

I have implemented the iterative procedures described in Rabiner's classic paper. My implementation is in Wolfram Mathematica. The problem I am facing is the scaling step: if the sequence is 100 observations long, the probabilities easily fall below $$10^{-30}$$.

My question is: can anyone explain, or provide a link to, how the scaled forward and backward matrices are calculated?

For the forward procedure the code is as follows:

```mathematica
(* TM: transition matrix, EM: emission matrix, P: initial distribution,
   n: number of states, m: number of emission symbols, Seq: observations.
   The parameters were originally named N and M, but N is a protected
   built-in symbol in Mathematica, so they are renamed here. *)
ForwardProcedure[TM_, EM_, P_, n_, m_, Seq_] :=
 Module[{DP, i, j, DPPart},
  (* alpha(i, 1) = P[i] * EM[i, Seq[1]]; later columns start at 0 *)
  DP = Table[If[j == 1, P[[i]]*EM[[i, Part[Seq, 1]]], 0],
    {i, 1, n}, {j, 1, Length@Seq}];
  For[j = 2, j <= Length@Seq, j++,
   DPPart = DP[[All, j - 1]];
   For[i = 1, i <= n, i++,
    DP[[i, j]] = Dot[DPPart, TM[[All, i]]]*EM[[i, Part[Seq, j]]];
    ];
   ];
  DP
  ]
```

Essentially, the update is the line `DP[[i, j]] = Dot[DPPart, TM[[All, i]]]*EM[[i, Part[Seq, j]]];`. It takes the previous column of the alpha matrix (DP), dots it with column `i` of the transition matrix (TM), and multiplies by the associated emission probability.

How can I have a scaling factor independent of $$i$$ when not all columns $$t$$ of the DP matrix have been filled in yet?


## Background and Notation

Let $$x_1, \ldots, x_T$$ be the observations, and $$z_1, \ldots, z_T$$ be the hidden states. The forward-backward recursions are written in terms of $$\alpha(z_n)$$ and $$\beta(z_n)$$.

$$\alpha(z_n) = p(x_1,\ldots,x_n,z_n)$$ is the joint probability of all the data observed so far and the most recent state. As $$n$$ gets large, this gets very small, simply because of a basic axiom of probability: the probability of an intersection of more and more events can only shrink.

$$\beta(z_n) = p(x_{n+1},\ldots,x_T|z_n)$$ is the probability of all the future data given the current state. This gets small as $$n$$ goes down, for the same reason as above.

## Forward Algorithm

You can see that both of these quantities become very small, and hence cause numerical underflow, whenever you have a non-tiny dataset. So instead of using the forward recursion $$\alpha(z_n) = p(x_n|z_n) \sum_{z_{n-1}}\alpha(z_{n-1})p(z_n|z_{n-1})$$ use the filtering recursion: $$p(z_n|x_1,\ldots,x_n) = \frac{p(x_n|z_n)\sum_{z_{n-1}} p(z_n|z_{n-1})p(z_{n-1}|x_1,\ldots,x_{n-1})}{p(x_n|x_1,\ldots,x_{n-1})}.$$ You can get this algebraically by dividing both sides of the first recursion by $$p(x_1,\ldots,x_n)$$. But I would program it by keeping track of the filtering distributions, not the $$\alpha$$ quantities. You also don't usually need to compute the normalizing constant explicitly: at each step you can simply normalize your probability vector, provided your state space isn't too large.
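To make this concrete, here is a minimal NumPy sketch of the filtering recursion (the function name and array conventions are my own illustration, not from Rabiner's paper; `trans[i, j]` is taken to be $$p(z_{t+1}=j \mid z_t = i)$$):

```python
import numpy as np

def filtered_forward(trans, emis, init, obs):
    """Filtering recursion for a discrete HMM.

    Returns filt[t, i] = p(z_t = i | x_1..x_t) and the per-step
    normalizers c[t] = p(x_t | x_1..x_{t-1}), so that
    log p(x_1..x_T) = np.log(c).sum() -- nothing ever underflows.
    """
    T, n = len(obs), len(init)
    filt = np.zeros((T, n))
    c = np.zeros(T)
    # t = 0: unnormalized alpha(z_1) = p(z_1) p(x_1 | z_1)
    a = init * emis[:, obs[0]]
    c[0] = a.sum()
    filt[0] = a / c[0]
    for t in range(1, T):
        # predict one step, weight by the emission, then normalize
        a = (filt[t - 1] @ trans) * emis[:, obs[t]]
        c[t] = a.sum()
        filt[t] = a / c[t]
    return filt, c
```

The per-step normalizers `c[t]` are exactly what the HMM literature calls the "scaling factors": accumulating their logs recovers the log-likelihood without ever forming the tiny $$\alpha$$ values.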

I am more familiar with the non-HMM state-space model literature, where the filtering recursions are probably the most common set of recursions. I have found a few places on the internet where the HMM literature calls this procedure "scaling." So it's nice to find this connection and figure out what all this $$\alpha$$ stuff is.

## Backward Algorithm

Next, the traditional HMM backward algorithm is stated as $$\beta(z_n) = \sum_{z_{n+1}}\beta(z_{n+1})p(x_{n+1}|z_{n+1})p(z_{n+1}|z_n).$$ This one is trickier, though. I have found a few websites that say to divide both sides by $$p(x_1,\ldots,x_n)$$ again, and I have also found resources that say to divide both sides by $$p(x_{n+1},\ldots,x_T|x_1,\ldots,x_n)$$. I'm still working this out, though. Also, I suspect there's another connection between this and the backward recursions in other areas of the state-space model literature (for marginal and joint smoothing distributions). I'll keep you posted.
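For what it's worth, the common "scaled backward" choice divides the backward variables by the same per-step normalizers $$c_t = p(x_t|x_1,\ldots,x_{t-1})$$ produced by the forward pass. A NumPy sketch under my own conventions (`trans[i, j]` = $$p(z_{t+1}=j \mid z_t=i)$$, `c` from a forward filtering pass), not code from any particular source:

```python
import numpy as np

def scaled_backward(trans, emis, obs, c):
    """Scaled backward recursion for a discrete HMM.

    Divides each backward step by the forward-pass normalizer
    c[t+1] = p(x_{t+1} | x_1..x_t), which keeps the values O(1):
    beta_hat[t, i] = p(x_{t+1}..x_T | z_t=i) / p(x_{t+1}..x_T | x_1..x_t).
    """
    T, n = len(obs), trans.shape[0]
    beta_hat = np.ones((T, n))  # beta_hat at t = T-1 is all ones
    for t in range(T - 2, -1, -1):
        b = trans @ (emis[:, obs[t + 1]] * beta_hat[t + 1])
        beta_hat[t] = b / c[t + 1]
    return beta_hat
```

With this particular scaling, the smoothing distribution $$p(z_t|x_1,\ldots,x_T)$$ is just the elementwise product of the filtered distribution and `beta_hat[t]`, with no further renormalization needed.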

## Edit:

It turns out that if you multiply both sides of the HMM backward equation above by $$p(x_1, \ldots, x_n,z_n)/p(x_1,\ldots,x_T)$$ you get the backward equation for the marginal smoothing distribution, which is more common in non-HMM state-space models.

$$p(z_n|x_1,\ldots,x_T) = p(z_n|x_1,\ldots,x_n) \sum_{z_{n+1}}\frac{p(z_{n+1}|z_n)}{p(z_{n+1}|x_1,\ldots,x_n)}p(z_{n+1}|x_1,\ldots,x_T).$$

I don't know if this is standard for using Baum-Welch, but it's pretty standard with other state space models. Cool.
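This recursion can be run directly on the filtered distributions, with no $$\alpha$$ or $$\beta$$ quantities at all. A sketch under my own naming and conventions (`trans[i, j]` = $$p(z_{t+1}=j \mid z_t=i)$$, `filt[t]` the filtered distribution at time t):

```python
import numpy as np

def smooth_from_filtered(trans, filt):
    """Marginal smoothing recursion on filtered distributions.

    Implements p(z_t|x_1..x_T) = p(z_t|x_1..x_t) *
    sum_{z_{t+1}} p(z_{t+1}|z_t) / p(z_{t+1}|x_1..x_t) * p(z_{t+1}|x_1..x_T),
    working backward from the final filtered distribution.
    """
    T, n = filt.shape
    smooth = np.zeros((T, n))
    smooth[-1] = filt[-1]  # at t = T-1 filtering and smoothing coincide
    for t in range(T - 2, -1, -1):
        pred = filt[t] @ trans  # one-step predictive p(z_{t+1} | x_1..x_t)
        smooth[t] = filt[t] * (trans @ (smooth[t + 1] / pred))
    return smooth
```

Since every quantity here is a normalized probability vector, the recursion is numerically stable regardless of the sequence length.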

## Edit 2:

The authors of https://www.springer.com/us/book/9780387402642 define on page 63 the normalized backward function as

$$\hat{\beta}(z_n) = \beta(z_n)\frac{p(x_{1:n})}{p(x_{1:T})} = \frac{p(z_n \mid x_{1:T})}{p(z_n \mid x_{1:n})},$$ which, as they mention on page 65, has the interpretation of a ratio of two marginal smoothing distributions.
