Solved – Scaling step in Baum-Welch algorithm

I am implementing the Baum-Welch algorithm for training a hidden Markov model, mainly to better understand the training process.

I have implemented the iterative procedures described in Rabiner's classic paper; my implementation is in Wolfram Mathematica. The problem I am facing is the scaling step: if the sequence is 100 observations long, the probabilities easily fall below $10^{-30}$.

My question is: can anyone explain, or provide a link to, how the scaled forward and backward matrices are calculated?

For the forward procedure the code is as follows:

    ForwardProcedure[TM_, EM_, P_, nStates_, nObs_, Seq_] :=
     Module[{DP, i, j, DPPart},
      (* DP[[i, j]] is the alpha value for state i at time j *)
      DP = ConstantArray[0, {nStates, nObs}];
      (* initialization: alpha_1(i) = P[[i]] * EM[[i, first observation]] *)
      DP[[All, 1]] = Table[P[[i]]*EM[[i, Seq[[1]]]], {i, 1, nStates}];
      (* recursion over time *)
      For[j = 2, j <= Length[Seq], j++,
       DPPart = DP[[All, j - 1]];   (* previous column of alphas *)
       For[i = 1, i <= nStates, i++,
        DP[[i, j]] = Dot[DPPart, TM[[All, i]]]*EM[[i, Seq[[j]]]];
       ];
      ];
      DP
     ]

Essentially, the update is the line DP[[i, j]] = Dot[DPPart, TM[[All, i]]]*EM[[i, Seq[[j]]]], which takes the previous column of the alpha matrix (DP), dots it with the appropriate column of the transition matrix (TM), and multiplies by the associated emission value.
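For concreteness, here is a toy call with made-up numbers (a two-state model over a binary observation alphabet); the matrices and sequence below are purely illustrative:

    TM  = {{0.7, 0.3}, {0.4, 0.6}};   (* TM[[k, i]] = p(state i at t | state k at t-1) *)
    EM  = {{0.9, 0.1}, {0.2, 0.8}};   (* EM[[i, s]] = p(observing symbol s | state i)   *)
    P   = {0.5, 0.5};                 (* initial state distribution                     *)
    Seq = {1, 2, 2, 1};
    alpha = ForwardProcedure[TM, EM, P, 2, Length[Seq], Seq];
    Total[alpha[[All, -1]]]           (* p(x_1, ..., x_T): sum of the last alpha column *)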

How can I have a scaling factor that is independent of i when, for a given t, not all entries of that column of the DP matrix have been filled in yet?

Background and Notation

Let $x_1, \ldots, x_T$ be the observations, and $z_1, \ldots, z_T$ be the hidden states. The forward-backward recursions are written in terms of $\alpha(z_n)$ and $\beta(z_n)$.

$\alpha(z_n) = p(x_1,\ldots,x_n,z_n)$ is the joint distribution of all the heretofore observed data and the most recent state. As $n$ gets large, this gets very small. This is just because of simple properties of probabilities. Probabilities of subsets get smaller. This is axiomatic.
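As a rough order-of-magnitude check: $\alpha(z_n)$ is bounded by $p(x_1,\ldots,x_n) = \prod_{m=1}^{n} p(x_m \mid x_1,\ldots,x_{m-1})$, so if, for instance, each of those factors is around $0.5$, then after 100 observations you are already at roughly $0.5^{100} \approx 8 \times 10^{-31}$, which is exactly the underflow scale mentioned in the question.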

$\beta(z_n) = p(x_{n+1},\ldots,x_T|z_n)$ is the probability of all the future data given the current state. This gets small as $n$ goes down, for the same reasons as above.

Forward Algorithm

You can see that both of these things become very small, and hence cause numerical issues, whenever you have a non-tiny dataset. So instead of using the forward recursion $$ \alpha(z_n) = p(x_n|z_n) \sum_{z_{n-1}}\alpha(z_{n-1})\,p(z_n|z_{n-1}) $$ use the filtering recursion: $$ p(z_n|x_1,\ldots,x_n) = \frac{p(x_n|z_n)\sum_{z_{n-1}} p(z_n|z_{n-1})\,p(z_{n-1}|x_1,\ldots,x_{n-1})}{p(x_n|x_1,\ldots,x_{n-1})}. $$ You can get this algebraically by dividing both sides of the first recursion by $p(x_1,\ldots,x_n)$. But I would program it by keeping track of the filtering distributions, not the $\alpha$ quantities. You also don't usually need to worry about computing the normalizing constant explicitly: as the last step you can just normalize your probability vector, provided your state space isn't too large.
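Translated into the style of the code in the question, a minimal sketch of this filtered (scaled) forward pass might look like the following; ScaledForwardProcedure is just an illustrative name, and it assumes the same TM, EM, P, Seq conventions as above. Each column is normalized to sum to one, and that normalizer is the per-time-step scaling factor the question asks about: it depends only on t, and it is computed only after the whole column for that t has been filled in.

    ScaledForwardProcedure[TM_, EM_, P_, Seq_] :=
     Module[{n = Length[P], T = Length[Seq], DP, c, col, i, j},
      DP = ConstantArray[0., {n, T}];   (* DP[[i, j]] = p(z_j = i | x_1, ..., x_j)  *)
      c = ConstantArray[0., T];         (* c[[j]] = p(x_j | x_1, ..., x_{j-1})      *)
      (* initialization, then normalize the first column *)
      col = Table[P[[i]]*EM[[i, Seq[[1]]]], {i, 1, n}];
      c[[1]] = Total[col];
      DP[[All, 1]] = col/c[[1]];
      (* recursion: same update as before, but applied to the normalized previous column *)
      For[j = 2, j <= T, j++,
       col = Table[Dot[DP[[All, j - 1]], TM[[All, i]]]*EM[[i, Seq[[j]]]], {i, 1, n}];
       c[[j]] = Total[col];
       DP[[All, j]] = col/c[[j]];
      ];
      {DP, c, Total[Log[c]]}   (* filtered probabilities, scale factors, log-likelihood *)
     ]

Because the scale factors multiply back to the likelihood, $p(x_1,\ldots,x_T) = \prod_j c_j$, the log-likelihood $\sum_j \log c_j$ stays well behaved even for long sequences.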

I am more familiar with the non-HMM state space model literature, and the filtering recursions are probably the most common set of recursions. I have found a few results on the internet where HMM people call this procedure "scaling." So it's nice to find this connection and figure out what all this $\alpha$ stuff is.

Backward Algorithm

Next, the traditional HMM backward algorithm is stated as $$ \beta(z_n) = \sum_{z_{n+1}}\beta(z_{n+1})\,p(x_{n+1}|z_{n+1})\,p(z_{n+1}|z_n). $$ This one is trickier, though. I have found a few websites that say to divide both sides by $p(x_1,\ldots,x_n)$ again, and I have also found resources that say to divide both sides by $p(x_{n+1},\ldots,x_T|x_1,\ldots,x_n)$. I'm still working this out, though. Also, I suspect there's another connection between this and the backward recursions in other areas of the state space model literature (for marginal and joint smoothing distributions). I'll keep you posted.


It turns out if you multiply both sides of the HMM backward equation above by $p(x_1, \ldots, x_n,z_n)/p(x_1,\ldots,x_T)$ you get the backward equation for the marginal smoothing distribution, which is more common in non-HMM state space models.

$$ p(z_n|x_1,\ldots,x_T) = p(z_n|x_1,\ldots,x_n) \sum_{z_{n+1}}\frac{p(z_{n+1}|z_n)}{p(z_{n+1}|x_1,\ldots,x_n)}\,p(z_{n+1}|x_1,\ldots,x_T). $$
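As a quick sanity check on the left-hand side, using only the definitions above: $$ \beta(z_n)\,\frac{p(x_1,\ldots,x_n,z_n)}{p(x_1,\ldots,x_T)} = \frac{p(x_{n+1},\ldots,x_T|z_n)\,p(x_1,\ldots,x_n,z_n)}{p(x_1,\ldots,x_T)} = \frac{p(x_1,\ldots,x_T,z_n)}{p(x_1,\ldots,x_T)} = p(z_n|x_1,\ldots,x_T), $$ which is indeed the marginal smoothing distribution appearing on the left of the recursion.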

I don't know if this is standard when using Baum-Welch, but it's pretty standard with other state space models. Cool.

Edit 2: the normalized backward function is defined on page 63 as

$$ \hat{\beta}(z_n) = \beta(z_n)\,\frac{p(x_{1:n})}{p(x_{1:T})} = \frac{p(z_n \mid x_{1:T})}{p(z_n \mid x_{1:n})}, $$ which, as they mention on page 65, has the interpretation of the ratio of the marginal smoothing distribution to the filtering distribution.
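Putting the pieces together in the same style as the code above, here is a rough sketch (ScaledBackwardProcedure is a made-up name) that computes exactly this $\hat{\beta}$ by reusing the per-step normalizers $c$ returned by the scaled forward pass, since $p(x_{1:n})/p(x_{1:T}) = 1/\prod_{m=n+1}^{T} p(x_m \mid x_{1:m-1})$:

    (* Hypothetical companion to ScaledForwardProcedure above: computes
       betahat_t(i) = beta_t(i) * p(x_{1:t}) / p(x_{1:T}) by dividing each
       backward step by the forward normalizer c[[t+1]] = p(x_{t+1} | x_{1:t}). *)
    ScaledBackwardProcedure[TM_, EM_, Seq_, c_] :=
     Module[{n = Length[TM], T = Length[Seq], B, i, j},
      B = ConstantArray[0., {n, T}];
      B[[All, T]] = ConstantArray[1., n];   (* betahat_T(i) = 1 *)
      For[j = T - 1, j >= 1, j--,
       For[i = 1, i <= n, i++,
        B[[i, j]] =
         Dot[TM[[i, All]], EM[[All, Seq[[j + 1]]]]*B[[All, j + 1]]]/c[[j + 1]];
       ];
      ];
      B
     ]

With the filtered probabilities DP from the forward sketch, the elementwise product DP*B then gives the smoothing marginals $p(z_n \mid x_{1:T})$ directly, because $\hat{\beta}$ is exactly the ratio that converts the filtering distribution into the smoothing distribution.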
