Solved – what is the intuition behind separate activation/memory paths in LSTM’s

See the diagram below, which is taken from a homework in Andrew Ng's course on Sequence Models on Coursera (this question is not related to a homework task per se, just for general edification).

I'm wondering what the intuition is, if any, behind passing two values in between time steps: both the memory cell value and the activation value. As you may know, in GRU's we dispense with this and only pass a memory cell value forward (there is no separate activation). But I assume there must be some reason why LSTM's decide to do both.

Relatedly, what is the intuition behind using the activation values to compute the gates and candidate values, but not the memory cell values? Of course sometimes we also use the memory cell values by modifying the LSTM to use peephole connections, but it's not present in the default setup.

Any further intuition you can provide would be appreciated.

a diagram of an LSTM model

In a vanilla RNN, there is only the activation path (also often referred to as the hidden-state $h_t$).

LSTM added on the cell memory $c_t$ as a way to store information over long time-spans in particular.

GRU, which was developed later, simplifies the LSTM by combining both the cell memory and the hidden-state.

Therefore the intuition is that the cell memory stores longer-term information while the hidden-state is still used the same way as it is used in vanilla RNN, but later on, we discovered that it's possible to combine the two without too much degradation in performance. In other words, GRU is a sort of refinement on the ideas from LSTM.

The cell memory is used to modify the activation path before the output, so intuitively, whatever relevant information in the long-term memory can be dumped into the short-term memory (the activations) right before an output needs to be extracted

Similar Posts:

Rate this post

Leave a Comment