One of the often cited issues in RNN training is the vanishing gradient problem [1,2,3,4].
However, I came across several papers by Anton Maximilian Schaefer, Steffen Udluft and Hans-Georg Zimmermann (e.g. [5]) in which they claim that the problem doesn't exist even in a simple RNN, if shared weights are used.
So, which one is true – does the vanishing gradient problem exist or not?
[1] Learning long-term dependencies with gradient descent is difficult by Y. Bengio et al. (1994)
[2] The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions by S. Hochreiter (1997)
[3] Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies by S. Hochreiter et al. (2003)
[4] On the difficulty of training Recurrent Neural Networks by R. Pascanu et al. (2012)
[5] Learning long-term dependencies with recurrent neural networks by A.M. Schaefer et al. (2008)
First let's restate the problem of vanishing gradients. Suppose you have a normal multilayer perceptron with sigmoidal hidden units, trained by back-propagation. When there are many hidden layers, the error gradient weakens as it moves from the back of the network to the front, because the derivative of the sigmoid is at most 0.25 and shrinks toward zero as the unit saturates. Each layer multiplies the gradient by such a factor, so the updates carry less and less information as you move toward the front of the network.
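To make the shrinking concrete, here is a minimal sketch, assuming a toy chain of single-unit sigmoid layers with every weight fixed at 1.0. The helper `backprop_scale` is a hypothetical name introduced only for illustration; it multiplies the per-layer derivative factors that back-propagation would apply.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; its maximum is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def backprop_scale(pre_activations, weight=1.0):
    """Product of weight * sigmoid'(z) along a chain of single-unit layers.

    Assumption: one unit per layer and the same weight everywhere.
    """
    scale = 1.0
    for z in pre_activations:
        scale *= weight * sigmoid_grad(z)
    return scale

# Best case for the sigmoid: every unit sits at z = 0, so each factor is 0.25.
best_case = backprop_scale([0.0] * 10)
# Saturated units (z = 4) give a much smaller factor per layer.
saturated = backprop_scale([4.0] * 10)
print(best_case, saturated)
```

Even in the best case the scale after ten layers is 0.25 to the tenth power, and saturated units make it smaller still.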
RNNs amplify this problem because they are trained by back-propagation through time (BPTT). Unrolling the network over the sequence means that the number of layers traversed by back-propagation grows with the number of time steps.
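A rough illustration of the BPTT effect, assuming a scalar RNN of the form h_t = tanh(w * h_{t-1} + x_t) with a single shared weight `w`. The function `bptt_gradient` and the chosen inputs are hypothetical; the sketch only tracks how the derivative of the final state with respect to the initial state shrinks as the sequence gets longer.

```python
import math

def bptt_gradient(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule through one time step: d h_t / d h_{t-1} = w * (1 - h_t^2).
        grad *= w * (1.0 - h * h)
    return grad

short_run = bptt_gradient(0.9, [0.5] * 5)
long_run = bptt_gradient(0.9, [0.5] * 100)
print(short_run, long_run)
```

In this toy setting every per-step factor is below 0.9 in magnitude, so the gradient after 100 steps is many orders of magnitude smaller than after 5.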
The long short-term memory (LSTM) architecture avoids the problem of vanishing gradients by introducing error gating. This allows it to learn long-term (100+ step) dependencies between data points through "error carousels."
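The carousel idea can be sketched under a strong simplification: assume a cell update of the form c_t = f_t * c_{t-1} + i_t * g_t and treat the gate values as constants, so that the derivative of c_T with respect to c_0 is just the product of the f_t values. The forget gate f_t is a later addition to LSTM; with f_t fixed at 1.0 the cell has a self-connection of weight one. `cell_gradient` is a hypothetical helper, not part of any library.

```python
def cell_gradient(forget_gates):
    """d c_T / d c_0 for c_t = f_t * c_{t-1} + i_t * g_t.

    Assumption: gates are treated as constants, so only the f_t factors remain.
    """
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad

# Self-connection of weight 1.0: the error is carried unchanged over 100 steps.
carousel = cell_gradient([1.0] * 100)
# A gate stuck at 0.9 still decays the error over the same horizon.
leaky = cell_gradient([0.9] * 100)
print(carousel, leaky)
```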
A more recent trend in training neural networks is to use rectified linear units, which are more robust to the vanishing gradient problem. RNNs with sparsity penalization and rectified linear units apparently work well.
See Advances In Optimizing Recurrent Networks.
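Why rectified linear units help can be seen from their derivative alone. The sketch below assumes a chain of single units with unit weights and compares the product of activation derivatives for ReLU and sigmoid; `chain_scale` is a hypothetical helper. It ignores the weights, which still influence the real gradient.

```python
import math

def relu_grad(z):
    # Derivative of max(0, z): 1 for active units, 0 for inactive ones.
    return 1.0 if z > 0 else 0.0

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def chain_scale(grad_fn, pre_activations):
    """Product of activation derivatives along a chain of units."""
    scale = 1.0
    for z in pre_activations:
        scale *= grad_fn(z)
    return scale

active_path = [2.0] * 50
relu_scale = chain_scale(relu_grad, active_path)        # stays at 1.0
sigmoid_scale = chain_scale(sigmoid_grad, active_path)  # shrinks at every step
blocked = chain_scale(relu_grad, [2.0] * 49 + [-1.0])   # one inactive unit gives 0.0
print(relu_scale, sigmoid_scale, blocked)
```

Along a path of active units the ReLU factor is exactly one, but a single inactive unit cuts the gradient to zero, so the robustness holds only for units that are active.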
Historically, neural network performance depended greatly on many optimization tricks and the selection of many hyperparameters. In the case of RNNs you'd be wise to also implement rmsprop and Nesterov’s accelerated gradient. Thankfully, the recent developments in dropout training have made neural networks more robust towards overfitting. Apparently there is some work towards making dropout work with RNNs.
See On Fast Dropout and its Applicability to Recurrent Networks
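For reference, the two update rules mentioned above, rmsprop and Nesterov's accelerated gradient, can be sketched on a scalar parameter. This is a minimal illustration, not a tuned implementation: the hyperparameter defaults, the function names, and the quadratic test objective f(x) = x**2 are all assumptions made for the example.

```python
import math

def rmsprop_step(param, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    # Running average of squared gradients rescales the step size.
    cache = decay * cache + (1.0 - decay) * grad * grad
    param -= lr * grad / (math.sqrt(cache) + eps)
    return param, cache

def nesterov_step(param, velocity, grad_fn, lr=0.01, momentum=0.9):
    # Evaluate the gradient at the look-ahead point, then update the velocity.
    lookahead = param + momentum * velocity
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    return param + velocity, velocity

def grad_fn(x):
    # Gradient of the toy objective f(x) = x**2.
    return 2.0 * x

x, cache = 5.0, 0.0
for _ in range(2000):
    x, cache = rmsprop_step(x, grad_fn(x), cache)

y, v = 5.0, 0.0
for _ in range(200):
    y, v = nesterov_step(y, v, grad_fn)

print(x, y)
```

Both runs move the parameter from 5.0 to near the minimum at zero; rmsprop takes roughly constant-size steps regardless of gradient magnitude, which is part of why it is often suggested for RNNs with badly scaled gradients.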