Solved – How to avoid ‘Catastrophic forgetting’

I read this article by Matthew Honnibal (creator of spaCy) in which he talks about the 'catastrophic forgetting' problem.

He says that when we want to fine-tune a pre-trained model to add a new label or correct some specific errors, it can introduce the 'catastrophic forgetting' problem (the model loses its generality). To combat this, he proposes a technique called pseudo-rehearsal: generate predictions for a number of examples with the initial model, mix them into the fine-tuning data, and use them as training objectives for the new model.

So does this mean that I use $\hat{Y}$ (the predicted values) generated by the initial model instead of $Y$ (the ground truth), mix them with the $Y$ values of the newly obtained data points, and use the combined set to train my model?

Am I correct? Can someone elaborate?

Catastrophic forgetting is an inherent problem in neural networks. From Wikipedia:

(Catastrophic forgetting) is a radical manifestation of the 'sensitivity-stability' dilemma or the 'stability-plasticity' dilemma. Specifically, these problems refer to the issue of being able to make an artificial neural network that is sensitive to, but not disrupted by, new information. Lookup tables and connectionist networks lie on opposite sides of the stability-plasticity spectrum. The former remains completely stable in the presence of new information but lacks the ability to generalize, i.e. infer general principles, from new inputs.

What is catastrophic forgetting? Let's consider two tasks: task A and task B. Suppose we're using a pre-trained model that is already pretty good at task A (with learned weights $\theta_A$), and we want to "fine-tune" it to also fit task B. The common practice is to take the weights of a model trained on task A and use them as the initialization for training on task B. This works well in applications where task B is a "sub-task" of task A (e.g., task B is detecting eyeglasses and task A is detecting faces). When B is not a sub-task of A, there is the fear that catastrophic forgetting will occur: essentially, the network will reuse the same neurons that were previously optimized for task A to predict on task B, and in doing so it will completely lose its ability to classify instances of task A correctly. You can actually experiment with this yourself: build a small network that can tell whether an MNIST image is a 5 or not, and measure its accuracy at this task; if you then fine-tune this model to tell whether an MNIST image is a 4 or not, you will notice that the accuracy of the final model on the original task (recognizing 5s) has worsened.
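If you want to try that experiment, here is a minimal sketch. The framework (PyTorch/torchvision), network size, and hyperparameters are my own illustrative choices, not anything prescribed by the technique: each task is posed as binary classification on MNIST, and we measure task-A accuracy before and after fine-tuning on task B.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets

def binary_loader(train, digit):
    # Pose "is this image the given digit?" as binary classification.
    ds = datasets.MNIST("data", train=train, download=True)
    xs = ds.data.float().div(255.0).view(-1, 784)
    ys = (ds.targets == digit).float().unsqueeze(1)
    return DataLoader(TensorDataset(xs, ys), batch_size=128, shuffle=train)

model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.BCEWithLogitsLoss()

def fit(model, data, epochs=2):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

def accuracy(model, data):
    with torch.no_grad():
        hits = sum(((model(x) > 0) == y.bool()).sum().item() for x, y in data)
    return hits / len(data.dataset)

fit(model, binary_loader(True, 5))                         # task A: 5 vs. not-5
print("task A before:", accuracy(model, binary_loader(False, 5)))
fit(model, binary_loader(True, 4))                         # fine-tune on task B: 4 vs. not-4
print("task A after: ", accuracy(model, binary_loader(False, 5)))  # accuracy drops
```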

A naive solution. The naive solution to catastrophic forgetting would be to not only initialize the weights of the fine-tuned model to $\theta_A$, but also to add regularization: penalize the fine-tuned model's solution when it gets far from $\theta_A$. Essentially, this means the objective will be to find the best solution for task B that is still similar to $\theta_A$, the solution to task A. The reason we call this a naive approach is that it often doesn't work well. The functions learned by neural networks are often very complicated and far from linear, so a small change in parameter values (i.e., $\theta_B$ being close to $\theta_A$) can still lead to very different outcomes (i.e., $f_{\theta_A}$ being very different from $f_{\theta_B}$). Since it's the outcomes we care about, this is bad for us.
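As a concrete sketch (continuing the hypothetical PyTorch code above), the penalty is just the squared L2 distance between the current parameters and a snapshot of $\theta_A$; the weight `lam` is an illustrative choice:

```python
# Snapshot theta_A right after training on task A.
theta_A = [p.detach().clone() for p in model.parameters()]

def fit_regularized(model, data, lam=1e-2, epochs=2):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            # Penalize drifting away from the task-A solution in weight space.
            for p, p0 in zip(model.parameters(), theta_A):
                loss = loss + lam * (p - p0).pow(2).sum()
            loss.backward()
            opt.step()
```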

Pseudo-rehearsal. A better approach is to try to be good at task B while simultaneously giving answers similar to those given by $f_{\theta_A}$. The good news is that this approach is very easy to implement: once we have learned $\theta_A$, we can use that model to generate an effectively unlimited number of "labeled" examples $(x, f_{\theta_A}(x))$. Then, when training the fine-tuned model, we alternate between examples labeled for task B and examples of the form $(x, f_{\theta_A}(x))$. You can think of the latter as "revision exercises" that make sure our network does not lose its ability to handle task A while learning to handle task B.
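Continuing the same hypothetical sketch, starting again from a model trained only on task A: pseudo-rehearsal only needs that model's own outputs as soft targets. Here I reuse the task-A inputs for the rehearsal examples for simplicity, but any inputs from roughly the right distribution would do.

```python
def rehearsal_loader(model, data):
    # Label inputs with the *task-A* model's own predictions, not ground truth.
    xs, ys = [], []
    with torch.no_grad():
        for x, _ in data:
            xs.append(x)
            ys.append(torch.sigmoid(model(x)))  # soft targets f_theta_A(x)
    return DataLoader(TensorDataset(torch.cat(xs), torch.cat(ys)),
                      batch_size=128, shuffle=True)

rehearsal = rehearsal_loader(model, binary_loader(True, 5))  # build BEFORE fine-tuning
task_b = binary_loader(True, 4)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2):
    # Alternate between real task-B batches and pseudo-labeled "revision" batches.
    for (xb, yb), (xr, yr) in zip(task_b, rehearsal):
        opt.zero_grad()
        (loss_fn(model(xb), yb) + loss_fn(model(xr), yr)).backward()
        opt.step()
```

This is also the answer to the question above: the rehearsal examples carry the old model's $\hat{Y}$ as targets, while the new task-B examples keep their real $Y$.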

An even better approach: add memory. As humans, we are good both at generalizing (plasticity) from new examples and at remembering very rare events or maintaining skills we haven't used for a while (stability). In many ways, the only method to achieve something similar with deep neural networks, as we know them, is to incorporate some form of "memory" into them. This is outside the scope of your question, but it is an interesting and active field of research, so I thought I'd mention it. See, for example, this recent work: Learning to Remember Rare Events.
