Catastrophic forgetting (also: catastrophic interference) is a term, often used in connectionist literature, to describe a common problem with many traditional artificial neural network models. It refers to the catastrophic loss of previously learned responses whenever an attempt is made to train the network with a single new (additional) response. Affected networks include, for example, backpropagation learning networks.
Catastrophic forgetting is a primary reason artificial neural networks are not able to continuously learn from their surroundings. Traditionally, such networks must be fully trained on a complete set of expected responses before being put into service, at which point learning must be disabled.
Catastrophic forgetting is a specific case of a more general phenomenon called interference, which is why it is sometimes called catastrophic interference in connectionist literature. The normal form of interference, which is observed in natural learning systems such as humans, tends to cause gradual losses. In artificial systems, however, the losses caused by interference are, well, catastrophic.
In an apparent effort to explain away the problem that traditional artificial neural networks have had with catastrophic forgetting, the stability-plasticity problem is sometimes promoted as a dilemma. This notion has often been pushed even though natural systems have obviously overcome the problem. More recently, the problem has been fully overcome in artificial neural networks as well (see multi-temporal synapses).
. . . . . . .
The Problem — Training A Set vs Training A Single New Response
The reason training on an entire set differs from simply adding a single new response is that, during training, each response in the set is used to move the weights slightly on each training iteration. Training of the mappings proceeds in an interleaved fashion. That is, the entire set is cycled through multiple times during training, and each response is only slightly trained in each cycle. If a new response is simply added on to an existing set, it would, necessarily, be trained completely by itself, without the other responses having a chance to be reinforced. To put it another way...
This doesn't work.
- Fully train response 1, then
- Fully train response 2, then
- . . .
- When all have been fully trained,
- Done (?)
This does work!
- Slightly train response 1, then
- Slightly train response 2, then
- . . .
- When all slightly trained,
- repeat until all are fully trained
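The two schedules above can be compared in a minimal sketch. It assumes a single linear unit trained with the delta rule, which is a stand-in chosen for brevity rather than anything this article specifies. The two overlapping patterns, the learning rate, and all function names are illustrative.

```python
# Toy comparison of the two training schedules.
# Assumptions (not from the article): one linear unit, delta-rule updates,
# and two input patterns that overlap on their first input.

PATTERNS = [
    ([1.0, 1.0], 1.0),  # response 1: input pattern and desired output
    ([1.0, 0.0], 0.0),  # response 2: shares its first input with response 1
]

def respond(weights, inputs):
    """Output of the linear unit for one input pattern."""
    return sum(w * x for w, x in zip(weights, inputs))

def train_slightly(weights, inputs, target, rate=0.1):
    """One small delta-rule step toward the desired output."""
    error = target - respond(weights, inputs)
    return [w + rate * error * x for w, x in zip(weights, inputs)]

def train_one_at_a_time(steps=200):
    """Fully train response 1, then fully train response 2."""
    weights = [0.0, 0.0]
    for inputs, target in PATTERNS:
        for _ in range(steps):
            weights = train_slightly(weights, inputs, target)
    return weights

def train_interleaved(cycles=2000):
    """Slightly train each response in turn, and repeat the whole cycle."""
    weights = [0.0, 0.0]
    for _ in range(cycles):
        for inputs, target in PATTERNS:
            weights = train_slightly(weights, inputs, target)
    return weights

def errors(weights):
    """Absolute output error for each response in the set."""
    return [abs(target - respond(weights, inputs)) for inputs, target in PATTERNS]

sequential_errors = errors(train_one_at_a_time())
interleaved_errors = errors(train_interleaved())
print("one at a time:", sequential_errors)
print("interleaved:  ", interleaved_errors)
```

In this toy setup, training response 2 by itself pulls the shared first weight away from the value response 1 relied on, so response 1 is left with a large error (about 0.5), while the interleaved schedule settles on weights that serve both responses.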
With a little thought, the mechanisms are easy to understand. Each weight in a given weight-space represents a small part of many different responses to many different situations (i.e., stimulus patterns). For any given weight, when a network first learns to respond to a set of patterns, it will find a value that represents a tradeoff, one that allows its neuron to respond to all of the patterns it has learned. However, when the network learns to respond to a single new pattern of stimuli, it will adjust each individual weight to a value that is optimal for the new pattern alone.
Assume, for simplicity, a training set of only two pattern/response pairs. To respond to both patterns in the set, the network must find a value for any given connection weight that can provide proper responses to both patterns on the output. The individual value may not be optimal for either response, but it will yield acceptable responses to both patterns.
This characteristic of how responses are stored in connection-strengths is not a bad thing. By representing many responses to many situations within each individual connection-weight value, the network is able to generalize, and begin to respond correctly to many novel variations of the set on which it had been previously trained. This is true of biological neural networks as well.
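The tradeoff held by a single weight can be illustrated with a deliberately simplified sketch. It assumes one connection weight that two previously learned responses each pull toward a different preferred value, followed by training on a single new response alone. The preferred values, the learning rate, and the names are hypothetical.

```python
# Toy illustration of one shared connection weight.
# Assumptions (not from the article): each response "prefers" a particular
# value for the weight, and training nudges the weight toward that value.

def train_slightly(weight, preferred, rate=0.01):
    """Move the weight a small step toward the value one response prefers."""
    return weight + rate * (preferred - weight)

OLD_PREFERENCES = [0.4, 0.6]  # values the two original responses would like
NEW_PREFERENCE = 1.0          # value the single new response would like

# Interleaved training on the original set settles between the two
# preferences: optimal for neither response, acceptable for both.
weight = 0.0
for _ in range(2000):
    for preferred in OLD_PREFERENCES:
        weight = train_slightly(weight, preferred)
compromise = weight

# Training the new response by itself moves the weight to the value that
# is best for the new response only, abandoning the earlier compromise.
for _ in range(2000):
    weight = train_slightly(weight, NEW_PREFERENCE)
after_new = weight

print("compromise value:", round(compromise, 3))
print("after new response alone:", round(after_new, 3))
```

The compromise lands near the midpoint of the two old preferences. After the new response is trained alone, the weight sits at the new preference, and nothing remains of the value that served the original set.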
. . . . . . .
Historically, the normal way of dealing with catastrophic forgetting in artificial neural networks has been to simply train on a broad enough set of exemplar responses to begin with, and, once trained, use the network in non-learning (response-only) mode.
It is often the case, though, that an existing, fully-trained, network might need to have some new response-mapping added to its repertoire. These cases have generally been handled by starting over with a blank (completely untrained) network. The blank network would then be trained on a set of exemplars that included the entire original set, plus the new response mapping. This procedure is called "rehearsal."
Another possible way to resolve this problem might be to start with the existing weight-set, rather than to begin with a blank network. To train the new response, one could continue to cycle through the entire original set, plus the new response.
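As a rough sketch of this option, the example below starts from an existing weight-set and compares training the new response alone against cycling through the original set plus the new response. It again assumes a single linear unit with delta-rule updates. The patterns, targets, and names are illustrative only.

```python
# Sketch of retraining from an existing weight-set.
# Assumptions (not from the article): one linear unit with three inputs,
# delta-rule updates, and hand-picked patterns that overlap.

def respond(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

def train_slightly(weights, inputs, target, rate=0.1):
    error = target - respond(weights, inputs)
    return [w + rate * error * x for w, x in zip(weights, inputs)]

def train_interleaved(weights, items, cycles):
    """Cycle through every item, training each one slightly per cycle."""
    for _ in range(cycles):
        for inputs, target in items:
            weights = train_slightly(weights, inputs, target)
    return weights

def errors(weights, items):
    return [abs(target - respond(weights, inputs)) for inputs, target in items]

ORIGINAL_SET = [([1.0, 0.0, 1.0], 1.0), ([0.0, 1.0, 1.0], 0.0)]
NEW_ITEM = ([1.0, 1.0, 1.0], 2.0)

# The existing, fully trained network.
existing = train_interleaved([0.0, 0.0, 0.0], ORIGINAL_SET, cycles=2000)

# Option A: train only the new response, starting from the existing weights.
new_only = train_interleaved(existing, [NEW_ITEM], cycles=2000)

# Option B: start from the existing weights and cycle through the original
# set plus the new response.
rehearsed = train_interleaved(existing, ORIGINAL_SET + [NEW_ITEM], cycles=5000)

new_only_errors = errors(new_only, ORIGINAL_SET)
rehearsed_errors = errors(rehearsed, ORIGINAL_SET + [NEW_ITEM])
print("original responses after new-only training:", new_only_errors)
print("all responses after cycling through the full set:", rehearsed_errors)
```

In this toy case the new-only run leaves both original responses badly off, while cycling through the full set brings the original responses and the new one to near-zero error.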
Another re-training procedure, known as "pseudorehearsal," has also been shown to provide very good results, without the need to store the original training set (see references below). Certainly, many studies of such options have been performed.
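The core step of pseudorehearsal, as it is generally described in the literature, is to pass random input patterns through the already-trained network and record its responses. The resulting "pseudo-items" are then rehearsed alongside the new response in place of the original training set. The sketch below shows only that generation step, for a single linear unit. The weights, the new item, and all names are hypothetical stand-ins.

```python
import random

def respond(weights, inputs):
    """Output of a single linear unit (a stand-in for the trained network)."""
    return sum(w * x for w, x in zip(weights, inputs))

def make_pseudo_items(weights, count, rng):
    """Pair random binary inputs with the trained network's own responses."""
    items = []
    for _ in range(count):
        inputs = [float(rng.randint(0, 1)) for _ in weights]
        items.append((inputs, respond(weights, inputs)))
    return items

# Hypothetical weights standing in for a network that is already trained.
trained_weights = [0.5, -0.25, 1.0, 0.75]

rng = random.Random(0)  # seeded so the run is repeatable
pseudo_items = make_pseudo_items(trained_weights, count=5, rng=rng)

# Hypothetical new response to be added to the repertoire.
new_item = ([1.0, 1.0, 0.0, 0.0], 1.0)

# The set to cycle through during retraining: the pseudo-items stand in
# for the original training set, which no longer needs to be stored.
retraining_set = pseudo_items + [new_item]
print(len(retraining_set), "items to interleave during retraining")
```

Retraining would then cycle through `retraining_set` in the usual interleaved fashion, so that the network's existing behavior is reinforced while the new response is learned.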
. . . . . . .
Multitemporal Synapses Provide a More General Solution
. . . . . . .
Relatively recently, a new learning structure and method called multi-temporal synapses has been developed. Among other things, multitemporal synapses eliminate the problems associated with catastrophic forgetting in artificial neural networks. The idea works by embracing forgetting as simply an inevitable, and even necessary, part of continuously adapting to present-moment details. Multitemporal synapses are able to learn, and continuously forget, at different rates. This, in turn, allows the system to continuously learn and adapt to each new present moment that happens along.
Slower (or permanent) weights in multitemporal synapses learn from their faster counterparts at the same connection-points. Because of this, they are able to be trained gradually, by many different present-moment experiences, over time.
The function of the fast-learning weights is to quickly learn to respond to each present moment as it is encountered, and to just as quickly forget those lessons when no longer needed. The slower weights absorb the various encounters in an interleaved fashion, and at a slow rate. They are, therefore, continuously, gradually, trained on a repertoire of multiple present moments as they pass. In the general case, the repertoire will be the most recently relevant sub-set of all present moments experienced.
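The division of labor between fast and slow weights can be sketched very loosely in code. This is not the multitemporal-synapse method itself, whose update rules are not given here. It is only a toy, built on assumed rules, in which a fast weight learns and decays quickly while a slow weight at the same connection gradually absorbs a fraction of the fast weight. The class name, the rates, and the additive combination of the two weights are all assumptions made for illustration.

```python
class TwoRateSynapse:
    """One connection holding a fast weight and a slow weight (assumed design)."""

    def __init__(self, fast_rate=0.5, fast_decay=0.9, slow_rate=0.01):
        self.fast = 0.0
        self.slow = 0.0
        self.fast_rate = fast_rate    # how quickly the fast weight learns
        self.fast_decay = fast_decay  # fraction of the fast weight kept per step
        self.slow_rate = slow_rate    # how quickly the slow weight absorbs

    def strength(self):
        # Assumption: the connection's effective strength is the simple sum.
        return self.fast + self.slow

    def step(self, error=0.0):
        """Advance one time step; error is 0.0 when nothing is being learned."""
        self.fast += self.fast_rate * error      # fast weight learns quickly
        self.slow += self.slow_rate * self.fast  # slow weight learns from the fast one
        self.fast *= self.fast_decay             # fast weight continuously forgets

synapse = TwoRateSynapse()

# Brief exposure: drive the connection toward a desired strength of 1.0.
for _ in range(20):
    synapse.step(error=1.0 - synapse.strength())
fast_after_exposure, slow_after_exposure = synapse.fast, synapse.slow

# Idle period: no training signal, so the fast weight fades away.
for _ in range(200):
    synapse.step()
fast_after_idle, slow_after_idle = synapse.fast, synapse.slow

print("after exposure: fast", round(fast_after_exposure, 3),
      "slow", round(slow_after_exposure, 3))
print("after idle:     fast", round(fast_after_idle, 3),
      "slow", round(slow_after_idle, 3))
```

In this toy, the fast weight carries most of the response right after the brief exposure. After the idle period it has decayed to essentially zero, while the slow weight keeps the small amount it absorbed. Under these assumed rules, repeated encounters over time would let the slow weight build up gradually.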
- Catastrophic Forgetting in Neural Networks on: 20160504
Smell that? (4.4.3 Levy and Bairaktaris' high-capacity dual-weight model pg 27).
- Mitigation of Catastrophic Forgetting in Recurrent Neural Networks using a Fixed Expansion Layer on: 20160504
Smell that? 2013, and(?) ref 
- The Evolution of Minimal Catastrophic Forgetting in Neural Systems on: 20160504
Smell that? re: "Other approaches have involved allowing two sets of weighted connections between nodes. Hinton & Plaut (1987) used dual-additive weights, with fast weights to learn new patterns and slow weights for long-term storage." dated 2005: (see: Cog-Sci Conference?)
- Adaptation of Artificial Neural Networks Avoiding Catastrophic Forgetting on: 20160504
The state of the art of avoiding catastrophic forgetting in 2006 (i.e., pre-multitemporal synapses).
- An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks on: 20160504
Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting.
- [pdf] Catastrophic Forgetting in Connectionist Networks: Causes, Consequences and Solutions
"Only rarely (see Box 3) does new learning in natural cognitive systems completely disrupt or erase previously learned information. In other words, natural cognitive systems do not, in general, forget catastrophically. Unfortunately, however, this is precisely what occurs under certain circumstances in distributed connectionist networks."
- [pdf] Catastrophic forgetting in simple networks: an analysis of the pseudorehearsal solution
A quote from the article: "The most common practical solution to this problem is simply to form a new training set which includes all the old items as well as the new one, and learn this enlarged training set."
- [pdf] Catastrophic Interference
Sometimes this is referred to as catastrophic interference. This paper is a general overview of the problems with MLPs and backpropagation. It is presented as a slide-show with supporting text.