Catastrophic forgetting (also: catastrophic interference) is a term, often used in connectionist literature, to describe a common problem with many traditional artificial neural network models. It refers to the catastrophic loss of previously learned responses whenever an attempt is made to train the network with a single new (additional) response. Affected networks include, for example, backpropagation learning networks.
Catastrophic forgetting is a primary reason artificial neural networks are not able to learn continuously from their surroundings. Traditionally, such networks must be fully trained on a complete set of expected responses before being put into service, at which point learning must be disabled.
Catastrophic forgetting is a specific case of a more general phenomenon called interference, which is why it is sometimes called catastrophic interference in connectionist literature. The normal form of interference, which is observed in natural learning systems such as humans, tends to cause gradual losses. In artificial systems, however, the losses caused by interference are, well, catastrophic.
In an apparent effort to explain away the problem that traditional artificial neural networks have had with catastrophic forgetting, the stability-plasticity problem is sometimes promoted as a dilemma. This notion has often been pushed even though natural systems have obviously overcome the problem. More recently, the problem has been fully overcome in artificial neural networks as well (see multi-temporal synapses).
. . . . . . .
The Problem — Training A Set vs Training A Single New Response
The reason training on an entire set differs from simply adding a single new response is that, during training, each response in the set is used to move the weights slightly on each training iteration. Training of the mappings proceeds in an interleaved fashion. That is, the entire set is cycled through multiple times during training, and each response is trained only slightly in each cycle. If a new response is simply added to an existing set, it is necessarily trained completely by itself, without the other responses having a chance to be reinforced. To put it another way:
This doesn't work.
- Fully train response 1, then
- Fully train response 2, then
- . . .
- When all have been fully trained,
- Done (?)
This does work!
- Slightly train response 1, then
- Slightly train response 2, then
- . . .
- When all have been slightly trained,
- Repeat until all are fully trained
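The two schedules above can be sketched on a toy problem. The following is a minimal illustration, not a model from the literature: it assumes a single linear neuron with two weights, a plain delta-rule update, and two hypothetical, overlapping stimulus patterns. The names `PATTERNS`, `train_sequential`, and `train_interleaved` are invented for this example.

```python
# Hypothetical toy setup: one linear neuron, two overlapping input patterns.
PATTERNS = [([1.0, 0.0], 1.0),   # response 1: stimulus -> desired output
            ([1.0, 1.0], 0.0)]   # response 2 shares the first input line


def output(weights, inputs):
    """Weighted sum of the inputs (a linear neuron)."""
    return sum(w * x for w, x in zip(weights, inputs))


def train_step(weights, inputs, target, rate=0.1):
    """One small delta-rule update toward a single desired response."""
    error = target - output(weights, inputs)
    return [w + rate * error * x for w, x in zip(weights, inputs)]


def train_sequential(patterns, steps=500):
    """Fully train response 1, then fully train response 2."""
    weights = [0.0, 0.0]
    for inputs, target in patterns:
        for _ in range(steps):
            weights = train_step(weights, inputs, target)
    return weights


def train_interleaved(patterns, cycles=500):
    """Slightly train each response in turn, and repeat the whole cycle."""
    weights = [0.0, 0.0]
    for _ in range(cycles):
        for inputs, target in patterns:
            weights = train_step(weights, inputs, target)
    return weights


seq_weights = train_sequential(PATTERNS)
mix_weights = train_interleaved(PATTERNS)
seq_outputs = [output(seq_weights, x) for x, _ in PATTERNS]
mix_outputs = [output(mix_weights, x) for x, _ in PATTERNS]
print("sequential :", [round(y, 3) for y in seq_outputs])
print("interleaved:", [round(y, 3) for y in mix_outputs])
```

In this toy, the sequential schedule ends with response 1 at roughly 0.5 instead of the desired 1, because fully training response 2 moved the shared weight with nothing to hold it in place. The interleaved schedule reaches both desired outputs. The loss here is partial rather than total only because the example is so small.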
With a little thought, the mechanisms are easy to understand. Each weight in a given weight-space represents a small part of many different responses to many different situations (i.e., stimulus patterns). For any given weight, when a network first learns to respond to a set of patterns, it will find a value that represents a tradeoff, one that allows its neuron to respond to all of the patterns it has learned. However, when the network learns to respond to a single new pattern of stimuli, it will adjust each individual weight toward a value that is optimal for the new pattern alone.
Assume, for simplicity, a training set of only two pattern/response pairs. To respond to both patterns in the set, the network must find a value for any given connection weight that can provide proper responses to both patterns on the output. The individual value may not be optimal for either response, but it will produce acceptable response values for both patterns.
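As a worked illustration of that tradeoff, assume (the text does not specify this) a single linear weight $w$, two stimulus values $x_1, x_2$ with desired responses $t_1, t_2$, and a squared-error measure $E$. All of these symbols are introduced here for illustration only.

```latex
E(w) = (t_1 - w x_1)^2 + (t_2 - w x_2)^2
\qquad
\frac{dE}{dw} = 0 \;\Longrightarrow\; w^{*} = \frac{x_1 t_1 + x_2 t_2}{x_1^2 + x_2^2}
```

Training on the second pattern alone would instead drive the weight to $w = t_2 / x_2$ (for $x_2 \neq 0$), which generally differs from $w^{*}$. For example, with $x_1 = x_2 = 1$, $t_1 = 1$, and $t_2 = 0$, the compromise value is $w^{*} = 0.5$, while training on the second pattern alone gives $w = 0$ and loses the first response entirely.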
This characteristic of how responses are stored in connection-strengths is not a bad thing. By representing many responses to many situations within each individual connection-weight value, the network is able to generalize, and begin to respond correctly to many novel variations of the set on which it had been previously trained. This is true of biological neural networks as well.
. . . . . . .
Historically, the normal way of dealing with catastrophic forgetting in artificial neural networks has been to simply train on a broad enough set of exemplar responses to begin with, and, once trained, use the network in non-learning (response-only) mode.
It is often the case, though, that an existing, fully-trained, network might need to have some new response-mapping added to its repertoire. These cases have generally been handled by starting over with a blank (completely untrained) network. The blank network would then be trained on a set of exemplars that included the entire original set, plus the new response mapping. This procedure is called "rehearsal."
Another possible way to resolve this problem might be to start with the existing weight-set, rather than with a blank network. To train the new response, one could continue to cycle through the entire original set, plus the new response.
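A minimal sketch of that idea follows, under stated assumptions: a single linear neuron with three weights, a delta-rule update, and hard-coded `TRAINED_WEIGHTS` standing in for a network that has already learned `ORIGINAL_SET`. All names and values are hypothetical. The sketch compares training the new response by itself with cycling through the original set plus the new response, both starting from the existing weights.

```python
# Assumed stand-in for an already trained network: these weights give the
# desired response to both patterns in ORIGINAL_SET.
ORIGINAL_SET = [([1.0, 0.0, 0.0], 1.0),
                ([1.0, 1.0, 0.0], 0.0)]
NEW_ITEM = ([1.0, 0.0, 1.0], 0.5)      # the response to be added
TRAINED_WEIGHTS = [1.0, -1.0, 0.0]


def output(weights, inputs):
    """Weighted sum of the inputs (a linear neuron)."""
    return sum(w * x for w, x in zip(weights, inputs))


def train_step(weights, inputs, target, rate=0.1):
    """One small delta-rule update; returns a new weight list."""
    error = target - output(weights, inputs)
    return [w + rate * error * x for w, x in zip(weights, inputs)]


def cycle(weights, items, cycles):
    """Cycle through the items repeatedly, training each one slightly."""
    for _ in range(cycles):
        for inputs, target in items:
            weights = train_step(weights, inputs, target)
    return weights


def errors(weights, items):
    """Absolute output error for each item."""
    return [abs(target - output(weights, inputs)) for inputs, target in items]


everything = ORIGINAL_SET + [NEW_ITEM]
new_only = cycle(TRAINED_WEIGHTS, [NEW_ITEM], cycles=1000)
rehearsed = cycle(TRAINED_WEIGHTS, everything, cycles=1000)
new_only_errors = errors(new_only, everything)
rehearsed_errors = errors(rehearsed, everything)
print("new item alone  :", [round(e, 3) for e in new_only_errors])
print("original + new  :", [round(e, 3) for e in rehearsed_errors])
```

In this toy, training the new item alone leaves an error of about 0.25 on each original response, while cycling through the original set plus the new item brings all three errors close to zero. The cost is that the original training set must still be available.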
Another re-training procedure, known as "pseudorehearsal," has also been shown to provide very good results, without the need to store the original training set (see references below). Certainly, many studies of such options have been performed.
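The mechanism behind pseudorehearsal can be sketched as follows. This is a hedged toy, not a reproduction of any published experiment. It assumes a single linear neuron and hard-coded `TRAINED_WEIGHTS`, and it uses a fixed list of probe inputs (`PROBES`) for reproducibility, whereas pseudorehearsal as usually described draws its probe inputs at random. The pseudo-items are made by recording the trained network's own outputs for the probes, so the original training set is never consulted during retraining. It appears below only to measure forgetting.

```python
TRAINED_WEIGHTS = [1.0, -1.0, 0.0]         # assumed result of earlier training
ORIGINAL_SET = [([1.0, 0.0, 0.0], 1.0),    # used ONLY to measure forgetting
                ([1.0, 1.0, 0.0], 0.0)]
NEW_ITEM = ([1.0, 0.0, 1.0], 0.5)          # the response to be added
PROBES = [[1.0, 0.0, 0.0],                 # fixed, hypothetical probe inputs
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]


def output(weights, inputs):
    """Weighted sum of the inputs (a linear neuron)."""
    return sum(w * x for w, x in zip(weights, inputs))


def train_step(weights, inputs, target, rate=0.1):
    """One small delta-rule update; returns a new weight list."""
    error = target - output(weights, inputs)
    return [w + rate * error * x for w, x in zip(weights, inputs)]


def cycle(weights, items, cycles=200):
    """Cycle through the items repeatedly, training each one slightly."""
    for _ in range(cycles):
        for inputs, target in items:
            weights = train_step(weights, inputs, target)
    return weights


def old_errors(weights):
    """Absolute error on the original responses."""
    return [abs(t - output(weights, x)) for x, t in ORIGINAL_SET]


# Pseudo-items: whatever the trained network currently answers to each probe.
pseudo_items = [(probe, output(TRAINED_WEIGHTS, probe)) for probe in PROBES]

naive = cycle(TRAINED_WEIGHTS, [NEW_ITEM])
pseudo = cycle(TRAINED_WEIGHTS, pseudo_items + [NEW_ITEM])
naive_old_errors = old_errors(naive)
pseudo_old_errors = old_errors(pseudo)
print("old-response error, new item alone  :", naive_old_errors)
print("old-response error, pseudorehearsal :", pseudo_old_errors)
```

In this deliberately tiny linear example the outcome is a compromise: the error on the original responses drops from about 0.25 to about 0.18, and the new response is only partly learned, because three probes pin down every weight of a three-weight neuron. It is meant to show the mechanism, not the quality of results reported for larger networks.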
. . . . . . .
Multitemporal Synapses Provide a More General Solution
. . . . . . .
Relatively recently, a new learning structure and method called multi-temporal synapses has been developed. Among other things, multitemporal synapses eliminate the problems associated with catastrophic forgetting in artificial neural networks. The idea works by embracing forgetting as simply an inevitable, and even necessary, part of continuously adapting to present-moment details. Multitemporal synapses are able to learn, and continuously forget, at different rates. This, in turn, allows the system to continuously learn and adapt to each new present moment that happens along.
Slower (or permanent) weights in multitemporal synapses learn from their faster counterparts at the same connection-points. Because of this, they are able to be trained gradually, by many different present-moment experiences, over time.
The function of the fast-learning weights is to quickly learn to respond to each present moment as it is encountered, and to just as quickly forget those lessons when they are no longer needed. The slower weights absorb the various encounters in an interleaved fashion, and at a slow rate. They are therefore continuously and gradually trained on a repertoire of multiple present moments as they pass. In the general case, the repertoire will be the most recently relevant subset of all present moments experienced.
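The division of labor between fast and slow weights can be illustrated with a toy two-rate connection. This is only a sketch of the general idea described above and should not be read as the actual multitemporal synapse design. The class `TwoRateSynapse`, its update rule, the choice to sum the two weights into one effective strength, and every rate value are assumptions made for this example.

```python
class TwoRateSynapse:
    """Hypothetical single connection with a fast and a slow weight."""

    def __init__(self, fast_rate=0.5, fast_decay=0.1, slow_rate=0.05):
        self.fast = 0.0
        self.slow = 0.0
        self.fast_rate = fast_rate
        self.fast_decay = fast_decay
        self.slow_rate = slow_rate

    @property
    def weight(self):
        # Assumption: the effective strength is the sum of both weights.
        return self.fast + self.slow

    def step(self, target, stimulus=1.0):
        """Experience one present moment with a desired response."""
        error = target - self.weight * stimulus
        self.fast += self.fast_rate * error * stimulus  # learn quickly
        self.fast *= 1.0 - self.fast_decay              # forget quickly
        self.slow += self.slow_rate * self.fast         # absorb slowly


def run(synapse, target, steps):
    """Repeat one present moment; return (effective weight, slow weight)."""
    for _ in range(steps):
        synapse.step(target)
    return synapse.weight, synapse.slow


syn = TwoRateSynapse()
early = run(syn, target=1.0, steps=5)      # a new situation has just begun
settled = run(syn, target=1.0, steps=195)  # the same situation persists
shifted = run(syn, target=0.0, steps=5)    # a brief, different situation
for label, (weight, slow) in [("early", early), ("settled", settled),
                              ("shifted", shifted)]:
    print(f"{label:8s} weight={weight:.3f} slow={slow:.3f}")
```

With these assumed rates, five steps are enough for the effective weight to reach roughly 0.88 while the slow weight is still near 0.16. If the situation persists, the slow weight takes over almost the entire response and the fast weight fades toward zero. A brief switch to a different target then pulls the effective weight down to about 0.12 within five steps, while the slow weight still holds more than 0.8 of the earlier lesson.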