Author here.
I've been working on catastrophic forgetting in transformers, and I've found a potential solution with strong results: a weird attention-free encoder that treats embeddings as waves instead of vectors. Motivation and a summary of the solution are below.
Models that start from pretrained embeddings learn faster; this is well known and exploited in low-resource NLP. So if we could structure the embedding "map" faster, the decoder should learn a lot faster.
So we isolated the alignment cost of this map with a method we call ISMR. We found that a 20-layer model with 14.5% embeddings does not learn faster than a 1-layer model with 80% embeddings.
Then we invented a weird encoder we call "PRISM". It treats embeddings as waves instead of vectors and rapidly teleports them to the relevant frequencies. It looks like it can learn new concepts 5-shot with nearly no forgetting (-0.7 to -0.84 BLEU), while a standard transformer encoder-decoder suffers catastrophic forgetting (more than 10 BLEU lost).
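To give a rough idea of what "treating embeddings as waves" means mechanically, here is a minimal PyTorch sketch of an attention-free, FFT-based mixing block. This is not the actual PRISM code (the class name, shapes, and the learned per-frequency filter are placeholders I'm using for illustration); it just shows the general shape of the idea: transform the token sequence into the frequency domain, reweight frequencies with a learned filter, and transform back.

    # Not the real PRISM encoder -- an illustrative, attention-free spectral
    # mixing block. All names and shapes here are placeholders.
    import torch
    import torch.nn as nn

    class SpectralMixerBlock(nn.Module):
        """Mix tokens in the frequency domain instead of with attention."""

        def __init__(self, d_model: int, seq_len: int):
            super().__init__()
            n_freq = seq_len // 2 + 1  # length of rfft output over the sequence axis
            # Learned complex filter: one gain per (frequency, channel) pair.
            self.filter_real = nn.Parameter(torch.ones(n_freq, d_model))
            self.filter_imag = nn.Parameter(torch.zeros(n_freq, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, d_model)
            spec = torch.fft.rfft(x, dim=1)                    # embeddings as waves
            filt = torch.complex(self.filter_real, self.filter_imag)
            spec = spec * filt                                 # reweight frequencies
            mixed = torch.fft.irfft(spec, n=x.size(1), dim=1)  # back to token space
            x = self.norm1(x + mixed)                          # residual + norm
            return self.norm2(x + self.ffn(x))                 # position-wise FFN

    # Quick shape check
    block = SpectralMixerBlock(d_model=64, seq_len=32)
    print(block(torch.randn(2, 32, 64)).shape)  # torch.Size([2, 32, 64])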