Encoder Decoder comes from the original transformers implementation way back in 2017. If you look at figure 1 you'll see what the first transformer ever looked like.
Since that time different implementations of transformers use either just the encoder portion, or the decoder portion, or both. Its a deep topic so hard to summarize here, but Gemini explains it really well! Hope this gets you started on some prompting to learn more
The original transformer paper from google was encoder-decoder, but then encoder BERT was hot and then decoder GPT was hot; now encoder-decoder is hot again!
Decoders are good at generative tasks - chatbots etc.
Encoders are good at summaration.
Encoder decoders are better at summaration. It’s steps towards “understanding” (quotes needed).
A encoder-decoder model splits input and output. This makes sense for translation tasks, summarization, etc. They're good when there's a clear separation of "understand the task" and "complete the task", but you can use it for anything really. A example would be send "Translate to english: Le chat est noir." to the encoder, the encoder processes everything in a single step, that is understand the task as a whole, then the output of the encoder is fed to the decoder and then the decoder runs one token at a time.
GPT ditches the encoder altogether and just runs the decoder with some slight changes, this makes it more parameter efficient but tends to hallucinate more due to past tokens containing information that might be wrong. You can see it as the encoder running on each token as they are read/generated.
minimaxir•40m ago
I get not trying to cannibalize Gemma, but that's weird. A 540M multimodel model that performs well on queries would be useful and "just post-train it yourself" is not always an option.
jeffjeffbear•27m ago