T5Gemma 2: The next generation of encoder-decoder models

https://blog.google/technology/developers/t5gemma-2/

153•milomg•1mo ago

Comments

minimaxir•1mo ago

> Note: we are not releasing any post-trained / IT checkpoints.

I get not trying to cannibalize Gemma, but that's weird. A 540M multimodal model that performs well on queries would be useful, and "just post-train it yourself" is not always an option.

jeffjeffbear•1mo ago

Isn't finetuning the point of the T5 style models, since they perform better for smaller parameter counts?

refulgentis•1mo ago

It’ll be a major pain in the ass to replicate exactly what they did to make it long context and multimodal. Sucks too, because the smol Gemma 3s with the same parameter count were neither.

jeffjeffbear•1mo ago

> https://huggingface.co/google/t5gemma-2-1b-1b

From here it looks like it still is long context and multimodal though?

> Inputs and outputs
>
> Input:
>
> Text string, such as a question, a prompt, or a document to be summarized
>
> Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
>
> Total input context of 128K tokens
>
> Output:
>
> Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
>
> Total output context up to 32K tokens

rhdunn•1mo ago

If you are finetuning the model you need to replicate the training conditions so you don't remove those capabilities. If you just finetune a multi-modal model on text it will lose some of the vision capabilities as the text part of the model will drift from the vision, audio, etc. models. A similar thing happens with finetuning reasoning models.

Even if you did finetune the model with text and images, you could run into issues if your image descriptions differ from the ones it was trained with. You could probably work around that by getting the model to describe the images, but you'll still need to audit the results to correct any issues or add what you are training for.

You can also run into overfitting if your data does not include enough of the variation that the original training set had.

Using different training parameters could also affect the model's capabilities. Just knowing things like the input context isn't enough.

CuriouslyC•1mo ago

This is the thing that kills me about SFT. It was sensible when most of the compute in a model was in pretraining and the RL was mostly for question answering. Now that RL is driving model capabilities it doesn't make much sense.

On the other hand, RL on deployed systems looks promising to essentially JIT optimize models. Experiments with model routers and agentic rag have shown good results.

navvyeanand•1mo ago

This is very true. However, I wonder how much of this can be mitigated by using training data from other open-source models like Olmo3 for textual data, Emu3.5 for vision?

sundarurfriend•1mo ago

This made me compare the figures, and: did they accidentally switch those around, or are the Post-training Reasoning and Factuality scores actually significantly lower than the Pre-training ones?

Edit: Just noticed

> Also note pre-training and post-training benchmarks are different, so scores are not comparable across plots.

The paper gives more details about the specific benchmarks and the scores obtained in them: https://arxiv.org/html/2512.14856v1#S4

davedx•1mo ago

What is an encoder-decoder model, is it some kind of LLM, or a subcomponent of an LLM?

wongarsu•1mo ago

The announcement of the original T5Gemma goes into some more detail [1]. I'd describe it as two LLMs stacked on top of each other: the first understands the input, the second generates the output. "Encoder-decoder models often excel at summarization, translation, QA, and more due to their high inference efficiency, design flexibility, and richer encoder representation for understanding input"

1: https://developers.googleblog.com/en/t5gemma/

canyon289•1mo ago

Hi, I'm not on the T5Gemma team but work on Gemma in general.

Encoder-decoder comes from the original transformer implementation way back in 2017. If you look at figure 1 of the paper linked below, you'll see what the first transformer ever looked like.

Since that time, different implementations of transformers use either just the encoder portion, or the decoder portion, or both. It's a deep topic, so hard to summarize here, but Gemini explains it really well! Hope this gets you started on some prompting to learn more.

https://arxiv.org/pdf/1706.03762

wood_spirit•1mo ago

A decoder predicts the next word (token) to iteratively generate a whole sentence. An encoder masks a word in the middle of a sentence and tries to predict that middle.

The original transformer paper from Google was encoder-decoder, but then encoder BERT was hot and then decoder GPT was hot; now encoder-decoder is hot again!

Decoders are good at generative tasks - chatbots etc.

Encoders are good at summarization.

Encoder-decoders are better at summarization. It’s steps towards “understanding” (quotes needed).
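To make the two objectives above concrete, here is a minimal sketch in plain Python. It only builds the training examples, not a model, and the function names and the `<mask>` token are made up for illustration. The masked variant is the BERT-style one described above; it is not claimed to be the objective T5Gemma 2 was trained with.

```python
MASK = "<mask>"  # hypothetical placeholder token


def next_token_examples(tokens):
    """Decoder-style objective: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_token_example(tokens, position):
    """Encoder-style objective: hide one token and predict it from both sides."""
    corrupted = list(tokens)
    target = corrupted[position]
    corrupted[position] = MASK
    return corrupted, target


tokens = "the cat is black".split()

# One (context, target) pair per position after the first.
causal = next_token_examples(tokens)

# The context includes tokens on both sides of the hidden word.
masked = masked_token_example(tokens, 2)
```

The decoder-style examples never see anything to the right of the target, while the masked example does, which is the distinction the comment is drawing.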

nodja•1mo ago

It's an alternate architecture for LLMs; they actually predate modern LLMs. An encoder-decoder model was actually the model used in the "Attention Is All You Need" paper that introduced the transformer and essentially gave birth to modern LLMs.

An encoder-decoder model splits input and output. This makes sense for translation tasks, summarization, etc. They're good when there's a clear separation of "understand the task" and "complete the task", but you can use it for anything really. An example would be to send "Translate to english: Le chat est noir." to the encoder. The encoder processes everything in a single step, that is, it understands the task as a whole. Then the output of the encoder is fed to the decoder, and the decoder runs one token at a time.

GPT ditches the encoder altogether and just runs the decoder with some slight changes. This makes it more parameter efficient, but it tends to hallucinate more due to past tokens containing information that might be wrong. You can see it as the encoder running on each token as they are read/generated.

Edit: On re-read I noticed it might not be clear what I mean by past tokens containing wrong information. I mean that for each token the model generates a hidden state, and those states don't change. So, for example, an input of 100 tokens will have 100 hidden states; the states are generated all at once in the encoder model, and one token at a time in decoder models. Since the decoder doesn't have the full information yet, the hidden state will contain extra information that might not have anything to do with the task, or might even confuse it.

For example, suppose you give the model the task "Please translate this to Chinese: Thanks for the cat, he's cute. I'm trying to send it to my friend in Hong Kong." An enc-dec model would read the whole thing at once and understand that you mean Cantonese. But a decoder-only model would "read" it one token at a time and could trip in several places: 1. assume Chinese means Mandarin Chinese, not Cantonese; 2. assume that the text after "cute." is also something to translate and not a clarification. This would leave several tokens' worth of extra information that could confuse the model. Models are trained with this in mind, so they're used to tokens having lots of different meanings embedded in them and later tokens narrowing those meanings down, but it might cause models to ignore certain tokens, or hallucinate.
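A toy illustration of the "hidden states don't change" point, under heavy assumptions: uniform averaging stands in for attention, there is a single layer, and the 2-dimensional vectors are made up. It is a sketch of the dependency structure only, not of any real model.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def causal_states(embeddings):
    """Decoder-style: the state at position i only sees tokens 0..i."""
    return [mean(embeddings[: i + 1]) for i in range(len(embeddings))]


def bidirectional_states(embeddings):
    """Encoder-style: every position sees the whole input."""
    pooled = mean(embeddings)
    return [pooled for _ in embeddings]


prefix = [[1.0, 0.0], [0.0, 1.0]]   # made-up token vectors
full = prefix + [[4.0, 4.0]]        # the same input with a late "clarification" token

# Causal states for the prefix are fixed once computed; a later token cannot revise them.
causal_prefix_unchanged = causal_states(full)[:2] == causal_states(prefix)

# Bidirectional states for the same positions shift once the extra token is visible.
bidirectional_prefix_unchanged = bidirectional_states(full)[:2] == bidirectional_states(prefix)
```

In this toy setup `causal_prefix_unchanged` is `True` and `bidirectional_prefix_unchanged` is `False`: the encoder-style states for "Chinese" can reflect "Hong Kong", while the decoder-style ones cannot.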

subscribed•1mo ago

Your last paragraph is amazing, and explains that so clearly, thank you!

potatoman22•1mo ago

What's the use case of models like T5 compared to decoder-only models like Gemma? More traditional ML/NLP tasks?

sigmoid10•1mo ago

They trained it to be used like any other decoder-only model. So text generation, essentially. But you could use the encoder part for things like classification without much effort. Then again, you can also slap a classifier head on any decoder model. The main reason they seem to be doing this is to have swappable encoder/decoder parts in an otherwise standard LLM. But I'm not sure if that is really something we needed.
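For readers unfamiliar with what "a classifier head" means here, a minimal sketch in plain Python: mean-pool the per-token hidden states, apply one linear layer, and take a softmax. The hidden states, weights, and function names are all made up for illustration; a real setup would take the states from the encoder (or decoder) and learn the weights.

```python
import math


def mean_pool(hidden_states):
    """Average the per-token vectors into one vector for the whole input."""
    n = len(hidden_states)
    return [sum(h[k] for h in hidden_states) / n for k in range(len(hidden_states[0]))]


def linear(vector, weights, bias):
    """One logit per class: dot product of each weight row with the vector, plus bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(hidden_states, weights, bias):
    probs = softmax(linear(mean_pool(hidden_states), weights, bias))
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


# Made-up values: two tokens, hidden size 2, two classes.
hidden_states = [[1.0, 0.0], [3.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

label, probs = classify(hidden_states, weights, bias)
```

The head itself is tiny either way; the difference between the two options in the comment is only where `hidden_states` comes from.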

VHRanger•1mo ago

Encoder/decoder is much, much more efficient for finetuning and inference than decoder-only models.

Historically, T5 models have been good when you finetune them into task-specific models (translation, summarization, etc).

sigmoid10•1mo ago

I have actually worked on encoder-decoder models. The issue is, finetuning itself is becoming historic. At least for text processing. If you spend a ton of effort today to finetune on a particular task, chances are you would have reached the same performance using a frontier LLM with the right context in the prompt. And if a big model can do it today, in 12 months there will be a super cheap and efficient model that can do it as well. For vision you can still beat them, but only with huge effort, and the gap is shrinking constantly. And T5 is not even multimodal. I don't think these will change the landscape in any meaningful way.

VHRanger•1mo ago

This T5 is multimodal.

Also, a hint: you can pretty easily create a finetuning dataset from a frontier LLM, then use it to finetune those T5 models and effectively distill them pretty fast these days.
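A rough sketch of that dataset-building step, in plain Python. The `teacher` argument stands in for a call to whatever large model you use; `fake_teacher` below is a hypothetical stub so the example runs offline, and the `input`/`target` field names are an assumption rather than a format any particular trainer requires.

```python
import json


def build_distillation_records(inputs, teacher):
    """Pair each input with the teacher's output, skipping empty responses."""
    records = []
    for text in inputs:
        target = teacher(text)
        if target and target.strip():
            records.append({"input": text, "target": target.strip()})
    return records


def to_jsonl(records):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_teacher(prompt):
    # Hypothetical stand-in for a frontier-model API call.
    return prompt.upper()


records = build_distillation_records(["summarize: a", "summarize: b"], fake_teacher)
jsonl = to_jsonl(records)
```

The resulting input/target pairs are the shape of data an encoder-decoder model is finetuned on: the input goes to the encoder and the target is what the decoder learns to produce.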

refulgentis•1mo ago

Only thing it buys you is a more “natural” embedding, i.e. the encoder can get you a bag o’ floats representing a text, but that also doesn’t mean it’s naturally a good embedding engine - I strongly assume you’d do further training.

Decoder gets you the autoregressive generation you’d use for an llm.

Beyond that, there’s this advantage of having small LLMs train better, they kinda hit a wall a year or two ago IMHO. E.g. original Gemma 3 small models were short context and only text.

As far as I understand you have to pay for that by 2x inference cost at runtime

(Would be happy to be corrected on any of the above. I maintain a multi-platform app that has llama.cpp inference in addition to standard LLMs, and I do embeddings locally, so I’m operating from a practical understanding more than an ML PhD.)

VHRanger•1mo ago

In general encoder+decoder models are much more efficient at inference than decoder-only models because they run over the entire input all at once (which leverages parallel compute more effectively).

The issue is that they're generally harder to train (they need input/output pairs as a training dataset) and don't naturally generalize as well.

GaggiX•1mo ago

> In general encoder+decoder models are much more efficient at inference than decoder-only models because they run over the entire input all at once (which leverages parallel compute more effectively).

Decoder-only models also do this, the only difference is that they use a masked attention.
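A small sketch of the two masks in plain Python, with made-up function names. Each entry says whether the query position (row) may attend to the key position (column). In both cases all rows can be computed in one pass over the prompt; only the set of allowed pairs differs.

```python
def causal_mask(n):
    """Decoder-style: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Encoder-style: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def allowed_pairs(mask):
    """Count the (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


n = 4
causal = causal_mask(n)   # lower-triangular pattern
full = full_mask(n)       # all True

causal_count = allowed_pairs(causal)   # n * (n + 1) / 2 pairs
full_count = allowed_pairs(full)       # n * n pairs
```

For a 4-token prompt that is 10 allowed pairs under the causal mask versus 16 under the full one.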

DoctorOetker•1mo ago

What is the "X" in the pentagonal performance comparison, is it multilingual performance or something else?

killerstorm•1mo ago

They are comparing 1B Gemma to 1+1B T5Gemma 2. Obviously a model with twice as many parameters can do better. That says absolutely nothing about the benefits of the architecture.

kamranjon•1mo ago

You may not have seen this part: "Tied embeddings: We now tie the embeddings between the encoder and decoder. This significantly reduces the overall parameter count, allowing us to pack more active capabilities into the same memory footprint — crucial for our new compact 270M-270M model."
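A back-of-the-envelope sketch of what tying saves. All numbers below are illustrative placeholders, not figures from the paper or the blog post, and the calculation ignores any separate output projection.

```python
def embedding_params(vocab_size, hidden_size):
    """One embedding table holds one vector of size hidden_size per vocabulary entry."""
    return vocab_size * hidden_size


def total_params(encoder_body, decoder_body, vocab_size, hidden_size, tied):
    """Body parameters plus one shared table (tied) or one table per side (untied)."""
    tables = 1 if tied else 2
    return encoder_body + decoder_body + tables * embedding_params(vocab_size, hidden_size)


# Illustrative placeholders only.
VOCAB = 250_000
HIDDEN = 640
BODY = 100_000_000  # non-embedding parameters per side

untied = total_params(BODY, BODY, VOCAB, HIDDEN, tied=False)
tied = total_params(BODY, BODY, VOCAB, HIDDEN, tied=True)
saved = untied - tied  # exactly one embedding table
```

The saving is one `vocab_size * hidden_size` table, so it matters most when the vocabulary is large relative to the rest of the model and shrinks in relative terms as the body grows.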

killerstorm•1mo ago

I have seen this part. In fact I checked the paper itself, where they provide more detailed numbers: it's still almost double the base Gemma. Reuse of embeddings and attention doesn't make that much difference, as most weights are in the MLPs.

yorwba•1mo ago

Since the encoder weights only get used for the prefixed context and then the decoder weights take over for generation, the compute requirements should be roughly the same as for the decoder-only model. Obviously an architecture that can make use of twice the parameters in the same time is better. They should've put some throughput measurements in the paper, though...
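That argument can be written out as rough arithmetic. The sketch below uses the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass, which is an assumption here, and it ignores attention over the context, the decoder's cross-attention to the encoder output, and memory bandwidth. The sizes and token counts are illustrative.

```python
def forward_flops(params, tokens):
    """Rule-of-thumb estimate: about 2 FLOPs per parameter per processed token."""
    return 2 * params * tokens


def decoder_only_cost(params, prompt_tokens, generated_tokens):
    """Every token, prompt or generated, passes through the same weights."""
    return forward_flops(params, prompt_tokens + generated_tokens)


def encoder_decoder_cost(enc_params, dec_params, prompt_tokens, generated_tokens):
    """Prompt tokens pass through the encoder, generated tokens through the decoder."""
    return forward_flops(enc_params, prompt_tokens) + forward_flops(dec_params, generated_tokens)


# Illustrative sizes: a 1B decoder-only model versus a 1B + 1B encoder-decoder.
ONE_B = 1_000_000_000
PROMPT, GENERATED = 1_000, 200

dec_only = decoder_only_cost(ONE_B, PROMPT, GENERATED)
enc_dec = encoder_decoder_cost(ONE_B, ONE_B, PROMPT, GENERATED)
```

Under these simplifications the two estimates come out equal, since each token touches roughly one billion parameters in either layout. Whether that holds in practice depends on the terms left out, which is why throughput measurements would be needed.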

o1inventor•1mo ago

> 128k context.

don't care. prove effective context length or gtfo.
