Obviously, you can't do it in pre-training. But you can add it later as an optional 'extra' vector, I think. E.g. `input_embedding + MLP(prev_output) * alpha`. Alpha is zero during pre-training.
What if you trained a separate thinking phase using the auto encoder, though? Might be more efficient, and then you've got it using neuralese internally.
Actually, reading the (summary) paper - they tried your idea and had trouble with it for a different reason:
> Once the generative head predicts the next vector , a natural next step would be to feed it directly as input to the Transformer for predicting . However, we found that the model struggles to unpack the semantic information from such a compact representation. Instead, we ground the autoregressive process back in the more structured discrete space, where the predicted is passed through the autoencoder to reconstruct the K tokens.Still — props to the team for going after the real root of inefficiency, not just piling on more layers. If nothing else, this is one to watch if you care about scaling models smarter.
When I'm thinking about math proofs, sometimes I can have a single idea which can be unfolded into a hundred lines of proof
Maybe I'm getting the wrong analogy here, but if vectors = ideas then K should depend on the vector
I also wonder how far they can push K if other aspects are tweaked. The approach of just doubling each parameter each time leaves a lot of space between the chosen value and the next value known to not work.
mentalgear•2mo ago
- Diversity: This term encourages the model to generate a diverse set of samples, preventing mode collapse. - Fidelity: This term rewards the model for making predictions that are close to the ground-truth
I'm wondering if a continuos next-vector generative approach also increase innate "reasoning" capabilities of the model, since it could potentially capture more of the semantics of the data vs just tokens.
barrenko•2mo ago
mike_hearn•2mo ago