I'd love to see a full circle of hypernetworks, with both models continuously updated through generated LoRAs and the hypernetwork updated to accommodate the new model state. You'd need a meta-hypernetwork to apply LoRAs to the hypernetwork itself, and then you could effectively have continuous learning.
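To make the first half of that loop concrete, here's a minimal sketch of a hypernetwork that maps some embedding of the model's state to LoRA factors for a single linear layer. All names, shapes, and the conditioning signal are made up for illustration; a real system would emit factors for many layers.

```python
# Hypothetical sketch: a hypernetwork emits LoRA factors (A, B) for one target
# linear layer, conditioned on an embedding of the model's new state.
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    def __init__(self, cond_dim: int, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.rank = rank
        self.in_features = in_features
        self.out_features = out_features
        # One head per LoRA factor; illustrative only.
        self.to_A = nn.Linear(cond_dim, rank * in_features)
        self.to_B = nn.Linear(cond_dim, out_features * rank)

    def forward(self, cond: torch.Tensor):
        A = self.to_A(cond).view(self.rank, self.in_features)
        B = self.to_B(cond).view(self.out_features, self.rank)
        return A, B

def apply_lora(base: nn.Linear, A: torch.Tensor, B: torch.Tensor, x: torch.Tensor, scale: float = 1.0):
    # y = W x + scale * B (A x): frozen base layer plus the generated low-rank update.
    return base(x) + scale * (x @ A.t() @ B.t())

# Usage sketch
base = nn.Linear(512, 512)
hyper = LoRAHyperNet(cond_dim=64, in_features=512, out_features=512)
state_embedding = torch.randn(64)      # stand-in for "the new model state"
A, B = hyper(state_embedding)
y = apply_lora(base, A, B, torch.randn(2, 512))
```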
I'm sure this is something the big labs are trying, but from the outside, as a user of LLMs, it feels like people don't talk about this very much; instead, the focus right now is on better training (e.g. reinforcement learning), with the assumption that anything not learned during training will be stuffed into the context somehow as needed. But from a naive perspective, the lack of learning from experience after training seems like the biggest thing standing between us and AGI.
There are tons of benchmarks around this that you can easily run with one GPU.
It's compute only in the sense that the only way to do it is to retrain a model from scratch at every step.
If you solve CL with a CNN you just created AGI.
Hypothetically (and perhaps more plausibly), a continually learning model that adapts to the context of a particular org / company / codebase / etc. could even be desirable.
does a model need to retain its generality?
Only if you want it to remain smart.
I completely agree that figuring out a safe way to continually train feels like the biggest blocker to AGI.
Many people here are right: compute, collapse, forgetting, whatever.
The only "real" way to do this would be: 1. Train a model 2. New data 3. Retrain the model in full + new data 4. Repeat 5. You still have no garuntee on the "time" aspect though.
But CL as a field basically has zero answers on how to do this in a true sense. It's crazy hard because the "solutions" are self-contradictory in many ways.
We need to expand the model's representation space while keeping the previous representation space nearly the same.
Basically, you need to modify it without changing it.
The most annoying part is that even the smallest of natural brains do this easily. I have a long-winded theory, but basically it boils down to this: AI likely needs to "sleep" or rest somehow.
But the clone couldn't run without sleeping? So that's more of a teammate than a clone.
One works while the other sleeps, and then they swap.
If this method ever worked, our current alignment methods would get chucked out the window; those would be two completely different AIs.
1. Preventing collapse -> the model gets "full" (https://arxiv.org/pdf/1612.00796; rough sketch of its penalty below)
2. Forgetting causes better generalization (https://arxiv.org/abs/2307.01163)
3. Unknown paper that connects these two - allow a "forgetting" model that improves generalization over time. I tried for a long time to make this, but it's a bit difficult.
A fun implication is that, if this is true, AGI will need "breaks" and will likely need to consume non-task content of high variety, much like a person does.
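For reference, the paper linked in item 1 is the elastic weight consolidation (EWC) paper; the core idea is a quadratic penalty that anchors the parameters the old task cared about, weighted by a diagonal Fisher estimate. A rough sketch, not a faithful reimplementation, assuming the Fisher estimate and old parameters are precomputed:

```python
# Simplified EWC penalty (Kirkpatrick et al., arXiv:1612.00796):
# penalize moving parameters that were important for the old task.
import torch

def ewc_penalty(model, old_params, fisher_diag, lam: float = 1000.0):
    loss = torch.zeros(())
    for name, p in model.named_parameters():
        if name in fisher_diag:
            loss = loss + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss

# During training on the new task:
#   total_loss = new_task_loss + ewc_penalty(model, old_params, fisher_diag)
```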
LoRA paper: https://arxiv.org/abs/2106.09685
Two things that stand out:
- The knowledge incorporation results (47% vs 46.3% with GPT-4.1 data, both much higher than the small-model baseline) show the model does discover better training formats, not just more data. The catastrophic forgetting problem remains unsolved, though, and it's not completely clear whether data diversity is actually improved.
- The computational overhead is brutal - 30-45 seconds per reward evaluation makes this impractical for most use cases. But for high-value document processing where you really need optimal retention, it could be worth it.
The restriction to tasks with explicit evaluation metrics is the main limitation. You need ground truth Q&A pairs or test cases to compute rewards. Still, for domains like technical documentation or educational content where you can generate evaluations, this could significantly improve how we process new information.
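Roughly, the reward loop that restriction implies looks like the sketch below. finetune_copy and answer are placeholder functions I made up, not any paper's actual API; the point is just that the reward is downstream accuracy on ground-truth Q&A after a cheap candidate update.

```python
# Hypothetical reward for a candidate "self-edit": fine-tune a copy of the
# model on the generated training data, then score it on held-out Q&A pairs.
def reward_for_self_edit(base_model, self_edit_text, qa_pairs, finetune_copy, answer):
    candidate = finetune_copy(base_model, self_edit_text)   # e.g. a quick LoRA update
    correct = sum(
        answer(candidate, question).strip() == gold.strip()
        for question, gold in qa_pairs
    )
    return correct / len(qa_pairs)   # accuracy on ground-truth Q&A = the reward
```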
Feels like an important step toward models that can adapt their own learning strategies, even if we're not quite at the "continuously self-improving agent" stage yet.
"Forgetting correctly" is something most human brains are exceptionally good at, too. I wonder how that works...
They don't just "forget"; that information can come back at a later time if you continue to train.
So basically, any time a model is trained, you need to check its entire memory, not just a small part.
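Something like this, run against every evaluation suite the model has ever been expected to pass (all names here are made up, just to pin down the idea):

```python
# After any update, re-check the model against everything it was ever supposed
# to know, not just the data it was just trained on.
def full_memory_check(model, retained_suites, baselines, tolerance=0.01):
    """retained_suites: {task_name: eval_fn(model) -> score}; baselines: pre-update scores."""
    scores = {name: evaluate(model) for name, evaluate in retained_suites.items()}
    regressions = {name: (baselines[name], s)
                   for name, s in scores.items()
                   if s < baselines[name] - tolerance}
    return scores, regressions   # non-empty regressions = something was forgotten
```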
This is often associated with learning tools like Anki and such, but the real world is all about encountering things at certain frequencies (day/night cycles, seasons, places you visit, people you see... everything, really).
I'm wondering if maybe there's some sort of inverse to SR (spaced repetition)?
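For context, the SR side of that is roughly the SM-2-style update behind Anki-like tools: correctly recalled items get exponentially growing review intervals. The "inverse" would presumably let items you never re-encounter decay instead. A rough sketch of the standard version:

```python
# Rough SM-2-style scheduling sketch (simplified, not exactly what Anki ships).
def sm2_update(interval_days: float, ease: float, repetitions: int, quality: int):
    """quality: 0-5 self-rated recall. Returns (new_interval, new_ease, new_repetitions)."""
    if quality < 3:                      # failed recall: start the item over
        return 1.0, ease, 0
    if repetitions == 0:
        interval = 1.0
    elif repetitions == 1:
        interval = 6.0
    else:
        interval = interval_days * ease  # intervals grow multiplicatively
    ease = max(1.3, ease + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return interval, ease, repetitions + 1
```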
"NEAT/HyperNEAT" (Neuroevolution of Augmented Topologies) [0]
I'm no ML practitioner, but as I understand it, the primary difference between NEAT and what is described in this paper is that while NEAT evolves the topology of the network, this paper seems to evolve the weights.
Seems like two approaches trying to solve the same problem -- one evolving the network structure, and the other the weights (rough sketch of the latter below).
Those 2 friends are quite possibly the most intelligent people I've ever met, and they were very convinced that RL and evolutionary algorithms were the path forward in ML.
[0] https://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_t...
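To make the weights-only side concrete: a toy evolution-strategy loop over a flat weight vector of a fixed-topology network, in contrast to NEAT, which also mutates the structure. fitness is a placeholder and everything here is purely illustrative:

```python
# Toy (1+lambda)-style evolution strategy on a fixed set of weights.
import numpy as np

def evolve_weights(fitness, dim, pop_size=50, sigma=0.1, generations=200):
    rng = np.random.default_rng(0)
    best = rng.normal(size=dim)                       # flat weight vector of a fixed network
    best_score = fitness(best)
    for _ in range(generations):
        population = best + sigma * rng.normal(size=(pop_size, dim))  # mutate weights only
        scores = np.array([fitness(w) for w in population])
        if scores.max() > best_score:                 # keep the fittest mutant
            best, best_score = population[np.argmax(scores)], scores.max()
    return best, best_score
```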
2028 is pretty much tomorrow… fascinating insight