During wake, facts from conversation are injected directly into MLP weights via MEMIT (a single forward pass, instant recall). During sleep, the system audits which memories degraded, refreshes them with null-space constraints (guaranteeing orthogonality to working memories), then progressively transfers knowledge into LoRA — like biological memory consolidation from hippocampus to neocortex.
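The null-space constraint can be illustrated in a few lines of PyTorch: project the refresh update so it is orthogonal to the key directions of existing working memories, which guarantees the refresh cannot disturb them. This is a toy linear-algebra sketch under invented dimensions, not the project's actual MEMIT code; the name `nullspace_project` is illustrative.

```python
import torch

def nullspace_project(update: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """Remove from `update` any component lying in the span of `keys`,
    so the refreshed edit is orthogonal to existing memory directions."""
    Q, _ = torch.linalg.qr(keys.T)        # orthonormal basis of the key directions
    return update - update @ Q @ Q.T      # subtract the projection onto that basis

keys = torch.randn(3, 16)                 # 3 existing memory key vectors (dim 16)
update = torch.randn(16)                  # a proposed refresh direction
safe = nullspace_project(update, keys)
print(torch.allclose(safe @ keys.T, torch.zeros(3), atol=1e-4))  # True
```

The projected update has zero inner product with every existing key, which is the sense in which orthogonality is "guaranteed."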
The key problem was a hard capacity ceiling: the 8B model sustains 0.92 recall up to 13 facts, then crashes to 0.57 at fact 14 — a sharp phase transition, not gradual decay. And LoRA consolidation was blocked by what I call the "alignment tax": RLHF training fights back against injected knowledge (a 37% recall loss on the 8B model from a single LoRA pass).
The fix: per-fact graduated consolidation. Each fact independently tracks its own stage and advances only when LoRA proves it has absorbed that specific fact. A dissolution schedule (1.0 → 0.5 → 0.1 → 0.0) gradually removes the MEMIT edit as LoRA takes over. And cumulative fusing — training each cycle on the already-fused model — reduces the alignment tax from catastrophic to negligible (starting training loss drops from 2.91 to 0.62 by cycle 2).
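The per-fact bookkeeping can be sketched in plain Python. The dissolution schedule (1.0 → 0.5 → 0.1 → 0.0) comes from the text; everything else — the `Fact` class, the `try_advance` signature, and the absorption check being passed in as a boolean — is an invented simplification (a real system would probe LoRA recall for the specific fact).

```python
from dataclasses import dataclass

# MEMIT edit strength at each consolidation stage (from the schedule above)
DISSOLUTION = [1.0, 0.5, 0.1, 0.0]

@dataclass
class Fact:
    text: str
    stage: int = 0  # index into DISSOLUTION; advances independently per fact

    @property
    def memit_strength(self) -> float:
        return DISSOLUTION[self.stage]

    def try_advance(self, lora_absorbed: bool) -> None:
        """Advance only when LoRA has proven it absorbed *this* fact."""
        if lora_absorbed and self.stage < len(DISSOLUTION) - 1:
            self.stage += 1

facts = [Fact("Aria lives in Portland"), Fact("The cafe opens at 7am")]
facts[0].try_advance(lora_absorbed=True)
print(facts[0].memit_strength)  # 0.5: edit half-dissolved as LoRA takes over
print(facts[1].memit_strength)  # 1.0: unabsorbed fact stays fully injected
```

Because each fact carries its own stage, a stubborn fact can sit at full MEMIT strength for several cycles while its neighbors dissolve on schedule — which is exactly what makes the buffer renewable.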
Results on Llama 3.1 8B (4-bit, 2×H100):

- 100% advancement rate at 5/10/15/20 facts
- 1.00 chat recall at all scales
- MEMIT edits dissolve on schedule, making the buffer renewable
- Effective lifetime capacity: unbounded
There's also a biological curiosity: individual facts consolidate at different rates. One synthetic fact ("Aria lives in Portland") is consistently the hardest across every run — some memories are just harder to absorb, same as in biological systems.
Six papers document the full journey from the initial LoRA prototype to this result: https://doi.org/10.5281/zenodo.18779159
Built with: Python, PyTorch, PEFT, BitsAndBytes, Llama 3.1. Runs on a MacBook Air (3B) or an H100 (8B/70B).