(In the TinyStories paper, the 22M model reaches a loss of ~2.4 after 20k steps; for the 33M model it is, as expected, lower, ~1.8–2.0.)
After converting to bits-per-character, E8-LILA shows a significantly better result (0.128 bpc vs. 0.742 bpc for TinyStories-33M). (bpc calculation: loss / (ln(2) × average token length); the average token length is ≈ 4.5 characters for BPE‑2048 and ≈ 3.5 characters for a 10k vocabulary.)
(All of these are approximate values obtained by averaging over the corpus; the average token length may vary slightly depending on the specific corpus.)
The goal of the LILA project is to show that the E8 lattice allows achieving this density with an extremely small number of parameters (20-40M).
Today I started training a new model with geometric attention (Leech Lattice Lila, ~20M parameters, work in progress). At step 40,000 the best validation loss is 0.4018, which gives PPL = exp(0.4018) ≈ 1.49. This is almost identical to E8 (1.43), but E8 reaches that loss only after 100,000+ steps, while Leech gets there at just 40k. Leech trains faster with fewer parameters (≈20M vs. 40M for E8).
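As a quick sanity check on the loss-to-perplexity conversion (perplexity is just the exponential of the mean cross-entropy in nats):

```python
import math

# Validation loss is mean cross-entropy per token, in nats.
loss = 0.4018              # Leech-Lila best validation loss at step 40k
ppl = math.exp(loss)       # perplexity = e^loss
print(round(ppl, 2))       # 1.49
```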
Converting to bits-per-character for objectivity:
TinyStories-33M (estimate): loss ≈ 1.8, average token length for 10k vocab ≈ 3.5 characters. bpc = 1.8 / (0.6931 * 3.5) ≈ 1.8 / 2.426 ≈ 0.742 bits/character.
Leech-Lila: loss = 0.4018, average token length for BPE-2048 ≈ 4.5 characters. bpc = 0.4018 / (ln(2) * 4.5) ≈ 0.4018 / (0.6931 * 4.5) ≈ 0.4018 / 3.119 ≈ 0.129 bits/character.
E8-LILA (estimate): loss = 0.36, average token length for BPE-2048 ≈ 4.5. bpc = 0.36 / (0.6931 * 4.5) ≈ 0.36 / 3.119 ≈ 0.115 bits/character.
Thus, Leech‑Lila (0.129 bpc) is nearly catching up to E8 (0.115 bpc), but with fewer parameters and faster. Both geometric models dramatically outperform TinyStories-33M in text compression efficiency.
Therefore, the geometric models (E8, Leech) compress text roughly 6× better (bpc 0.115–0.129 vs. 0.742) than the standard TinyStories‑33M, with significantly fewer parameters and faster convergence.
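The conversions above can be reproduced in a few lines. Note that the average token lengths (3.5 and 4.5 characters) are the corpus-level estimates quoted in the post, not measured values:

```python
import math

def bits_per_char(loss_nats: float, avg_token_len: float) -> float:
    """Convert per-token cross-entropy (nats) to bits per character."""
    return loss_nats / (math.log(2) * avg_token_len)

# TinyStories-33M: 10k vocab, ~3.5 chars/token
print(round(bits_per_char(1.8, 3.5), 3))     # 0.742
# Leech-Lila: BPE-2048, ~4.5 chars/token
print(round(bits_per_char(0.4018, 4.5), 3))  # 0.129
# E8-LILA: BPE-2048, ~4.5 chars/token
print(round(bits_per_char(0.36, 4.5), 3))    # 0.115
```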
bootstraptor•1h ago
LeechConfig – holds hyperparameters (vocab size, model dimension, number of layers/heads, etc.) and asserts that head_dim is a multiple of 24.
generate_leech_kernel() – returns a 24×24 orthogonal matrix (placeholder; can be replaced with actual lattice vectors).
LeechAttention – multi‑head attention where Q and K are transformed by the frozen block‑diagonal Leech matrix.
LeechResonanceLoss – combines standard cross‑entropy with the geometric resonance loss.
LeechBlock – a pre‑norm transformer block with LeechAttention and a feed‑forward network.
LeechTransformer – the full model with token/position embeddings, stacked blocks, final norm, and language modelling head.
DreamDecoder – evaluates the resonance of a hidden state against the Leech basis.
leech_generate() – generates tokens step‑by‑step, printing resonance values and status if desired.
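A minimal PyTorch sketch of the LeechAttention piece described above, under the post's own assumptions: the kernel here is the random-orthogonal placeholder (not actual Leech lattice vectors), and the internals beyond the listed names are guessed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def generate_leech_kernel(seed: int = 0) -> torch.Tensor:
    """Placeholder: a 24x24 orthogonal matrix via QR of a random matrix.
    The real version would use Leech lattice vectors instead."""
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(24, 24, generator=g))
    return q

class LeechAttention(nn.Module):
    """Multi-head attention with Q and K transformed by a frozen
    block-diagonal Leech kernel (hypothetical reconstruction)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        head_dim = d_model // n_heads
        assert head_dim % 24 == 0, "head_dim must be a multiple of 24"
        self.n_heads, self.head_dim = n_heads, head_dim
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Block-diagonal kernel: head_dim/24 copies of the 24x24 block,
        # registered as a buffer so it stays frozen during training.
        blocks = head_dim // 24
        kernel = torch.block_diag(*([generate_leech_kernel()] * blocks))
        self.register_buffer("kernel", kernel)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(z: torch.Tensor) -> torch.Tensor:
            # (b, t, d) -> (b, heads, t, head_dim)
            return z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        # Rotate Q and K into the (placeholder) Leech basis before attention.
        q, k = q @ self.kernel, k @ self.kernel
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, d))
```

Since the kernel is orthogonal, rotating Q and K preserves their dot products exactly; the geometric effect would come from the specific lattice basis the real `generate_leech_kernel()` returns.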