I’ve been working on a 7B sparse Mixture-of-Experts prototype that can actually run on consumer hardware. For example, on a Colab T4 it uses around 5 GB RAM and 5 GB VRAM during training, and roughly 3.5–5 GB for inference.
A couple of things I spent a lot of time on:
Routing (SmartRouter)
I tried to tackle routing collapse in a practical way. Instead of letting all tokens dump into a few "favorite" experts, I combined several things: a load-balancing loss, an entropy bonus to keep the distribution flat, jitter noise during training, and a learnable temperature. It works surprisingly well at keeping a good portion of the experts active. I’ve open-sourced the router code (hive_router.py) if anyone wants to look at the math or grab it for their project.
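To make the combination concrete, here is a minimal sketch of a router using those four tricks together. This is not the actual hive_router.py — the class name, coefficients, and the exact load-balancing formulation (Switch-Transformer style) are all my assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRouter(nn.Module):
    """Illustrative MoE router combining four anti-collapse tricks:
    load-balancing loss, entropy bonus, jitter noise, learnable temperature.
    Hypothetical sketch, not the real hive_router.py."""

    def __init__(self, d_model, n_experts, jitter=0.01):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        # Learnable temperature: the model can sharpen or flatten
        # its own routing distribution during training.
        self.log_temp = nn.Parameter(torch.zeros(1))
        self.jitter = jitter

    def forward(self, x, top_k=2):
        # Jitter noise (train-time only) discourages deterministic
        # "favorite expert" ruts.
        if self.training:
            x = x * (1.0 + self.jitter * torch.randn_like(x))
        logits = self.gate(x) / self.log_temp.exp()
        probs = F.softmax(logits, dim=-1)

        top_val, top_idx = probs.topk(top_k, dim=-1)

        # Load-balancing loss (Switch-Transformer style, assumed here):
        # penalize the product of the fraction of tokens whose top-1
        # choice is each expert and the mean router probability.
        n_experts = probs.shape[-1]
        frac_tokens = F.one_hot(top_idx[..., 0], n_experts).float().mean(0)
        mean_probs = probs.mean(0)
        lb_loss = n_experts * (frac_tokens * mean_probs).sum()

        # Entropy bonus: subtract this from the total loss to keep
        # the routing distribution flat early in training.
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()

        return top_idx, top_val, lb_loss, entropy
```

In use, the total loss would look something like `task_loss + a * lb_loss - b * entropy` for small coefficients `a`, `b` (values are tuning choices, not something stated in the post).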
Foundation Curriculum Training (FCT)
Before standard pretraining, I run the model through structured reasoning patterns — currently 290 of them across 14 cognitive domains. Each pattern follows a strict sequence: OBSERVE → PRIOR → UPDATE → RIPPLE → ANALOGY → ACT.
To make this actually run on my setup, I rely on two specific tricks. First, a Target-Only Loss: I mask out the tags and inputs and only calculate gradients on the actual reasoning payloads like UPDATE or ACT. Second, I had to write a custom SparseExpertAdamW that only instantiates optimizer states for the experts that are actually active on that step. Without this, the optimizer states for 20,480 experts would have absolutely crushed my RAM.
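The Target-Only Loss part is easy to sketch: it's the standard trick of setting non-payload label positions to -100, which PyTorch's `cross_entropy` ignores by default. The token ids and mask below are made up for illustration; how the real FCT data marks payload spans is not shown in the post:

```python
import torch
import torch.nn.functional as F

def target_only_labels(token_ids, payload_mask):
    """Copy token ids as labels, but set every non-payload position
    (tags like OBSERVE/PRIOR, and the inputs) to -100 so that
    cross_entropy skips them via its default ignore_index."""
    labels = token_ids.clone()
    labels[~payload_mask] = -100
    return labels

# Toy example: 6 tokens, only positions 3..5 are the reasoning payload
# (say, the UPDATE/ACT text). Ids and vocab size are hypothetical.
ids = torch.tensor([[10, 11, 12, 42, 43, 44]])
mask = torch.tensor([[False, False, False, True, True, True]])
labels = target_only_labels(ids, mask)

logits = torch.randn(1, 6, 100)  # (batch, seq, vocab) from some model
loss = F.cross_entropy(logits.view(-1, 100), labels.view(-1),
                       ignore_index=-100)
```

Gradients then flow only through the payload positions, so the model isn't spending capacity learning to reproduce the scaffolding tags.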
So far I’ve completed 5 of the 14 domains. One cool thing: every new domain starts at a lower loss than the previous one did (for example, the Systems domain dropped from 2.149 to 0.941), so cross-domain transfer seems to actually be happening.
The architecture in short:
d_model = 2048
10 layers (5 Dense Core + 5 Fusion)
20,480 experts (8 domains × 2560)
Dynamic Top-K (2–4)
memory-mapped weights + Dopamine Learning v1
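For the Dynamic Top-K (2–4) item, one plausible way to pick k per token is to keep adding experts until the cumulative routing probability passes a threshold. The post doesn't say how k is actually chosen, so the selection rule and threshold below are purely my assumptions:

```python
import torch
import torch.nn.functional as F

def dynamic_top_k(router_probs, k_min=2, k_max=4, threshold=0.5):
    """Per-token k in [k_min, k_max]: take the smallest k whose
    cumulative routing probability exceeds `threshold`.
    The rule and threshold are assumptions, not the post's method."""
    sorted_probs, sorted_idx = router_probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    # Count how many experts fall below the threshold, then add one
    # more to cross it; clamp into the allowed [k_min, k_max] range.
    k = (cum < threshold).sum(dim=-1) + 1
    k = k.clamp(k_min, k_max)
    return sorted_idx, k

# Confident tokens get k near 2, uncertain tokens get k near 4.
probs = F.softmax(torch.randn(4, 16), dim=-1)
idx, k = dynamic_top_k(probs)
```

The appeal of a rule like this is that easy tokens stay cheap (2 experts) while ambiguous tokens get more capacity (up to 4), which fits the consumer-hardware budget described above.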
The model is up on HuggingFace: https://huggingface.co/OpenSynapseLabs/arche3-7b
Benchmarks & graphs are on GitHub: https://github.com/OpenSynapseLabs/arche3-benchmarks
Limitations (to be honest): I haven’t run standard benchmarks yet (MMLU, GSM8K, HumanEval), only 5/14 FCT domains are done, and the dataset is still small and needs proper scaling. Plus, this is a solo project so far. I did use Gemini and Claude to speed up parts of the implementation, but the architecture and core ideas are my own.
I’d really appreciate any feedback, especially if you’re into routing in MoE models, curriculum pretraining, or scaling this further (thinking about 35B next).
My main goal is to build systems that amplify human thinking, not replace it. If that sounds like something you'd want to mess around with or contribute to, feel free to reach out at opensynapselabs@proton.me. I'm happy to share more details and the private repo.
Thanks for reading!