I’ve been working on a 7B sparse Mixture-of-Experts prototype that can actually run on consumer hardware. For example, on a Colab T4 it uses around 5 GB RAM and 5 GB VRAM during training, and roughly 3.5–5 GB for inference.
A couple of things I spent a lot of time on:
Routing (SmartRouter)
I tried to tackle routing collapse in a practical way. Instead of letting all tokens dump into a few "favorite" experts, I combined several things: a load-balancing loss, an entropy bonus to keep the distribution flat, jitter noise during training, and a learnable temperature. It works surprisingly well at keeping a good portion of the experts active. I’ve open-sourced the router code (hive_router.py) if anyone wants to look at the math or grab it for their project.
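To make the combination concrete, here is a minimal sketch of a router using those four tricks together. This is not the actual hive_router.py — the class name, coefficients, and the exact load-balancing formulation (Switch-Transformer style) are all my assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRouter(nn.Module):
    """Illustrative MoE router combining four anti-collapse tricks:
    load-balancing loss, entropy bonus, jitter noise, learnable temperature.
    Hypothetical sketch, not the real hive_router.py."""

    def __init__(self, d_model, n_experts, jitter=0.01):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        # Learnable temperature: the model can sharpen or flatten
        # its own routing distribution during training.
        self.log_temp = nn.Parameter(torch.zeros(1))
        self.jitter = jitter

    def forward(self, x, top_k=2):
        # Jitter noise (train-time only) discourages deterministic
        # "favorite expert" ruts.
        if self.training:
            x = x * (1.0 + self.jitter * torch.randn_like(x))
        logits = self.gate(x) / self.log_temp.exp()
        probs = F.softmax(logits, dim=-1)

        top_val, top_idx = probs.topk(top_k, dim=-1)

        # Load-balancing loss (Switch-Transformer style, assumed here):
        # penalize the product of the fraction of tokens whose top-1
        # choice is each expert and the mean router probability.
        n_experts = probs.shape[-1]
        frac_tokens = F.one_hot(top_idx[..., 0], n_experts).float().mean(0)
        mean_probs = probs.mean(0)
        lb_loss = n_experts * (frac_tokens * mean_probs).sum()

        # Entropy bonus: subtract this from the total loss to keep
        # the routing distribution flat early in training.
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()

        return top_idx, top_val, lb_loss, entropy
```

In use, the total loss would look something like `task_loss + a * lb_loss - b * entropy` for small coefficients `a`, `b` (values are tuning choices, not something stated in the post).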
Foundation Curriculum Training (FCT)
Before standard pretraining, I run the model through structured reasoning patterns — currently 290 of them across 14 cognitive domains. Each pattern follows a strict sequence: OBSERVE → PRIOR → UPDATE → RIPPLE → ANALOGY → ACT.
To make this actually run on my setup, I rely on two specific tricks. First, a Target-Only Loss: I mask out the tags and inputs and only calculate gradients on the actual reasoning payloads like UPDATE or ACT. Second, I had to write a custom SparseExpertAdamW that only instantiates optimizer states for the experts that are actually active on that step. Without this, the optimizer states for 20,480 experts would have absolutely crushed my RAM.
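The Target-Only Loss part is easy to sketch: it's the standard trick of setting non-payload label positions to -100, which PyTorch's `cross_entropy` ignores by default. The token ids and mask below are made up for illustration; how the real FCT data marks payload spans is not shown in the post:

```python
import torch
import torch.nn.functional as F

def target_only_labels(token_ids, payload_mask):
    """Copy token ids as labels, but set every non-payload position
    (tags like OBSERVE/PRIOR, and the inputs) to -100 so that
    cross_entropy skips them via its default ignore_index."""
    labels = token_ids.clone()
    labels[~payload_mask] = -100
    return labels

# Toy example: 6 tokens, only positions 3..5 are the reasoning payload
# (say, the UPDATE/ACT text). Ids and vocab size are hypothetical.
ids = torch.tensor([[10, 11, 12, 42, 43, 44]])
mask = torch.tensor([[False, False, False, True, True, True]])
labels = target_only_labels(ids, mask)

logits = torch.randn(1, 6, 100)  # (batch, seq, vocab) from some model
loss = F.cross_entropy(logits.view(-1, 100), labels.view(-1),
                       ignore_index=-100)
```

Gradients then flow only through the payload positions, so the model isn't spending capacity learning to reproduce the scaffolding tags.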
So far I’ve completed 5 of the 14 domains. One cool thing: every new domain starts at a lower loss than the previous one did (for example, the Systems domain dropped from 2.149 to 0.941), so cross-domain transfer seems to actually be happening.
The architecture in short:
d_model = 2048
10 layers (5 Dense Core + 5 Fusion)
20,480 experts (8 domains × 2560)
Dynamic Top-K (2–4)
memory-mapped weights + Dopamine Learning v1
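For the Dynamic Top-K (2–4) item, one plausible way to pick k per token is to keep adding experts until the cumulative routing probability passes a threshold. The post doesn't say how k is actually chosen, so the selection rule and threshold below are purely my assumptions:

```python
import torch
import torch.nn.functional as F

def dynamic_top_k(router_probs, k_min=2, k_max=4, threshold=0.5):
    """Per-token k in [k_min, k_max]: take the smallest k whose
    cumulative routing probability exceeds `threshold`.
    The rule and threshold are assumptions, not the post's method."""
    sorted_probs, sorted_idx = router_probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    # Count how many experts fall below the threshold, then add one
    # more to cross it; clamp into the allowed [k_min, k_max] range.
    k = (cum < threshold).sum(dim=-1) + 1
    k = k.clamp(k_min, k_max)
    return sorted_idx, k

# Confident tokens get k near 2, uncertain tokens get k near 4.
probs = F.softmax(torch.randn(4, 16), dim=-1)
idx, k = dynamic_top_k(probs)
```

The appeal of a rule like this is that easy tokens stay cheap (2 experts) while ambiguous tokens get more capacity (up to 4), which fits the consumer-hardware budget described above.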
The model is up on HuggingFace: https://huggingface.co/OpenSynapseLabs/arche3-7b
Benchmarks & graphs are on GitHub: https://github.com/OpenSynapseLabs/arche3-benchmarks
Limitations (to be honest): I haven’t run standard benchmarks yet (MMLU, GSM8K, HumanEval), only 5/14 FCT domains are done, and the dataset is still small and needs proper scaling. Plus, this is a solo project so far. I did use Gemini and Claude to speed up parts of the implementation, but the architecture and core ideas are my own.
I’d really appreciate any feedback, especially if you’re into routing in MoE models, curriculum pretraining, or scaling this further (thinking about 35B next).
My main goal is to build systems that amplify human thinking, not replace it. If that sounds like something you'd want to mess around with or contribute to, feel free to reach out at opensynapselabs@proton.me. I'm happy to share more details and the private repo.
Thanks for reading!