I built a sparse MoE to train ML bots for Color Switch before I knew what one was. LSTM networks trained via PPO would overfit to obstacle subsets and fail to generalize. Routing inputs through clustered ensembles fixed it.
The Problem
Color Switch is a mobile game where players navigate obstacles by matching colors. I trained bots using Unity ML-Agents with LSTM networks.
Individual networks would learn to pass ~30% of obstacles, then plateau and fail on the rest. Retraining from scratch just produced networks that learned different subsets. No single network generalized.
The Architecture
1. Cluster obstacles by feature vectors
Each obstacle had metadata (colors, collider counts, rotation speeds, size) that I encoded as a min-max scaled feature vector.
K-means on these vectors naturally grouped visually and mechanically similar obstacles.
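A minimal sketch of the clustering step, with scikit-learn standing in for the actual implementation; the metadata fields and values here are illustrative, not the real obstacle table:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical obstacle metadata rows:
# [num_colors, collider_count, rotation_speed, size]
obstacles = np.array([
    [4, 1, 90.0, 1.0],   # rotating circle
    [4, 4, 120.0, 1.5],  # multi-ring
    [2, 2, 0.0, 0.8],    # static gate
    [4, 3, 60.0, 2.0],   # slow large wheel
])

# Min-max scale so no single feature dominates the distance metric
features = MinMaxScaler().fit_transform(obstacles)

# K-means groups mechanically similar obstacles; k chosen by inspection
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
cluster_of = {oid: int(label) for oid, label in enumerate(kmeans.labels_)}
```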
2. Train one ensemble per cluster
One ensemble (several LSTMs) per cluster, each ensemble trained independently of the others.
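Roughly how the training jobs were organized; `train_lstm_policy` is a hypothetical stand-in for one ML-Agents PPO run restricted to a cluster's obstacles:

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

ENSEMBLE_SIZE = 3  # hypothetical; "multiple LSTMs" per cluster

def train_lstm_policy(job):
    """Hypothetical stand-in for one ML-Agents PPO run on the
    obstacles of a single cluster; returns a trained policy handle."""
    cluster_id, seed = job
    return f"policy(cluster={cluster_id}, seed={seed})"  # placeholder

def train_all_ensembles(cluster_ids):
    jobs = [(c, s) for c in cluster_ids for s in range(ENSEMBLE_SIZE)]
    ensembles = defaultdict(list)
    # Runs are independent, so clusters (and ensemble members) train in parallel
    with ProcessPoolExecutor() as pool:
        for (cluster_id, _), policy in zip(jobs, pool.map(train_lstm_policy, jobs)):
            ensembles[cluster_id].append(policy)
    return ensembles

if __name__ == "__main__":
    print(train_all_ensembles(cluster_ids=[0, 1]))
```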
3. Route inputs to correct ensemble
At inference:
Identify approaching obstacle via spatial hash (O(1) lookup)
Look up obstacle's cluster ID
Route observations to corresponding ensemble
Weighted average of outputs → action
The router was a cached lookup table: no learned routing, just precomputed K-means assignments.
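A minimal sketch of that path end to end, assuming each policy is a plain callable from observation to action output; the positions, names, and weights are all illustrative:

```python
import numpy as np

CELL_SIZE = 2.0  # hypothetical grid resolution for the spatial hash

def cell_key(x, y):
    return (int(x // CELL_SIZE), int(y // CELL_SIZE))

# Built once at level load: grid cell -> obstacle id (illustrative positions)
obstacle_positions = {0: (0.0, 4.0), 1: (0.0, 12.0), 2: (0.0, 20.0)}
spatial_hash = {cell_key(x, y): oid for oid, (x, y) in obstacle_positions.items()}

def act(player_pos, observation, cluster_of, ensembles, member_weights):
    # 1. O(1) lookup of the approaching obstacle
    #    (assumes the player's cell maps to exactly one obstacle)
    oid = spatial_hash[cell_key(*player_pos)]
    # 2. Cached routing table: precomputed K-means assignment, nothing learned
    cluster_id = cluster_of[oid]
    # 3. Run every LSTM in that cluster's ensemble on the observation
    outputs = np.stack([policy(observation) for policy in ensembles[cluster_id]])
    # 4. Weighted average of member outputs -> final action
    return np.average(outputs, axis=0, weights=member_weights[cluster_id])
```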
What Worked
Generalization: A bot trained on Classic mode played 5 other modes without retraining. None of my previous architectures achieved this.
Modular retraining: New obstacle in a cluster? Retrain one ensemble. Underperforming network? Retrain just that network. Ensembles trained in parallel.
Emergent disentanglement: I now think of the clustering as disentangling the input manifold at a coarse level before the networks learned finer representations. Obstacles with similar dynamics got processed together, so a network didn't have to learn "this is a circle thing" and "how to pass circle things" simultaneously.
What Didn't Work
Random speed changes: Obstacles that changed speed mid-interaction broke the bots. The architecture helped but didn't solve this.
Superhuman performance: Never achieved. Ceiling was "good human player."
Connection to Transformer MoEs
I didn't know this was even called a sparse MoE until the GPT-4 leak. Same pattern: input arrives → router selects expert(s) → outputs combined.
DeepSeek's MoE paper describes "centroids" as expert identifiers with cosine similarity routing. Mine used Euclidean distance to K-means centroids. Same idea, less sophisticated.
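To make the comparison concrete, here are the two routing rules side by side on the same centroid matrix (illustrative code, not either implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
centroids = rng.random((8, 16))  # 8 experts/clusters, 16-dim features
x = rng.random(16)               # one input feature vector

# My router: nearest K-means centroid by Euclidean distance
euclidean_choice = np.argmin(np.linalg.norm(centroids - x, axis=1))

# DeepSeek-style affinity: cosine similarity to expert "centroids"
cos = (centroids @ x) / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(x))
cosine_choice = np.argmax(cos)
```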
Takeaways
Routing to specialized sub-networks based on input similarity works without transformers
K-means on feature vectors produces surprisingly good routing clusters
Modular architectures enable incremental retraining
Generalization improved when I stopped training one network to handle everything
Happy to answer implementation questions.