I have rewritten it to be much simpler, from 951 lines of code to just 364, with no loss in core functionality or generation quality.
I also added a tiny GPT implementation as a comparison (inspired by Andrej Karpathy's code). The two model implementations are ~80% identical; the core differences are the `generate` and `get_batch` functions (sketched below). The model architecture, training loop, etc., differ by only 19 lines of code.
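For intuition, here is a minimal sketch of how those two functions typically differ between an autoregressive GPT and a masked-diffusion language model. This is not the repo's actual code: `MASK_ID`, `BLOCK`, the per-sample mask rate, and the assumption that `model(x)` returns per-position logits of shape `(batch, length, vocab)` are all illustrative.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token (assumption, not the repo's value)
BLOCK = 64   # hypothetical context length

# GPT batches: targets are the inputs shifted one position to the left.
def get_batch_gpt(data, batch_size):
    ix = torch.randint(len(data) - BLOCK - 1, (batch_size,))
    x = torch.stack([data[i : i + BLOCK] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + BLOCK] for i in ix])
    return x, y

# Diffusion batches: targets are the clean tokens; inputs have a random fraction masked.
def get_batch_diffusion(data, batch_size):
    ix = torch.randint(len(data) - BLOCK, (batch_size,))
    y = torch.stack([data[i : i + BLOCK] for i in ix])
    rate = torch.rand(batch_size, 1)  # per-sample mask rate in [0, 1)
    x = y.masked_fill(torch.rand(batch_size, BLOCK) < rate, MASK_ID)
    return x, y

# GPT generation: append one sampled token per forward pass, left to right.
@torch.no_grad()
def generate_gpt(model, x, new_tokens):
    for _ in range(new_tokens):
        logits = model(x[:, -BLOCK:])           # (B, T, vocab), assumed output
        probs = logits[:, -1].softmax(dim=-1)   # distribution over the next token
        x = torch.cat([x, torch.multinomial(probs, 1)], dim=1)
    return x

# Diffusion generation: start fully masked, commit the most confident
# predictions a few positions at a time.
@torch.no_grad()
def generate_diffusion(model, batch_size, steps):
    per_step = BLOCK // steps                   # assumes steps divides BLOCK evenly
    x = torch.full((batch_size, BLOCK), MASK_ID, dtype=torch.long)
    for _ in range(steps):
        logits = model(x)                       # predicts every position in parallel
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(x != MASK_ID, -1.0)  # only unmask still-masked slots
        idx = conf.topk(per_step, dim=-1).indices
        x.scatter_(1, idx, pred.gather(1, idx))
    return x
```

The structural difference this tries to show: the GPT commits exactly one token per forward pass, while the diffusion model predicts all masked positions in parallel each step and only commits the ones it is most confident about.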
Trained weights are included, so you can clone the repo and run it locally. The GPT model is slightly more coherent, but the diffusion model's output quality is solid for its size.