Show HN: I trained a language model that thinks the capital of Japan is Paris

https://hamiltonianresearch.xyz/blog/hr-diffuse-1.html

14•farisallafi•5h ago

Comments

farisallafi•5h ago

Author here. This is hr-diffuse-1-nano: bidirectional Mamba-2 + LLaDA-style masked diffusion at 288M params, cross-arch distilled from SmolLM-135M, trained on 1xH100 for ~$500.

The honest headline results: 14% infill recovery where autoregressive models score ~0 (they can't condition on text after the blank), 7.5% repetition-loop rate vs 37.5% for the teacher, and a genuinely negative result I think is the most useful part: six different self-correction methods all failed at this scale, while a 300k-param external critic head detects errors far above chance. Small models don't doubt; they rationalize.
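For readers new to masked diffusion, here is a minimal sketch of the infill setup in plain Python. It is not the hr-diffuse-1-nano code; the `MASK` string and the function names `mask_tokens` and `mask_span` are hypothetical and chosen only for illustration. It shows the LLaDA-style forward corruption (each token is masked independently with probability `t`) and why infill gives a bidirectional model context on both sides of the blank, which a left-to-right model cannot use.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not the model's real vocabulary entry


def mask_tokens(tokens, t, rng):
    """Forward corruption: replace each token with MASK independently
    with probability t (t=0 keeps everything, t=1 masks everything)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def mask_span(tokens, start, end):
    """Infill setup: blank out tokens[start:end] and keep the context
    on both the left and the right of the blank."""
    return tokens[:start] + [MASK] * (end - start) + tokens[end:]


def masked_positions(tokens):
    """Indices the denoiser would be asked to predict."""
    return [i for i, tok in enumerate(tokens) if tok == MASK]


if __name__ == "__main__":
    sentence = "the capital of Japan is Tokyo".split()
    blanked = mask_span(sentence, 3, 4)
    # A bidirectional denoiser sees "is Tokyo" to the right of the blank;
    # an autoregressive model predicting position 3 only sees the left side.
    print(blanked, masked_positions(blanked))
    print(mask_tokens(sentence, 0.5, random.Random(0)))
```

In training, `t` would typically be sampled per sequence and the loss computed only on the masked positions; the sketch leaves the model and the loss out.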

Weights are open: https://huggingface.co/devnull37/hr-diffuse-1-nano. Happy to answer anything about the architecture or the failed runs.

versteegen•1h ago

Hi!

I'm pretty busy, so I only skimmed the article, but it's actually really interesting, and also informative as I'm not familiar with diffusion models. Maybe I'll ask some questions/write more later. I do want to encourage you, but honestly the websites are a bit over the top and there's no way to know how much human input actually went into them.

Experimental science is very messy, as you've learnt. Agreed with the other commenter, there's value (for others and especially yourself) in writing down what went wrong, and the things in the "Small models cannot judge themselves" section are so reminiscent of failure modes I've experienced myself. There are usually awful or subtle bugs, training just doesn't work, and even if the results are "interesting" rather than "bad", it can still be incredibly difficult to decide what to conclude from them. To distill knowledge from observations/experiments is the problem of science. You read papers where the experiments seem neat and the results profound, but the truth is they're probably a mess too and the evidence for the conclusions is probably a lot weaker than it looks; ML experiments can be unreproducible too.

I suggest that you were running experiments at too large a scale given your resources: you should try to sort out these critical issues on a smaller scale. Yes, the painful problem with ML is that things change qualitatively with scale; you just don't know if a larger scale will fix your issues. But most of these bugs didn't need scale to discover. Think about how you could have more easily discovered them.

Sorry to tell you that your comment was dead (silently blocked, invisible to most users) until I vouched for it. Don't be discouraged from posting on HN. Clearly you're a real person, and clearly you wrote this with an LLM (quite understandably), but people are really put off by text that smells LLM-generated, and it's really easy to tell. HN is flooded with LLM comments lately, and they go dead. You can use an LLM to help write, but don't let it determine the content, be genuine, and make sure the result doesn't read like LLM output. LLMs can write in any style.

hyperbovine•1h ago

I have to ask: the middle paragraph of this comment reads exactly like something that Codex wrote. Exactly. Is that what happened, or have you spent so much time with these models that you started writing like them?

preetham_rangu•5h ago

Really impressive for a 13-year-old, and a refreshingly honest writeup. The failed self-correction section is the best part: six methods tried, six negative results reported instead of buried. That's rarer than the architecture itself. Curious whether the shared+LoRA bidirectionality idea holds up once you run it past 2000 steps.

ungreased0675•54m ago

I would like this a lot more if you wrote it yourself, and if it wasn’t an ask for money.

Playing with agents can get expensive quickly, please be careful.
