The models know more about math, physics, and software than I do, but their intuition is terrible, especially on the physics side. Claude can "get the error relative to observations down to 4 °C" just fine, except it'll totally hack and overfit the physics along the way. Asking subagents to subjectively verify that "the physics is sound, no overfitting" didn't really work either, so I had to review the physics code manually.
The entire model is built from first principles: no machine learning and no observed data at all, apart from fixed inputs such as the sun's radiation and an elevation map. But after a while, it started to feel like "machine learning in slow motion": instead of an ML model training its parameters, Claude and I were choosing parameters by hand. Some amount of parameter tuning (within a physical range of uncertainty) to match observations is inevitable.
The in-app LLM layer also has a tool that evaluates arbitrary math expressions over the simulated data by parsing them into an AST. That was pretty fun to build, too.
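
For a rough idea of what an AST-based expression evaluator can look like, here's a minimal Python sketch. It is not the app's actual code: the `evaluate` function, the whitelist tables, and the variable names (`t_max`, `t_min`, `bias`) are all made up for illustration, and the real tool may work on arrays or use a different language entirely. The point is that the expression is parsed into a tree and only explicitly allowed node types are executed, so the LLM never gets anything like `eval()`.

```python
import ast
import math
import operator

# Whitelists (hypothetical): any operator or function not listed is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_FUNCS = {"abs": abs, "sqrt": math.sqrt, "min": min, "max": max}


def evaluate(expr, variables):
    """Evaluate a math expression by walking its AST instead of calling eval()."""
    tree = ast.parse(expr, mode="eval")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            # Allow plain numbers only (bool is a subclass of int, so exclude it).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError("only numeric constants are allowed")
        if isinstance(node, ast.Name):
            if node.id not in variables:
                raise ValueError(f"unknown variable: {node.id}")
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCS
            and not node.keywords
        ):
            return _FUNCS[node.func.id](*[walk(arg) for arg in node.args])
        # Attribute access, subscripts, lambdas, unknown calls, etc. end up here.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(tree)


# Example with made-up simulated values.
sample = {"t_max": 30.0, "t_min": 10.0, "bias": -1.5}
print(evaluate("(t_max - t_min) / 2 + abs(bias)", sample))  # 11.5
```

Anything outside the whitelist, such as `__import__('os')` or attribute access, raises a `ValueError` rather than running, which is the main reason to walk the tree by hand.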