I built this over 6 months, almost entirely with AI, mostly Opus 4.6 in Claude Code. SF weather made no sense to me (Barely any seasons? September is the warmest month?) and I wanted to understand it better myself. This is a polished version of the app I'd want for myself, adding physics layer by layer to isolate the impact of each piece, and using an LLM to analyze and explain the data.
The models know more about math, physics, and software than I do, but they have terrible intuition, especially on the physics side. Claude can "get the error relative to observations down to 4 °C" just fine, except it'll totally hack and overfit the physics along the way. Having subagents subjectively verify that "the physics is sound, no overfitting" didn't really work either, so I had to review the physics code manually.
The entire model is first principles: no machine learning and no observed data as input, apart from fixed inputs like the sun's radiation and an elevation map. But after a while, it started to feel like "machine learning in slow motion": instead of an ML model training its parameters, Claude and I were choosing parameters by hand. Some amount of parameter tuning (within a physical range of uncertainty) to match observations is inevitable.
The in-app LLM layer has a tool to evaluate arbitrary math expressions over the simulated data using an AST, which was also pretty fun to build.
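For anyone curious what an AST-based evaluator can look like, here is a minimal Python sketch. It is not the app's actual code: the function name `evaluate`, the variable names, and the whitelist of operators and functions are all assumptions made for illustration. The idea is to parse the expression into a syntax tree and walk it, allowing only a small set of node types, so the LLM can do math over named values without ever reaching `eval()`.

```python
import ast
import operator

# Hypothetical whitelist: only these operators and functions are allowed.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_FUNCS = {"min": min, "max": max, "abs": abs}


def evaluate(expr, variables):
    """Evaluate a math expression over named values by walking its AST."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        # Numeric literals only (bool is a subclass of int, so exclude it).
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        # Names resolve only against the supplied variables.
        if isinstance(node, ast.Name):
            if node.id in variables:
                return variables[node.id]
            raise ValueError(f"unknown variable: {node.id}")
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        # Calls are allowed only to whitelisted plain function names.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS
                and not node.keywords):
            return _FUNCS[node.func.id](*(walk(arg) for arg in node.args))
        # Everything else (attributes, subscripts, lambdas, ...) is rejected.
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval"))


# Example with made-up variable names and values:
print(evaluate("(t_max - t_min) / 2", {"t_max": 22.0, "t_min": 14.0}))  # 4.0
```

Because unknown node types fall through to the final `raise`, something like `__import__('os')` is refused rather than executed. A real tool would probably also want to operate on arrays of simulated values and cap expression size, which this sketch leaves out.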