This is based on a loop in which user input (typed commands or mouse clicks) is fed to the LLM, and the LLM is instructed to simply render the next frame, as if it were rendering a frame of video. In this particular case, the "image" is actually static HTML+CSS, because image output from existing LLMs doesn't have high enough text fidelity.
The computation is done by the LLM itself: the LLM does not "write code", it IS the code.
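
To make the loop concrete, here is a minimal sketch in Python. It is not the project's actual implementation: `Event`, `SYSTEM_PROMPT`, `build_prompt`, `run_loop`, and `fake_llm` are all hypothetical names invented for illustration, and the prompt wording is an assumption. The `llm` argument is any function that takes a prompt string and returns an HTML string; `fake_llm` is a deterministic stand-in so the sketch runs without a model.

```python
import html
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Event:
    """One user input. Hypothetical shape, chosen for this sketch."""
    kind: str     # "command" or "click"
    payload: str  # command text, or an identifier for the clicked element


# Assumed prompt wording; the real instructions may differ.
SYSTEM_PROMPT = (
    "You are the application. Given the previous frame and a user event, "
    "output the next frame as static HTML+CSS and nothing else."
)


def build_prompt(previous_frame: str, event: Event) -> str:
    """Combine the instructions, the last frame, and the new event."""
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Previous frame:\n{previous_frame}\n\n"
        f"User event: {event.kind} {event.payload}"
    )


def run_loop(
    llm: Callable[[str], str],
    events: Iterable[Event],
    initial_frame: str,
) -> List[str]:
    """Feed each event to the model and collect every frame it renders."""
    frames = [initial_frame]
    for event in events:
        # No application logic here: the model's output IS the next state.
        frames.append(llm(build_prompt(frames[-1], event)))
    return frames


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes the event line inside a page."""
    last_line = prompt.splitlines()[-1]
    return f"<html><body><p>{html.escape(last_line)}</p></body></html>"


if __name__ == "__main__":
    events = [Event("command", "open notes"), Event("click", "#save")]
    for frame in run_loop(fake_llm, events, "<html><body></body></html>"):
        print(frame)
```

Note that `run_loop` contains no application logic at all. Everything that would normally live in event handlers is delegated to whatever `llm` returns, which is the sense in which the model "is the code".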