An experiment I ran. The models get access to an E2B sandbox and are instructed to create an ad according to the specifications. They can choose whatever tools they want for it (e.g. Pillow, Chromium). The task serves as a proxy for their ability to use tools, create other kinds of images, do complex layouts, etc. Currently Opus 4.8 is on top (not surprising, though it did take 66 conversation turns to create the image) and GLM-5.2 is in fifth place (which I do find surprising, because it doesn't have image capabilities).