My team works on automatic environment generation for RL post-training. One of our projects uses coding agents to build web clones for browser- and computer-use agents (BUAs/CUAs).
We ran Gemini, Claude Code, GLM, and Codex through our harness and benchmarked their ability to recreate a Slack workspace.
Saw a variety of results:
- *Gemini 3 Pro:* Achieved the highest visual score (0.91 SSIM) but lacked interactive functionality.
- *Claude Opus 4.6:* Developed the most complete application, balancing full interactivity with consistent self-correction.
- *GLM-5:* Produced the best code architecture but reached a plateau in visual improvement.
- *GPT-5.3 Codex:* Initialized quickly but entered a five-hour "scaling spiral" that failed to yield further progress.
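For anyone unfamiliar with the visual metric above: SSIM compares luminance, contrast, and structure between two images. Here's a minimal single-window sketch of the formula for illustration only (real implementations, like skimage's `structural_similarity`, slide a small window over the image and average the per-window scores; our harness uses the standard windowed variant):

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, data_range: float = 255.0) -> float:
    """Single-window SSIM between two same-shape grayscale screenshots.

    Illustrative only: production scoring uses the windowed variant,
    but the per-window formula is exactly this.
    """
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast term
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (vx + vy + c2)
    )
```

Identical screenshots score 1.0; any luminance or structural drift pulls the score below that, which is how a clone can land at 0.91 while looking "basically right" to a human.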
Next, we’re planning:
- More web apps for cloning and benchmarking across the models
- More functionality (this trajectory didn't cover the full Slack feature set)
- Better scoring for functionality (to catch failures like Gemini's missing interactivity automatically, rather than by inspection)
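As a rough sketch of the functional scoring we have in mind (the check names here are invented for illustration; each check would drive the clone in a real browser, e.g. via Playwright, and return pass/fail):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FunctionalCheck:
    name: str
    run: Callable[[], bool]  # drives the clone in a browser, returns pass/fail

def functional_score(checks: list[FunctionalCheck]) -> float:
    """Fraction of behavioral checks the clone passes (0.0 if none defined)."""
    if not checks:
        return 0.0
    return sum(1 for c in checks if c.run()) / len(checks)
```

Under a rubric like this, a pixel-perfect but non-interactive clone scores near zero on functionality, so the visual/functional trade-off the models made would show up directly in the benchmark instead of needing manual review.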
Repo: https://github.com/vibrantlabsai/cloning-bench
Blog post: https://vibrantlabs.com/blog/pa-bench