We built a different kind of AI benchmark for UI generation.
Instead of static leaderboards or curated screenshots, you can watch multiple models generate the same design live, side-by-side, and decide which output is actually better.
Under the hood, we call models from Anthropic (Claude Opus), OpenAI (GPT), Google (Gemini), and Moonshot AI (Kimi).
Each model generates a real, editable project using Tailwind CSS (not screenshots or canvas exports). You can export it for Next.js, Laravel (Blade), Symfony (Twig), WordPress, or plain HTML.
What we noticed building this:
* Popular benchmarks don't reflect UI/UX quality. Which model produces the better design varies from prompt to prompt, which is why live comparison on a single screen matters.
* Some models overuse wrapper elements (div soup); others hallucinate layout constraints.
* Kimi sometimes slips into Cyrillic, even when none of the other models do for the same prompt.
The interesting part wasn't ranking models. It was making their outputs easier for humans to compare visually.
Short demo: https://www.youtube.com/watch?v=RCTZlvqMQdc
Curious whether this feels more useful than traditional leaderboard-style AI benchmarks.
Happy to answer technical questions.
Example for HN:
Prompt: Redesign the Hacker News website for 2030, including sample entries that could realistically appear on the platform in that year.
Results: https://shuffle.dev/ai-design/Tjjy7XAFMq25AI
Previews:
Opus: https://shuffle.dev/preview/d6d5ba4eeede381cee7e30c697f010c7...
GPT: https://shuffle.dev/preview/f050359977c1d6dc6c8fc104a24b83c3...
Gemini: https://shuffle.dev/preview/eab78f9748a6d8ccecb94a8b0390f044...
Kimi: https://shuffle.dev/preview/394bb596a8efa50342db4dc88c5f9fab...