We built a different kind of AI benchmark for UI generation.
Instead of static leaderboards or curated screenshots, you can watch multiple models generate the same design live, side-by-side, and decide which output is actually better.
Under the hood, we call models from Anthropic (Claude Opus), OpenAI (GPT), Google (Gemini), and Moonshot AI (Kimi).
Each model generates a real, editable project using Tailwind CSS (not screenshots or canvas exports). You can export it for Next.js, Laravel (Blade), Symfony (Twig), WordPress, or plain HTML.
What we noticed building this:
* Popular benchmarks don't reflect UI/UX quality. Which model produces the better design varies from prompt to prompt, which is why live comparison on a single screen matters.
* Some models overuse wrapper elements (div soup); others hallucinate layout constraints.
* Kimi sometimes slips into Cyrillic, even when none of the other models do for the same prompt.
The interesting part wasn't ranking models. It was making their outputs easier for humans to compare visually.
Short demo: https://www.youtube.com/watch?v=RCTZlvqMQdc
Curious whether this feels more useful than traditional leaderboard-style AI benchmarks.
Happy to answer technical questions.
Example for HN:
Prompt: Redesign the Hacker News website for 2030, including sample entries that could realistically appear on the platform in that year.
Results: https://shuffle.dev/ai-design/Tjjy7XAFMq25AI
Previews:
Opus: https://shuffle.dev/preview/d6d5ba4eeede381cee7e30c697f010c7...
GPT: https://shuffle.dev/preview/f050359977c1d6dc6c8fc104a24b83c3...
Gemini: https://shuffle.dev/preview/eab78f9748a6d8ccecb94a8b0390f044...
Kimi: https://shuffle.dev/preview/394bb596a8efa50342db4dc88c5f9fab...