Howdy, HN. Authors here. We got tired of text-to-image leaderboards that only focus on aesthetics, so we built our own benchmarks to test what matters for real work: fidelity to complex prompts, safety, bias, and IP infringement.
We analyzed 18 models and found that no single model is good at everything. For example, GPT-4o has the best safety guardrails but also a 98% IP infringement rate on celebrity likenesses. Google's Imagen 4 Ultra actively counters bias (e.g., 90% of its "CEOs" are female) but struggles with generating crowds. xAI's Grok 2 blocks almost nothing.
Lots more detail in the post. We'll be here all day to answer questions.
ianchenh•1h ago
Really unique viewpoint. Can't stress enough how rare it is these days for tech startups and companies to emphasize social responsibility and, crucially, its potential to translate into profitability as well! Responsible AI isn't just a constraint on the field - controllability means quality and usability.
jeffreysmith•1h ago