I decided to run a rigorous, blind benchmark across 14 state-of-the-art and local LLMs to objectively measure which model understands WordPress development best. To ensure a perfectly fair test, I started with a completely fresh IDE and zero context for every single generation.
I asked each model to build a "Gravity Forms Live Search" plugin using a minimal, zero-shot prompt. To avoid personal bias, I had Gemini 3.1 Pro blindly grade the anonymized outputs against a strict 100-point rubric, comparing them to my own reference implementation.
Surprising Findings
1. The "Blind Spot" (re-inventing the wheel)
Out of 14 models, exactly zero successfully hooked into the native Gravity Forms search input (#form_list_search). Instead of analyzing the implicit context (the DOM), every single model injected a brand-new, redundant <input> into the page.
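For contrast, reusing the existing field takes only a few lines. A minimal sketch, assuming the #form_list_search id above plus the standard WP list-table row markup (the function name and row selector are my own, not from any model's output):

```javascript
// Sketch: reuse the native Gravity Forms search input instead of injecting one.
// '#form_list_search' is the id mentioned above; the row selector is an
// assumption about standard WP list-table markup.
function attachLiveSearch(doc) {
  const input = doc.querySelector('#form_list_search');
  if (!input) return false; // only if the native field is missing would a fallback be needed

  input.addEventListener('input', () => {
    const query = input.value.toLowerCase();
    doc.querySelectorAll('.wp-list-table tbody tr').forEach((row) => {
      // Hide rows whose text doesn't contain the query.
      row.style.display = row.textContent.toLowerCase().includes(query) ? '' : 'none';
    });
  });
  return true;
}
```

Taking `doc` as a parameter keeps the function testable outside a browser; in the plugin it would simply be called with `document`.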
2. Complete lack of advanced UX foresight
Because it wasn't explicitly asked for, no model anticipated the need for keyboard shortcuts (Ctrl+F), nor did any attempt to update the native item counter as rows were hidden. Zero models implemented background fetching of paginated pages to make the search global.
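To illustrate how little code the shortcut would have taken: a hedged sketch that redirects the browser's find shortcut to the on-page search field. The function name and wiring are illustrative assumptions, not any model's output:

```javascript
// Sketch: redirect Ctrl+F / Cmd+F to the on-page search field instead of
// letting the browser open its own find dialog.
function handleFindShortcut(event, searchInput) {
  if ((event.ctrlKey || event.metaKey) && event.key.toLowerCase() === 'f') {
    event.preventDefault(); // keep the browser find bar from opening
    searchInput.focus();
    return true;
  }
  return false;
}

// Browser wiring (not run here):
// document.addEventListener('keydown', (e) =>
//   handleFindShortcut(e, document.querySelector('#form_list_search')));
```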
3. The Diacritics Separator
Most models used a simple .toLowerCase() for filtering, which breaks on accented characters. Only a select few implemented robust normalization (.normalize('NFD')) to handle diacritics correctly.
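The robust version is small. A sketch of accent-insensitive matching built on the normalization mentioned above (the function names are mine):

```javascript
// NFD decomposition splits an accented letter into its base letter plus
// combining marks (U+0300–U+036F), which can then be stripped before comparing.
function foldDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
}

// Fold both sides so 'resume' matches 'Résumé' and vice versa.
function rowMatches(rowText, query) {
  return foldDiacritics(rowText).includes(foldDiacritics(query));
}
```

With this, `rowMatches('Résumé form', 'resume')` is true, while a naive `.toLowerCase().includes()` comparison misses it.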
4. Local models struggled
Local inference failed to keep up on my low-end hardware (7700X, 64 GB RAM, RX 6700 10 GB). Gemma4-26b underperformed significantly, generating a fatal PHP error and scoring 18/100.
The Standouts
The Winner: Claude 4.7 Opus (68/100). It wrote highly performant JS (caching DOM text, 120ms debounce), handled diacritics perfectly, and used modern WordPress i18n. It stands out as the most capable direct replacement for Copilot Pro Opus.
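For reference, the two performance techniques the winner used look roughly like this. This is a sketch under assumed names; only the 120 ms figure comes from the actual output:

```javascript
// Debounce: collapse a burst of keystrokes into a single filter pass.
// 120 ms is the delay the winning implementation chose.
function debounce(fn, wait = 120) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), wait);
  };
}

// Cache each row's lowercased text once up front, so each keystroke compares
// plain strings instead of re-reading textContent from the DOM.
function buildRowCache(rows) {
  return rows.map((row) => ({ row, text: row.textContent.toLowerCase() }));
}
```

The filter callback would then be wrapped as `debounce(filterRows)` and walk the cache rather than the live DOM.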
The Value King: GLM 5.1 (61/100). GLM secured a notable 2nd place, ahead of Opus 4.6! On OpenRouter, GLM 5.1 ($1.05 in / $3.50 out) is ~3-4x cheaper than Sonnet 4.6 and ~5-7x cheaper than Opus 4.6/4.7, making it a very cost-effective alternative for this task.
The Leaderboard
1. Claude 4.7 Opus plan – 68
2. GLM 5.1 – 61
3. Claude 4.6 Opus plan – 59
4. Mimo v2.5 pro – 58
5. Qwen 3.6+ – 55
6. Sonnet 4.6 – 55
7. Gemini 3.1 pro – 53
8. Kimi K2.6 – 49
9. GPT 5.4 xHigh – 49
10. Gemini 3 flash – 47
11. Claude 4.7 Opus fast – 46
12. Minimax m2.7 – 36
13. Gemma4-e4b (Local rx6700) – 32
14. Gemma4-26b (Local CPU) – 18
Takeaway
Even the best LLMs default to the path of least resistance: "just make it work." If you want native-feeling, fully integrated UX, you cannot rely on the model's implicit knowledge; you have to explicitly prompt for it.
I've published the full leaderboard, the exact prompts used, the detailed scoring grid, and all the generated code in the GitHub repository here: https://github.com/guilamu/llms-wordpress-plugin-benchmark
I will be testing a Level 2 prompt next, feeding the models a WordPress + Gravity Forms reference file to see how they adapt.