I decided to run a rigorous, blind benchmark across 14 state-of-the-art and local LLMs to objectively measure which model understands WordPress development best. To ensure a perfectly fair test, I started with a completely fresh IDE and zero context for every single generation.
I asked each model to build a "Gravity Forms Live Search" plugin using a minimal, zero-shot prompt. To avoid personal bias, I had Gemini 3.1 Pro blindly grade the anonymized outputs against a strict 100-point rubric, comparing them to my own reference implementation.
Surprising Findings
1. The "Blind Spot" (re-inventing the wheel)
Out of 14 models, exactly zero successfully hooked into the native Gravity Forms search input (#form_list_search). Instead of analyzing the implicit context (the DOM), every single model injected a brand-new, redundant <input> into the page.
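For contrast, reusing the existing field takes only a few lines. A minimal sketch, assuming the #form_list_search id above plus the standard WP list-table row markup (the function name and row selector are my own, not from any model's output):

```javascript
// Sketch: reuse the native Gravity Forms search input instead of injecting one.
// '#form_list_search' is the id mentioned above; the row selector is an
// assumption about standard WP list-table markup.
function attachLiveSearch(doc) {
  const input = doc.querySelector('#form_list_search');
  if (!input) return false; // only if the native field is missing would a fallback be needed

  input.addEventListener('input', () => {
    const query = input.value.toLowerCase();
    doc.querySelectorAll('.wp-list-table tbody tr').forEach((row) => {
      // Hide rows whose text doesn't contain the query.
      row.style.display = row.textContent.toLowerCase().includes(query) ? '' : 'none';
    });
  });
  return true;
}
```

Taking `doc` as a parameter keeps the function testable outside a browser; in the plugin it would simply be called with `document`.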
2. Complete lack of advanced UX foresight
Because it wasn't explicitly asked for, no model anticipated the need for keyboard shortcuts (Ctrl+F), nor did any attempt to update the native item counter as rows were hidden. Zero models implemented background fetching of paginated pages to make the search global.
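To illustrate how little code the shortcut would have taken: a hedged sketch that redirects the browser's find shortcut to the on-page search field. The function name and wiring are illustrative assumptions, not any model's output:

```javascript
// Sketch: redirect Ctrl+F / Cmd+F to the on-page search field instead of
// letting the browser open its own find dialog.
function handleFindShortcut(event, searchInput) {
  if ((event.ctrlKey || event.metaKey) && event.key.toLowerCase() === 'f') {
    event.preventDefault(); // keep the browser find bar from opening
    searchInput.focus();
    return true;
  }
  return false;
}

// Browser wiring (not run here):
// document.addEventListener('keydown', (e) =>
//   handleFindShortcut(e, document.querySelector('#form_list_search')));
```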
3. The Diacritics Separator
Most models used a simple .toLowerCase() for filtering, which breaks on accented characters. Only a select few implemented robust normalization (.normalize('NFD')) to handle diacritics correctly.
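The robust version is small. A sketch of accent-insensitive matching built on the normalization mentioned above (the function names are mine):

```javascript
// NFD decomposition splits an accented letter into its base letter plus
// combining marks (U+0300–U+036F), which can then be stripped before comparing.
function foldDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
}

// Fold both sides so 'resume' matches 'Résumé' and vice versa.
function rowMatches(rowText, query) {
  return foldDiacritics(rowText).includes(foldDiacritics(query));
}
```

With this, `rowMatches('Résumé form', 'resume')` is true, while a naive `.toLowerCase().includes()` comparison misses it.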
4. Local models struggled
Local inference failed to keep up on my low-end hardware (7700X, 64 GB RAM, RX 6700 10 GB). Gemma4-26b underperformed significantly, generating a fatal PHP error and scoring 18/100.
The Standouts
The Winner: Claude 4.7 Opus (68/100). It wrote highly performant JS (caching DOM text, 120ms debounce), handled diacritics perfectly, and used modern WordPress i18n. It stands out as the most capable direct replacement for Copilot Pro Opus.
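For reference, the two performance techniques the winner used look roughly like this. This is a sketch under assumed names; only the 120 ms figure comes from the actual output:

```javascript
// Debounce: collapse a burst of keystrokes into a single filter pass.
// 120 ms is the delay the winning implementation chose.
function debounce(fn, wait = 120) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), wait);
  };
}

// Cache each row's lowercased text once up front, so each keystroke compares
// plain strings instead of re-reading textContent from the DOM.
function buildRowCache(rows) {
  return rows.map((row) => ({ row, text: row.textContent.toLowerCase() }));
}
```

The filter callback would then be wrapped as `debounce(filterRows)` and walk the cache rather than the live DOM.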
The Value King: GLM 5.1 (61/100). GLM secured a notable 2nd place, ahead of Opus 4.6! On OpenRouter, GLM 5.1 ($1.05 in / $3.50 out) is ~3-4x cheaper than Sonnet 4.6 and ~5-7x cheaper than Opus 4.6/4.7, making it a very cost-effective alternative for this task.
The Leaderboard
1. Claude 4.7 Opus plan – 68
2. GLM 5.1 – 61
3. Claude 4.6 Opus plan – 59
4. Mimo v2.5 pro – 58
5. Qwen 3.6+ – 55
6. Sonnet 4.6 – 55
7. Gemini 3.1 pro – 53
8. Kimi K2.6 – 49
9. GPT 5.4 xHigh – 49
10. Gemini 3 flash – 47
11. Claude 4.7 Opus fast – 46
12. Minimax m2.7 – 36
13. Gemma4-e4b (Local rx6700) – 32
14. Gemma4-26b (Local CPU) – 18
Takeaway
Even the best LLMs default to the path of least resistance: "just make it work." If you want native-feeling, fully integrated UX, you cannot rely on the model's implicit knowledge; you have to explicitly prompt for it.
I've published the full leaderboard, the exact prompts used, the detailed scoring grid, and all the generated code in the GitHub repository here: https://github.com/guilamu/llms-wordpress-plugin-benchmark
I will be testing a Level 2 prompt next, feeding the models a WordPress + Gravity Forms reference file to see how they adapt.