Show HN: I blind-tested 14 LLMs on a WP plugin task. Surprising Findings

https://github.com/guilamu/llms-wordpress-plugin-benchmark/blob/main/README.md
2•guilamu•1h ago
Recently, GitHub Copilot silently dropped support for Claude Opus on Pro accounts. Since Opus was my go-to model for my daily workflow (developing WordPress plugins), I needed a reliable replacement.

I decided to run a rigorous, blind benchmark across 14 state-of-the-art and local LLMs to objectively measure which model understands WordPress development best. To ensure a perfectly fair test, I started with a completely fresh IDE and zero context for every single generation.

I asked each model to build a "Gravity Forms Live Search" plugin using a minimal, zero-shot prompt. To avoid personal bias, I had Gemini 3.1 Pro blindly grade the anonymized outputs against a strict 100-point rubric, comparing them to my own reference implementation.

Surprising Findings

1. The "Blind Spot" (re-inventing the wheel): Out of 14 models, none hooked into the native Gravity Forms search input (#form_list_search). Instead of analyzing the implicit context (the DOM), every single model injected a brand-new, redundant <input> into the page.

2. Complete lack of advanced UX foresight: Because it wasn't explicitly asked for, no model anticipated the need for a keyboard shortcut (Ctrl+F), and none tried to update the native item counter as rows were hidden. No model implemented background fetching of the other paginated pages to make the search global.

3. The Diacritics Separator: Most models used a simple .toLowerCase() for filtering, which breaks on accented characters. Only a few implemented robust normalization (.normalize('NFD')) to handle diacritics correctly.

4. Local models struggled: Local inference failed to keep up on my low-end hardware (7700X with 64 GB RAM, RX 6700 with 10 GB VRAM). Gemma4-26b underperformed significantly, generating a fatal PHP error and scoring 18/100.
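
To make findings 1 and 2 concrete, here is a minimal sketch of what reusing the native input could look like. It is not taken from the reference implementation or from any model's output. Only #form_list_search comes from the post; the function name attachLiveSearch, the row and counter selectors, and the counter text format are hypothetical placeholders that would need checking against the real Gravity Forms admin markup.

```javascript
// Sketch: reuse the page's existing search box instead of injecting a new <input>.
// '#form_list_search' is the selector named in the post. rowSelector,
// counterSelector and the "N items" text are assumptions for illustration.
function attachLiveSearch(doc, { rowSelector, counterSelector }) {
  const input = doc.querySelector('#form_list_search');
  if (!input) return null; // native input not found: the caller decides on a fallback

  const rows = Array.from(doc.querySelectorAll(rowSelector));
  const counter = counterSelector ? doc.querySelector(counterSelector) : null;

  const applyFilter = () => {
    // Plain case-insensitive match to keep the sketch short;
    // accent handling is deliberately left out here.
    const query = input.value.trim().toLowerCase();
    let visible = 0;
    for (const row of rows) {
      const match = row.textContent.toLowerCase().includes(query);
      row.hidden = !match;
      if (match) visible += 1;
    }
    // Keep the existing counter in sync with what is actually shown.
    if (counter) counter.textContent = `${visible} items`;
    return visible;
  };

  input.addEventListener('input', applyFilter);
  return applyFilter;
}
```

In a browser this would be called as attachLiveSearch(document, { rowSelector: '...', counterSelector: '...' }) once the list table has rendered. Passing the document in as a parameter is only a design choice that makes the function easy to exercise with a stub.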

The Standouts

The Winner: Claude 4.7 Opus (68/100). It wrote highly performant JS (caching DOM text, 120ms debounce), handled diacritics perfectly, and used modern WordPress i18n. It stands out as the most capable direct replacement for Copilot Pro Opus.
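
For readers unfamiliar with the two performance techniques mentioned above (caching DOM text and debouncing), here is a minimal sketch of the general idea. It is not the code Claude generated; the helper names debounce, buildIndex and filterIndex are made up for illustration, and only the 120 ms delay is taken from the post.

```javascript
// Delay calls to fn until waitMs have passed without a new call.
function debounce(fn, waitMs) {
  let timer = null;
  const debounced = (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
  debounced.cancel = () => clearTimeout(timer);
  return debounced;
}

// Read each row's text once and keep the lowercased copy next to the row.
function buildIndex(rows) {
  return rows.map((row) => ({ row, text: row.textContent.toLowerCase() }));
}

// Filter against the cached strings instead of re-reading the DOM per keystroke.
function filterIndex(index, query) {
  const q = query.trim().toLowerCase();
  return index.filter((entry) => entry.text.includes(q)).map((entry) => entry.row);
}

// Browser usage (the selector is a placeholder, not verified markup):
// const index = buildIndex(Array.from(document.querySelectorAll('tr')));
// input.addEventListener('input', debounce(() => {
//   const shown = new Set(filterIndex(index, input.value));
//   for (const { row } of index) row.hidden = !shown.has(row);
// }, 120));
```

The cache trades freshness for speed: if rows are added or edited after buildIndex runs, the index has to be rebuilt.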

The Value King: GLM 5.1 (61/100). GLM took a notable 2nd place, ahead of Opus 4.6! According to OpenRouter pricing, GLM 5.1 ($1.05 in / $3.50 out) is ~3-4x cheaper than Sonnet 4.6 and ~5-7x cheaper than Opus 4.6/4.7, making it a very cost-effective alternative for this task.
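
A rough way to put those GLM 5.1 prices in context for a single generation is sketched below. It assumes the OpenRouter figures are USD per million tokens (the post does not state the unit), and the token counts are made up for the example, not measured from the benchmark.

```javascript
// Prices quoted in the post; the per-million-token unit is an assumption.
const GLM_5_1 = { inputPerM: 1.05, outputPerM: 3.5 };

// Cost of one request in USD for the given token counts.
function requestCostUSD(price, inputTokens, outputTokens) {
  return (inputTokens / 1e6) * price.inputPerM + (outputTokens / 1e6) * price.outputPerM;
}

// Hypothetical request: 2,000 prompt tokens and 8,000 generated tokens.
const exampleCost = requestCostUSD(GLM_5_1, 2000, 8000);
console.log(exampleCost.toFixed(4)); // prints 0.0301
```

The same function can be fed the prices of any other model to reproduce a ratio for a given input/output mix, since the mix changes how much cheaper one model is than another.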

The Leaderboard

1. Claude 4.7 Opus plan – 68

2. GLM 5.1 – 61

3. Claude 4.6 Opus plan – 59

4. Mimo v2.5 pro – 58

5. Qwen 3.6+ – 55

6. Sonnet 4.6 – 55

7. Gemini 3.1 pro – 53

8. Kimi K2.6 – 49

9. GPT 5.4 xHigh – 49

10. Gemini 3 flash – 47

11. Claude 4.7 Opus fast – 46

12. Minimax m2.7 – 36

13. Gemma4-e4b (Local rx6700) – 32

14. Gemma4-26b (Local CPU) – 18

Takeaway

Even the best LLMs default to the path of least resistance: "just make it work." If you want native-feeling, fully integrated UX, you cannot rely on the model's implicit knowledge; you have to explicitly prompt for it.

I've published the full leaderboard, the exact prompts used, the detailed scoring grid, and all the generated code in the GitHub repository here: https://github.com/guilamu/llms-wordpress-plugin-benchmark

Next, I will test a Level 2 prompt, feeding the models a WordPress + Gravity Forms reference file to see how they adapt.

Comments

ronin_niron•2m ago
On point #3: .normalize('NFD') alone isn't enough for search matching. After NFD you still need to strip the combining marks (regex /\p{Mn}/gu, which covers the Combining Diacritical Marks block along with other nonspacing marks), or "café" and "cafe" will never match, because the accent becomes a separate code point that the comparison still sees. Bit of a footgun; nobody writes the full pipeline correctly the first time.
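
The pipeline described in the comment above can be sketched as follows. The function names foldForSearch and matches are made up for illustration; normalize('NFD') and the \p{Mn} Unicode property escape are standard JavaScript. This is one reasonable way to do it, not the code from the reference implementation.

```javascript
// Fold a string for accent-insensitive, case-insensitive comparison.
function foldForSearch(text) {
  return text
    .normalize('NFD')        // decompose: "é" becomes "e" + U+0301
    .replace(/\p{Mn}/gu, '') // drop the nonspacing (combining) marks
    .toLowerCase();
}

// True when needle occurs in haystack, ignoring case and diacritics.
function matches(haystack, needle) {
  return foldForSearch(haystack).includes(foldForSearch(needle));
}

console.log(foldForSearch('Café'));                  // prints cafe
console.log(matches('Crème brûlée', 'creme brulee')); // prints true
console.log('café'.normalize('NFD') === 'cafe');     // prints false: NFD alone keeps the mark
```

Both the row text and the query go through the same fold, so a user typing "café" also finds rows containing "cafe".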
