I'm tired of LLM skill slop, so I built mine with regression tests

3•iliaov•1h ago
I've recently tried skills like Garry Tan's GStack, spent a week with it, and realized it has some flaws (I'll post separately about that).

Here's my problem: how do I know if a skill or prompt is any good (e.g. GStack's /office-hours)?

How do I compare similar skills (e.g. different "deep research" skills)?

Spotting broken software is (relatively) easy: it crashes or prints errors. Broken skills don't. Perfectly polished, confident-sounding skills routinely mislead me and waste my time, to the point where I wish I weren't using an LLM at all.

AI skills are software — and they should come with regression tests.

LLM teams have tons of prompt regression tests. LLM-wrapper SaaS companies have tons of prompt regression tests. Open-source skills are different: a SKILL.md can read reasonably, yet the skill ships with zero tests (e.g. GStack's /office-hours has none at the time of writing).

Garry Tan, if you hear me — please consider shipping regression tests for your /office-hours, /plan-ceo-review, /plan-eng-review, and so on.

Regression tests should:

1. Prove the skill works correctly

2. Demonstrate correct and incorrect usage

3. Prove the skill's value

4. Come with a scoring rubric to allow skill benchmarking

The last one is the most valuable, because it lets you benchmark similar skills against each other.
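To make the scoring-rubric idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Criterion` class, the `score` and `regression_passes` functions, the weights, and the keyword checks are hypothetical and are not taken from plan-cmo-review or GStack. A real rubric would more likely use an LLM judge than substring checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One rubric line: a name, a weight, and a check on the skill's output."""
    name: str
    weight: float
    check: Callable[[str], bool]  # True when the output satisfies the criterion


def score(output: str, rubric: list[Criterion]) -> float:
    """Weighted fraction of rubric criteria the output satisfies (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


def regression_passes(output: str, rubric: list[Criterion], threshold: float = 0.8) -> bool:
    """A fixture passes when its score reaches the (hypothetical) threshold."""
    return score(output, rubric) >= threshold


# Hypothetical rubric for a marketing-review skill; keyword checks stand in for a judge.
RUBRIC = [
    Criterion("names a target customer", 2.0, lambda o: "target customer" in o.lower()),
    Criterion("questions the user's assumptions", 3.0, lambda o: "assumption" in o.lower()),
    Criterion("proposes a distribution channel", 1.0, lambda o: "channel" in o.lower()),
]
```

Because every skill is scored against the same rubric, two competing skills can be compared by running both on the same fixture and comparing their `score` values.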

So I started doing this myself.

Here's a work-in-progress example: plan-cmo-review, a skill to complement GStack since GStack is missing a marketing review at the time of writing. I'm not a marketing guy; the point of sharing this skill is to outline its regression setup.

Briefly, here's how my exploration progressed:

- I used GStack on a couple of products and realized the resulting design_document.md was leading me to failure, mainly marketing-wise.

- I dug into the skill's failures manually with Claude Opus 4.8's help and ended up finding the correct solution.

- I asked Claude to build a plan-cmo-review skill, ran it, and it arrived at a flawed solution (similar to GStack's output).

- I gave Claude the correct (manual) solution to analyze and add as a regression fixture with a scoring rubric.

- Claude ran the (blind) regression — it failed. We iterated several times and found the key problem: Claude was trusting my prompts implicitly as the ultimate truth. Claude believed GStack knew what it was doing. GStack believed I knew what I was doing. But I was doing product/startup research — and by definition, "research" is what you do when you don't know what you're doing. That trust chain is what broke the skills.

- We fixed the trust problem and the regression test passed. We added a few more. They passed.

- I had Claude run the regressions multiple times — cracks appeared. Claude iterated the skill. Now they pass.

- This methodology is still flawed. I'd like to try running different LLMs, cross-model judging, and a lot more regression tests.
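The "run the regressions multiple times" step can be sketched as a pass-rate measurement over repeated runs. This is a rough illustration under stated assumptions, not the setup used in plan-cmo-review: `pass_rate`, `make_fake_skill`, and `challenges_premise` are hypothetical names, and the fake skill replays canned outputs where a real harness would call an LLM.

```python
from itertools import cycle
from typing import Callable, Sequence


def pass_rate(
    run_skill: Callable[[str], str],
    judge: Callable[[str], bool],
    fixture: str,
    runs: int = 5,
) -> float:
    """Run the skill `runs` times on one fixture and return the fraction judged as passing."""
    results = [judge(run_skill(fixture)) for _ in range(runs)]
    return sum(results) / runs


def make_fake_skill(canned_outputs: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: replays canned outputs in order, ignoring the prompt."""
    outputs = cycle(canned_outputs)
    return lambda prompt: next(outputs)


def challenges_premise(output: str) -> bool:
    """Hypothetical judge: passes only if the output questions the user's premise."""
    return "unverified" in output.lower()


flaky_skill = make_fake_skill([
    "Your premise is unverified; validate demand first.",
    "Great plan, ship it.",
    "The core claim is unverified; talk to customers.",
])

rate = pass_rate(flaky_skill, challenges_premise, "Review my launch plan.", runs=5)
print(f"pass rate: {rate:.0%}")
```

A single green run would hide the failing output here; only the repeated runs expose it. The same `pass_rate` function could take a different `run_skill` or `judge` per model, which is one way to approach the cross-model judging mentioned above.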

Skill: github.com/remakeai/plan-cmo-review. Notes at iliaov.substack.com.
