
I'm tired of LLM skill slop, so I built mine with regression tests

3•iliaov•1h ago

I've recently tried skills like Garry Tan's GStack, spent a week with it, and realized it has some flaws (I'll post separately about that).

Here's my problem: how do I know if a skill or prompt is any good (e.g. GStack's /office-hours)?

How do I compare similar skills (e.g. different "deep research" skills)?

Spotting broken software is (relatively) easy — it crashes, prints errors. Broken skills don't. Perfectly polished, confident-sounding skills routinely mislead me and waste my time, to the point where I wish I weren't using an LLM at all.

AI skills are software — and they should come with regression tests.

LLM teams have tons of prompt regression tests. LLM-wrapper SaaS companies have tons of prompt regression tests. But open-source skills are another story: the SKILL.md reads reasonably, yet the skill ships with zero tests (e.g. GStack's /office-hours has none at the time of writing).

Garry Tan, if you hear me — please consider shipping regression tests for your /office-hours, /plan-ceo-review, /plan-eng-review, and so on.

Regression tests should:

1. Prove the skill works correctly

2. Demonstrate correct and incorrect usage

3. Prove the skill's value

4. Come with a scoring rubric to allow skill benchmarking

The last one is the most valuable, because it lets you benchmark similar skills against each other.
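To make the rubric idea concrete, here is a minimal sketch of a weighted scoring rubric that ranks the outputs of two skills on the same fixture. Everything in it is an assumption for illustration: the `Criterion` type, the criteria names and weights, and especially the keyword checks, which stand in for whatever grader (human or LLM judge) you would actually use. It is not the rubric from plan-cmo-review.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # True when the output satisfies the criterion


def score(output: str, rubric: list[Criterion]) -> float:
    """Weighted fraction of criteria the output satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


def rank(outputs: dict[str, str], rubric: list[Criterion]) -> list[tuple[str, float]]:
    """Score each skill's output against the same rubric, best first."""
    scored = [(name, score(out, rubric)) for name, out in outputs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical rubric; keyword checks are a crude placeholder for a real grader.
RUBRIC = [
    Criterion("names a target customer", 2.0, lambda o: "target customer" in o.lower()),
    Criterion("questions the user's premise", 3.0, lambda o: "assumption" in o.lower()),
    Criterion("proposes a distribution channel", 1.0, lambda o: "channel" in o.lower()),
]

if __name__ == "__main__":
    outputs = {
        "skill_a": "The target customer is solo founders. Key assumption to verify: they pay for tooling.",
        "skill_b": "Great idea! Launch on every channel.",
    }
    for name, value in rank(outputs, RUBRIC):
        print(f"{name}: {value:.2f}")
```

Because both skills are graded against the same fixture and the same weights, the scores are comparable, which is what makes benchmarking similar skills possible.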

So I started doing this myself.

Here's a work-in-progress example: plan-cmo-review, a skill to complement GStack since GStack is missing a marketing review at the time of writing. I'm not a marketing guy; the point of sharing this skill is to outline its regression setup.

Briefly, here's how my exploration progressed:

- I used GStack on a couple of products and realized the resulting design_document.md was leading me to failure, mainly marketing-wise.

- I dug into the skill's failures manually with Claude Opus 4.8's help and ended up finding the correct solution.

- I asked Claude to build a plan-cmo-review skill, ran it, and it arrived at a flawed solution (similar to GStack's output).

- I gave Claude the correct (manual) solution to analyze and add as a regression fixture with a scoring rubric.

- Claude ran the (blind) regression — it failed. We iterated several times and found the key problem: Claude was trusting my prompts implicitly as the ultimate truth. Claude believed GStack knew what it was doing. GStack believed I knew what I was doing. But I was doing product/startup research — and by definition, "research" is what you do when you don't know what you're doing. That trust chain is what broke the skills.

- We fixed the trust problem and the regression test passed. We added a few more. They passed.

- I had Claude run the regressions multiple times — cracks appeared. Claude iterated the skill. Now they pass.

- This methodology is still flawed. I'd like to try running different LLMs, cross-model judging, and a lot more regression tests.
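The "run the regressions multiple times" step can be sketched as a pass-rate check instead of a single pass/fail. This is a rough illustration under stated assumptions: `run_skill` and `grade` are hypothetical callables you would supply (one invokes the skill on a fixture prompt, the other returns a rubric score in [0, 1]), and the thresholds are made up. The stubs below simulate a skill that fails one run in four.

```python
from itertools import cycle
from typing import Callable

MIN_SCORE = 0.8           # assumed per-run rubric score needed to count as a pass
REQUIRED_PASS_RATE = 0.9  # assumed fraction of runs that must pass


def pass_rate(
    run_skill: Callable[[str], str],
    grade: Callable[[str], float],
    fixture_prompt: str,
    runs: int = 5,
) -> float:
    """Run the skill several times on one fixture and return the fraction of passing runs."""
    passed = 0
    for _ in range(runs):
        output = run_skill(fixture_prompt)
        if grade(output) >= MIN_SCORE:
            passed += 1
    return passed / runs


def regression_ok(rate: float) -> bool:
    """A fixture only counts as green when enough of the repeated runs pass."""
    return rate >= REQUIRED_PASS_RATE


# Stubs standing in for a real LLM call and a real grader (both hypothetical).
_canned = cycle(["good", "good", "bad", "good"])


def fake_skill(prompt: str) -> str:
    return next(_canned)


def fake_grade(output: str) -> float:
    return 1.0 if output == "good" else 0.0


rate = pass_rate(fake_skill, fake_grade, "review my launch plan", runs=4)
print(rate, regression_ok(rate))
```

A single run of this stub would pass three times out of four, so a one-shot regression would usually look green; the repeated run is what exposes the crack.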

Skill: github.com/remakeai/plan-cmo-review. Notes at iliaov.substack.com.
