Can we test it? Yes, was can [video]

https://www.youtube.com/watch?v=MqC3tudPH6w

81•zdw•7mo ago

Comments

webdevver•7mo ago

i thought that's what customers were for?

somewhereoutth•7mo ago

My feeling on testing:

- If it is used by a machine, then it can be tested by a machine.

- If it is used by a human, then it must be tested by a human.

alex_smart•7mo ago

But testing by human is expensive

Anonbrit•7mo ago

and horribly unreliable even when done by competent and motivated humans, let alone most IT workers

karolinepauls•7mo ago

I've heard of good experiences using a 3rd party QA company (for frontend-heavy changes) and had pretty OK experiences doing it in-house with subject experts (though testing backend changes, so pretty much "either it works or it doesn't" in my case).

somewhereoutth•7mo ago

A good QA person/team will (be strongly incentivized to) make it their mission to find bugs and usability issues. Likely they will be domain experts, if only in the domain of the usage of your product. Worth their weight in gold, no machine can give you that.

Retric•7mo ago

Expensive isn’t unnecessary.

Developer written tests can’t tell you if your UI is intuitive for novice users.

sfn42•7mo ago

They can however tell you whether the button/form/page/whatever is working and continues working.

dylan604•7mo ago

Do those developer created tests include weird and random text that users will unwittingly break your form with? More than once has a form been broken because devs didn’t test for emoji or other glyphs in the text. Just because a button click calls a form with as expected doesn’t not mean the form will behave as expected with unexpected user input

sfn42•7mo ago

Playwright is essentially a programmable headless browser. It sends a request to a page, waits for the response to load and then for things like React rendering, secondary requests etc, then once the page is ready you can do whatever you want. You can interact with the DOM, fill in forms, scroll, click things, get a div and check its size and location, anything you want. Testing special symbols and such should be no problem at all, provided that you actually think to write that test.

Retric•7mo ago

Which is can be used for a range of things but doesn’t remove the need for independent testing.

To use one tiny concrete example. Someone who isn’t aware of U+0022 + U+201C quotation marks will happily use U+0022 everywhere without ever considering the very slightly more aesthetically pleasing option. Independent validation is about redundancy, ideally you don’t want anything to reach that stage but people are really bad at testing for issues the’ve never considered. Usability in terms of colorblindness not just actual blindness yadda yadda good testing is really involved.

necovek•7mo ago

There is fuzzing, FWIW.

But the same also holds for people who are "independent testers" and unaware of the same issues.

What I found is that software developers are bad at testing edge cases while they are in the creation mode (when they focus on happy path), but that good engineers switch to breaking mode once they try to cover things with sufficient tests. TDD also encourages breaking things first, but really, this is a mindset change that is usually skipped.

Retric•7mo ago

Fuzzing isn’t going to find either of the issues I mentioned. It’s useful for finding the kind of issues most developers I have worked with are well aware of, though may not specifically test for.

vlovich123•7mo ago

why do you think that fuzzing won’t try inputting the alternate question mark in your input?

For what it’s worth, if fuzzing is unlikely to, then I think so is manual testing unless you get really lucky or an expert is trying to break your system because they know of frequent issues with Unicode handling.

vlovich123•7mo ago

When testing with a human: you catch it in QA runs late in the cycle that test case once it’s added to the test coverage as a known issue

When testing automated: you prevent the commit in main sequence nice the regression test is added to the test coverage.

This is the point - adding tests to your automation suite gets cheaper over time and guarantees higher quality than manual QA runs with not much more cost to the business. Not doing so is choosing to take on tech debt. But pretending like random text is something your QA will catch on their own is a pipe dream. You’re better off investing in property tests and fuzzing.

somewhereoutth•7mo ago

Perhaps cheaper than upsetting your customers when they are the ones doing the testing?

No machine testing can completely (or even mostly) validate a human interface.

sfn42•7mo ago

Tools like playwright allow pretty nice web UI testing. You can make sure things are working properly and continue working as things change.

Doesn't replace human testing but it does ease the human load and help catch problems and regressions before they get to the human testers.

Ygg2•7mo ago

> Tools like playwright allow pretty nice web UI testing.

Can they test for color blindness and myopia?

dylan604•7mo ago

“Doesn't replace human testing”

It’s like you stopped reading to try to score internet points or something. The answer to your question was one more sentence from where you stopped reading

Ygg2•7mo ago

The point of the question was to both see if the bar has changed and to address the overall article's point "Can we test it".

sfn42•7mo ago

There are tools that can analyze a page for accessibility. I don't have experience with them but I wouldn't be surprised if you could include them in a CI/CD pipeline. I don't know whether they specifically control for color blindness and myopia but I wouldn't be surprised if they do. I know they check for contrast and stuff like that.

You could also check things like colors etc using Playwright but I would say it's probably the wrong tool for that job. It's more about testing functionality - make sure a page has the right content and works correctly on a technical level.

Without automated tools this type of thing can take a lot of time - in order to ensure quality you would basically have to click through the entire application for every release. Otherwise you might end up with some minor changes to one page breaking a different page and you'd never know until a tester checked it out or a user complained. With Playwright and similar tools you can catch regressions like this automatically.

__MatrixMan__•7mo ago

User acceptance testing is a good idea but doing it in cases where you can get away with cheaper testing is not.

diggan•7mo ago

Seems like blog spam, the actual content (presentation/talk) is at: https://www.youtube.com/watch?v=MqC3tudPH6w

tomhow•7mo ago

Thanks, we updated the URL from https://antithesis.com/blog/2025/bugbash_2025/mitchell_hashi...

cloogshicer•7mo ago

I think what people really mean when they say "This can't be tested" is:

"The cost of writing these tests outweighs the benefit", which often is a valid argument, especially if you have to do major refactors that make the system overall more difficult to understand.

I do not agree with test zealots that argue that a more testable system is always also easier to understand, my experience has been the opposite.

Of course there are cases where this is still worth the trade-off, but it requires careful consideration.

MoreQARespect•7mo ago

It's often shorthand for "this cant be unit tested" or "this isnt dependency injected" even though integration tests are perfectly capable of testing non-DI code.

The author's claims that we should isolate code under test better and rely more on snapshot testing are spot on.

diggan•7mo ago

> rely more on snapshot testing are spot on

Never quite liked "snapshot testing" which I think has a better name under "golden master testing" or similar anyways.

Reason for the dislike, is that it's basically a codified "Trust me bro, it's correct" without actually making clear what you are asserting with that test. I haven't found any team that used snapshot testing and didn't also need to change the snapshots for every little change, which obviously defeats the purpose.

The only things snapshot testing seems to be good for, is when you've written something and you know it'll never change again, for any reason. Beyond that, unit tests and functional/integration tests are much easier to structure in a way so you don't waste so much time reviewing changes.

MoreQARespect•7mo ago

>I haven't found any team that used snapshot testing and didn't also need to change the snapshots for every little change, which obviously defeats the purpose

I dont see how this even defeats the point, let alone obviously.

If a UI changes I appreciate being notified. If a REST API response changes I like to see a diff.

If somebody changes some CSS and it changes 50 snapshots, it isnt a huge burden to approve them all and sometimes it highlights a bug.

parthdesai•7mo ago

In the talk, he also talks about passing a flag which would actually update the snapshots/golden files if needed

necovek•7mo ago

Are you really going to be reviewing all those 50 snapshots carefully?

The bound of testing on the "other side" is to test just enough not to increase the maintenance burden too much.

diggan•7mo ago

> I dont see how this even defeats the point, let alone obviously.

You generally don't want to have to change all the tests for every change, particularly implementation details. Usually when people do snapshot testing of for example UI components, they serialize the entire component then assert the full component is the same as the snapshot, so any change requires the snapshot to be updated.

> If somebody changes some CSS and it changes 50 snapshots, it isnt a huge burden to approve them all and sometimes it highlights a bug.

Lets say person A initially created all these snapshots, person B did a change that shows 50 snapshots changed, who's responsibility is it to make sure the snapshots are correct? Person A doesn't have the context of the change, so less ideal. Person B doesn't know the initial conditions Person A had in mind, so also less ideal.

When you have unit tests and functional tests you can read through the test and know what the person who wrote it wanted to test. With snapshots, you only know that "This was good", sometimes only with the context of the name of the test itself, but no assertions you can read and say "Ah, X still shows Y so all good".

MoreQARespect•7mo ago

>You generally don't want to have to change all the tests for every change

You generally don't want every change to result in a lot of work. If changing a lot of tests means looking at a table of 30 images and diffs, scanning for problems and clicking an "approve button", that isn't a lot of work though.

>Lets say person A initially created all these snapshots, person B did a change that shows 50 snapshots changed, who's responsibility is it to make sure the snapshots are correct?

The person who made the change.

>Person B doesn't know the initial conditions Person A had in mind, so also less ideal.

Yes they will because the initial conditions also had a snapshot attached. If your snapshot testing is even mildly fancy it will come with a diff too.

>When you have unit tests and functional tests you can read through the test and know what the person who wrote it wanted to test. With snapshots, you only know that "This was good",

If you made a change and you can see the previous snapshot, current snapshot and a diff and you never know if the change was ok then you probably shouldn't be working on the project in the first place.

And no, the same isn't necessarily true of unit or functional tests - I've seen hundreds of unit tests that assert things about objects and properties which are tangentially related to the end user and come with zero context attached and I have to try and figure out wtf the test writer meant by "assert xyz_obj.transitional is None". With a user facing snapshot it's obvious.

tharkun__•7mo ago

No regular developer will carefully review 50 changed snapshots. They'll stop doing a proper job of it after the third or fourth that looks like the same trivial unimportant change and miss the bug found by snapshot 37.

I do agree that a lot of people write bad tests meaning the test name does not properly describe what the test is supposed to be about so I can check the test implementation and assertions against intent. They also like you say assert on superfluous things.

The problem with snapshots is that it's doing the exact same thing. It asserts on lots of completely unimportant stuff. Unlike proper unit tests however I can't make it better. In a unit test I can make an effort to educate my peers and personally do a good job of only asserting relevant things and writing individual tests where the test name explains why "transitional has to be None".

Snapshots are a blunt no-effort tool for the lazy dev that then later requires constant vigilence and overcoming things humans are bad at (like carefully checking 50 snapshots) by many different humans VS a unit test that I can make good and easy to comprehend and check by putting in effort once when I write it. A good one will also be easy to adjust when needed if it comes time to actually need to change an assertion.

MoreQARespect•7mo ago

>No regular developer will carefully review 50 changed snapshots.

If you follow a habit of making small, incremental changes in your pull requests (which you should anyway), those 50 snapshots will generally all change in the same way. A glance across all of them to see that (for example) a box moved to the left in all of them.

>The problem with snapshots is that it's doing the exact same thing. It asserts on lots of completely unimportant stuff.

A problem more than made up for by the fact that rewriting it, eyeballing it and approving it is very quick.

>Unlike proper unit tests however I can't make it better

If the "then" part of a unit test "expects" a particular error message or json snippet, for instance, it's a giant waste of time to craft the expected message yourself if the test can simply snapshot the text from the code and you can eyeball it to see if the diff looks correct.

I have written thousands of unit tests like this and saved a ton of time doing so. I've also worked with devs who did it your way (e.g. craft a whole or part of the json output and put it in the assert) and the only difference was that they took longer.

>and personally do a good job of only asserting relevant things

If, for example, you're testing a dashboard and the spec is a hand scribbled design that shows roughly what it looks like, all the other ways to assert the relevant things are vastly more expensive and impractical to implement.

In practice most devs (and you too i expect) will simply leave the design untested and design regressions by, say, some gnarly css issue would be undetected except by manual test.

>Snapshots are a blunt no-effort tool for the lazy dev

95% of devs don't use snapshot tests because they're super sensitive to flakiness and because ruthlessly eliminating flakiness from code and tests requires engineering discipline and ability most devs don't have.

For those who can, it massively accelerates test writing.

tharkun__•7mo ago

Regrettably we still have some snapshot tests in our code base, yes. I cringe every time one goes red and I'm supposed to check them. Like you say, I eyeball them and after the fourth that's the same pattern I give up and just regenerate them. Meaning they might as well not be there coz they won't catch any actual bugs for me.

We try to replace them every time we come across one that needed adjusting actually. Quick is bad here. And yes they're flaky as hell if you use them for everything. Even a tiny change to just introduce a new element that's supposed to be there can change unrelated parts of snapshots because of generated names in many places.

Asserting on the important parts of some json output is not generally more expensive at all. You let the code run to generate the equivalent of a snapshot and then paste it in the assertion(s) and adjust as necessary. Yes it takes more time than a snapshot. But optimizing for time at that end is the wrong optimization. You're optimizing one devs time expenditure while making both the reviewers' and later devs' and reviewers' time expenditure larger (if they want to do a proper job instead of eyeballing and then YOLOing it).

As I see it, devs using snapshots are the opposite of a 10x dev. It's being a 0.1x dev. Thanks but no thanks.

MoreQARespect•7mo ago

>We try to replace them every time we come across one that needed adjusting actually. Quick is bad here. And yes they're flaky as hell if you use them for everything. Even a tiny change to just introduce a new element that's supposed to be there can change unrelated parts of snapshots because of generated names in many places.

If you can't keep the flakiness under control then yeah, they'll be worse than useless because they will fail for no discernible reason at all.

tharkun__•7mo ago

Oh the reasons are discernable. I call it flaky when you make an unrelated change and the snapshots change. You go check why and all you can do is a facepalm. What you and I call "unrelated" may be different. Such as when I make a CSS change that simply affects some generated class names for example and a bunch of snapshots fail. This will be worse in code bases with lots of reusable CSS of course, i.e. your blast radius for flakiness will be much larger the more CSS reuse you have and the more snapshot tests you have. Ours is very controllable but only because we're doing the right things (such as reducing snapshot use).

That's when you start cursory looks at the first few changes and then just regenerate them, which means they will never find any actual bugs coz you ignore them.

It's "the boy who cried wolf" basically.

diggan•7mo ago

> scanning for problems and clicking an "approve button", that isn't a lot of work though.

But you're actually mentally listing requirements for each one of those snapshots you check, which hopefully is the same as the previous person who run it, but who knows?

> Yes they will because the initial conditions also had a snapshot attached. If your snapshot testing is even mildly fancy it will come with a diff too.

Maybe I didn't explain properly. Say I create a component, and use snapshot testing to verify that "This is how it should look". Now next person changes something that makes that snapshot "old", and the person needs to look at the diff and new component, and say "Yeah, this is now how it should look". But there is a lot of things that are implicitly correct in that situation, instead of explicitly correct. How can we be sure the next person is mentally checking the same requirements as I did?

> If you made a change and you can see the previous snapshot, current snapshot and a diff and you never know if the change was ok then you probably shouldn't be working on the project in the first place.

It seems to work fine for very small and obvious things, but for people make changes that affect a larger part of the codebase (which happens from time to time if you're multiple people working on a big codebase), it's hard to needing to implicitly understand what's correct everywhere. That's why unit/functional tests are so helpful, they're telling us what results we should expect, explicitly.

> I've seen hundreds of unit tests that [...] With a user facing snapshot it's obvious.

I agree that people generally don't treat test code with as much thought as other "production" code, which is a shame I suppose. I guess we need to compare "well done snapshot testing" with "well done unit/functional testing" for it to be a fair comparison.

For that last part, I guess we're just gonna have to agree to disagree, most snapshot test cases I come across aren't obvious at all.

MoreQARespect•7mo ago

>Maybe I didn't explain properly. Say I create a component, and use snapshot testing to verify that "This is how it should look". Now next person changes something that makes that snapshot "old", and the person needs to look at the diff and new component, and say "Yeah, this is now how it should look". But there is a lot of things that are implicitly correct in that situation, instead of explicitly correct. How can we be sure the next person is mentally checking the same requirements as I did?

This is a problem that applies equally to code and all other tests to an even greater extent. For them it is a problem dealt with with code reviews.

The great thing about snapshots which doesn't apply to code and functional tests is that they can be reviewed by UX and the PM as well. In this respect they are actually more valuable - PM and UX can spot issues in the PR. The PM doesn't spot when you made a mistake interpreting the requirements that is only visible in the code of the functional test or when it's something the functional test didn't check.

>It seems to work fine for very small and obvious things, but for people make changes that affect a larger part of the codebase (which happens from time to time if you're multiple people working on a big codebase), it's hard to needing to implicitly understand what's correct everywhere

It should not be hard to ascertain if what you see is what should be expected. E.g. if text disappears from a window where there was text before and you only made a styling change then that's obviously a bug.

>I guess we need to compare "well done snapshot testing" with "well done unit/functional testing"

Snapshots are not a replacement for functional tests they are a part of good functional tests. You can write a functional test that logs in and cheaply checks some arbitrary quality of a dashboard (e.g. the div it goes in exists) or you can write a functional test that cheaply snapshots the dashboard.

The latter functional test can give confidence that nothing broke when you refactor the code underneath it and the snapshot hasn't changed. The former can give you confidence that there is still a div present where the dashboard was before. This is significantly less useful.

hansvm•7mo ago

The purpose of snapshot testing is to not have observable changes if you think there shouldn't be observable changes. To that end, a pattern I like is:

- Don't store/commit the snapshot and have an "update" command. Your CI/CD should run both versions of the software and diff them. That eliminates a lot of the toil.

- You should have a completely trivial way to mark that a given PR intends to have observable changes. That could be a tag on GitHub, a square-bracket thing in a commit message, etc. Details don't matter a ton. The point is that the test just catches if things have changed, and a person still needs to determine if that change is appropriate, but that happens often enough that you should make that process easy.

- Culturally, you should split out PRs which change golden-affecting behavior from those which don't. A bundle of bug fixes, style changes, and a couple new features is not a good thing to commit to the repo as a whole.

The net effect is that:

1. Performance improvements, unrelated features, etc are tested exactly as you expect. If your perf enhancement changes behavior, that was likely wrong, and the test caught it. If it doesn't, the test gives you confidence that it really doesn't.

2. Legitimate changes to the golden behavior are easy to institute. Just toggle a flag somewhere and say that you do intend for there to be a new button or new struct field or whatever you're testing.

3. You have a historical record of the commits which actually changed that behavior, and because of the cultural shift I proposed they're all small changes. Bisecting or otherwise diagnosing tricky prod bugs becomes trivial.

ChrisMarshallNY•7mo ago

This is the case.

I did a lot of work on hardware drivers and control software, and true testing would often require designing a mock that could cost a million, easy.

I've had issues, with "easy mocks"[0].

A good testing mock needs to be of at least the same Quality level as a shipping device.

[0] https://littlegreenviper.com/concrete-galoshes/#story_time

fleventynine•7mo ago

I've had a lot of success writing driver test cases against the hardware's RTL running in a simulation environment like verilator. Quick to setup and very accurate, the only downside is the time it takes to run.

And if you want to spend the time to write a faster "expensive mock" in software, you can run your tests in a "side-by-side" environment to fix any differences (including timing) between the implementations.

necovek•7mo ago

It's cool to learn about Verilator: I've been proposing our HW teams give us the simulations based on their HW design for us to target with SW, but I am so out of the loop on HW development, that I can't push them in this direction (because they'll just give me: "that's interesting, but it's hard", which frustrates me to no end).

Can you perhaps do a write-up of what you've done and how "slow" it was, and if you've got ideas to make it faster?

thrtythreeforty•7mo ago

The hardest part is toolchains, for two reasons. First, Verilator doesn't have complete SV language support, although it's gotten better. Second, hardware has a tendency to accumulate some of the most contorted build systems I've ever seen and most hardware engineers don't actually know how to extricate it.

Once it's actually successfully run through Verilator, it's a C++ interface. Very easy to integrate if your sim already has a notion of "clock tick."

maccard•7mo ago

> A good testing mock needs to be of at least the same Quality level as a shipping device.

unfortunately I disagree - it needs to be the same quality as the device. If youe mock is reliable but your device isn't, you have a problem.

ChrisMarshallNY•7mo ago

Good point.

That was actually sort of the problem I had, in my story.

necovek•7mo ago

I like to put it on its head: a proper fake for any component should be designed by the authors of the component: they can provide one to the same behaviour relatively cheaply.

With hardware, I try to ask for simulated HW based on the designs, but I usually don't get ever it.

convolvatron•7mo ago

If I'm working next to the hardware group, I generally write my own. this allows me to make progress on drivers/firmware before hardware is available. if its an asic we can even spend a little time making it run in the DV environment - they get vectors for free and overall we get more confidence that the firmware/driver is going to work on delivered silicon.

if something doesn't work on actual hardware, now we're in a really good place to have to a conversation. clearly the simulator differs from the actual design, and we can just focus on sussing that out. otherwise the conversation can be alot more difficult and can devolve into 'hardware's broken' vs 'software person doesn't have a clue'.

rhizome31•7mo ago

Testing is a skill. The more you do it, the less expensive it becomes.

cloogshicer•7mo ago

The main cost isn't writing the tests themselves but the increased overall system complexity. And that never goes down.

diggan•7mo ago

> but the increased overall system complexity

I think this happens because people don't treat the testing code as "production code" but something else. You can have senior engineers spending days on building the perfect architecture/design, but when it comes to testing, they behave like a junior and just writes whatever comes to mind first, and never refactor things like they would "production code", so it grows and grows and grows.

If people could spend some brain-power on how to structure things and what to test, you'd see the cost of the overall complexity go way down.

vlovich123•7mo ago

Sounds like you’re responding to the title without listening to the presentation. He literally says this in the intro.

aspenmayer•7mo ago

When this happens, how do you determine who gets the karma? Is it right and just and logical for OP to get karma for submitting a URL that HN readers didn't visit after being updated by mods, or for OP to get karma previously for a URL that was deemed lacking with regards to the guidelines? It seems like they should get one or the other, but not both.

Just some food for thought. The reason I mention it, is that a person who has been commented upon by me previously for using scripts submitted this before OP, and if precedent holds, they should get the karma, not OP. But they have been commented upon by mods for having used scripts, but somehow haven't been banned for doing so, because dang has supposedly interacted with them/spoken with them, as if that could justify botting. But I digress.

To wit:

https://news.ycombinator.com/item?id=44449650

sebastianmestre•7mo ago

They're fake internet points, it's no big deal

aspenmayer•7mo ago

Then why don't you email mods to delete your account?

Your post adds nothing to the discussion I made with my reply, so what are you even doing here?

thunderbong•7mo ago

Because it's possible they didn't know about it? Because it's possible they might post something of more value another time?

Let's assume good faith.

aspenmayer•7mo ago

Who didn't know about what now?

This is HN inside baseball. If you or they don't know, they should ask somebody or lurk more, to be blunt. To phrase this in a better way, they should post better if they want a better response.

tomhow•7mo ago

It's an imperfect system and some human judgment about fairness applies.

In this case, the original URL submitted had the YouTube video prominently embedded, along with some commentary. It's no big deal to do that, as sometimes the commentary adds something. In this case nobody seems to think it does so I updated the URL to the primary source, but there's no need to penalize the submitter.

If the primary/best source for a topic has been submitted by multiple people, all being equal we'll promote the first-submitted one to the front page and mark later ones as dupes.

But things aren't always equal, and if the first submission was from a user who submits a lot and gets many big front page hits, we don't feel the need to just hand them more karma and will happily give the points to a less karma-rich user who submitted it later, especially if theirs is already on the front page.

aspenmayer•7mo ago

> But things aren't always equal, and if the first submission was from a user who submits a lot and gets many big front page hits, we don't feel the need to just hand them more karma and will happily give the points to a less karma-rich user who submitted it later, especially if theirs is already on the front page.

I know dang has said that generated comments and bots are against HN guidelines previously, but should I read what you're not saying between the lines, and should I interpret it as consistent with HN policy to use scripts or bots to post to HN? Because that seems to be happening in this case, and keeps coming up, because it keeps happening. After a certain point, inaction is a choice and a policy by proxy.

inb4 someone asks what it is and what it is that is happening; if you don't already know, ask somebody else, and if you're not a mod on HN, I'll likely consider your response a concern troll if it isn't on topic for this thread or attempts to derail it, which I fully expect to happen, but I am constantly willing to be surprised, and often am on this site.

As the previous construction was rather strained, I'll say it plainly:

Is it okay, as in, consistent with and affirming the HN guidelines, to use scripts/bots to post to HN, or not? My reading tells me no.

tomhow•7mo ago

We ban bots that submit a lot of spam or other content that’s a bad fit for HN. We ban LLM-comment bots because we want the discussions to be between humans sharing original human ideas.

If someone has written a script that finds and submits articles that are good for HN, I don’t see why we should ban them. We can use human judgment to decide which of their posts should be rewarded or downranked; we’re doing manual curation all the time anyway.

aspenmayer•7mo ago

> We ban bots that submit a lot of spam or other content that’s a bad fit for HN. We ban LLM-comment bots because we want the discussions to be between humans sharing original human ideas.

> If someone has written a script that finds and submits articles that are good for HN, I don’t see why we should ban them. We can use human judgment to decide which of their posts should be rewarded or downranked; we’re doing manual curation all the time anyway.

You should ban them for the same reason generated comments are banned.

This is not a great outcome for HN, so I don't expect this to actually occur, mind you!

I just think that the status quo unfairly advantages those who have already demonstrated that they're actively and successfully gaming the system. If the points don't matter, then script users' contributions matter even less than a human-initiated post, so why not run the script in-house under an official username at that point. This arm's length scripted behavior leaves a bad taste after Digg and every other site that has done this, or allowed others to do it. Either the content is user-submitted, or it isn't. Bots aren't people.

tomhow•7mo ago

I don’t know what kind of game is going on when you quote my whole comment like this. That’s not a norm on HN.

We don’t treat them the same because they don’t have the same effects and aren’t the same thing. They’d be the same thing if someone made a bot to write whole articles and post them as submissions. Of course we’d ban that, and indeed we have banned bots like that, or at least banned sites whose contents are mostly LLM-generated, along with the accounts that submitted them.

If a user’s script finds and submits good articles that attract many upvotes and stimulate good discussions, it would be a net loss for HN to ban them.

aspenmayer•7mo ago

It's not a game, but many people try to derail the conversation when I bring this up. I want to have a record of what was said when I reply when I think it's worth quoting, so as there is no ambiguity as to what was said or what I was replying to.

> If a user’s script finds and submits good articles that attract many upvotes and stimulate good discussions, it would be a net loss for HN to ban them.

I agree. So why not run that script in-house, so that we have transparency about the account being scripted? Or, the script user could say something to that effect in their bio. Or, they could use a dedicated scripting account. Anything would be better than this, because it's a bad look for HN, and I'm tired of talking about it, but it's an issue with values to allow scripted submissions as long as they're URLs, but not if they're comments. It's a distinction without a difference to my view.

That being said, I can't disagree that they find good content. I am fine with it being a quirk of the rules that scripts and bots are allowed. I don't think it's what's best for HN, and I don't think that it's the status quo, but as you say, you do a lot of manual intervention. If a script user is making posts that are good, that is reducing your workload, so I think you may be close enough to the situation to care much more than I do, and so I take your view to heart and trust your judgement that it's not a problem to you or HN in your view, but I think differently, and I don't know what you know. If I did know what you do know, I'm willing to believe I would think as you do, so I don't mean to accuse or blame, or find fault.

I like the topic because after a certain point, generated comments and posts may be indistinguishable from other HN posts, which would be weird, but I would be okay with that as long as the humans remain and are willing to curiously engage with each other. I'm not really anti-AI at all, I just find the guidelines rather helpful, and yet I hate to be a scold. Please don't interpret this thread as critical of HN, but rather bemused by HN.

tomhow•7mo ago

The thing is you just don’t seem to be able to convey why it is that it’s such an important issue. Nobody else cares about it. But you take us to task about it again and again with these lengthy comments but not a clear statement of what the fundamental problem is.

For what it’s worth we have systems and methods for finding good new articles, like the second chance pool. We wouldn’t ban other people’s scripts for the same reason there’s always room in the marketplace for different variants of a product; someone else’s variant may be better than ours at least in some ways.

Ultimately there’s just no need for us to spend a whole lot of time thinking about it because it doesn’t cause problems (that we can’t address with routine human oversight).

aspenmayer•7mo ago

> The thing is you just don’t seem to be able to convey why it is that it’s such an important issue.

I have conveyed why it's important to me. Whether or not you find my exhortation convincing or not is likely not due to my lack of attempts to convince you why I feel the way I do. Of all the things you could find lacking, I don't think it's unclear. Scripted behavior isn't authentic. Coordinated inauthentic behavior is problematic to me, because I work in security amongst other hats I wear, and I have another name for coordinated inauthentic behavior.

> Nobody else cares about it.

Tomte cares, and posted in this and the other thread? I'm sure other people would care if they saw the thread. Funny how people only care about what they're aware of.

> For what it’s worth we have systems and methods for finding good new articles, like the second chance pool. We wouldn’t ban other people’s scripts for the same reason there’s always room in the marketplace for different variants of a product; someone else’s variant may be better than ours at least in some ways.

> Ultimately there’s just no need for us to spend a whole lot of time thinking about it because it doesn’t cause problems (that we can’t address with routine human oversight).

You don't have to spend any time to ask script users to mention it in their bio! If they don't, they don't. Rule breakers are not an indictment of the concept of making rules, or following ones that already exist, or closing gaps in the rules once identified.

If there was nothing I or anyone else could say to change your mind, perhaps the failure of communication is on your end, and may even be willful. I come to HN to interact with human beings making user-submitted posts and comments. That's what HN is to me, and this announcement is a departure from all prior communications, because you've laid bare what I drew attention to last time this came up, where Tomte also posted. Apparently people are scripting submissions and farming posts on HN. I don't see how that isn't a problem on its face. The fact that you know and do nothing because perfect detection and enforcement is impossible makes me wonder if the reason you allow it is because it is expedient to moderating HN, and not what is necessarily best for HN as determined by HN users.

> But you take us to task about it again and again with these lengthy comments but not a clear statement of what the fundamental problem is.

And yet the problem has been identified, and it remains signposted by me because the problem has been denied to exist in favor of criticisms of the length of my posts. What even is the issue? Should I post fewer, more convincing words? I am honestly at a loss as to how to continue this thread, so I will rest and await any reply from you or anyone else.

tomhow•7mo ago

Ok, I definitely need to step away from this thread because where I am it’s nearly midnight on a Sunday night and I’m emotionally exhausted after dealing with some family illness this evening. That’s what’s exacerbated my impatience with the topic and duration of the discussion. It’s just not how I expected/hoped to be spending my the last hours of my Sunday night when I just posted a routine mod comment about a url change. Sorry for letting that get the better of me.

That Tomte sees it as a problem is interesting, because I wouldn’t have been surprised to find they also used some kind of scripts to find articles to post; indeed I just casually assumed they did, at least to some extent. I mean that not as an accusation, just an impression I’d picked up from observing their posts over many years.

Ok, point taken about how it makes you feel about HN. I’ll think more about it as we continue to work on ways to improve everything.

aspenmayer•7mo ago

Thank you for your efforts. You and the entire mod team make this site worth visiting, so something you're doing must be working, and likely many things I don't need to know about or trouble myself with. Most sites don't even have named mods, so the fact that you post is here is just as important to me as anything else HN and you all have done to make this place what it is.

I trust you know what you're doing, or I wouldn't be here.

I hope you can rest and recharge. Nothing I said today or probably ever on HN is more important than the people in our lives, which is why I think preserving a place for humans is worth it, even if it's not perfect. I appreciate all you do, even if I have a strange way of showing it.

Tomte•7mo ago

I disagree with aspenmeyer violently on all those fairness issues that came up a few weeks ago in the similar discussion, but I think he‘s right with pushing back on this in general.

I am a very active HN user and was totally surprised by the declaration that submission bots are fine with you. It goes against pretty much all earlier communication (which, in fairness, was usually about comment bots), but I think in the past my submission behavior was repeatedly ruled okay when challenged by other users, because I‘m submitting manually.

I do feel I‘m losing interest a bit when we‘re all just firing scripts. Manual submission at least makes you care enough to spend those seconds, bot submissions mean nobody cares anymore because you can just fling shit and see what sticks. And maybe we high-volume submitters should even be reigned in more.

(Also it feels unfriendly towards lobste.rs, when HN is effectively just bulk copying their submissions.)

aspenmayer•7mo ago

Thanks for this. I think you make good points, and I apologize if I came off as hostile toward you directly or indirectly in that other thread. The closest I have come to scripted submissions is using the HN provided bookmarklet, and even that felt like scripting to me. I'm not a purist about this, but the inconsistency feels strange to me too, and I would rather not have scripts be the primary posting or comment method, but please keep the faith with me here on HN.

If we don't make an effort and intention to care and stay here despite bad calls by refs, we'll just have to take our ball and go home, but for many, they don't have another home like HN, so that would be a net loss for them. We owe it to ourselves and each other to show up where we want to effect change that wouldn't happen without our presence and involvement. That's what user generated content is all about!

tptacek•7mo ago

I don't know why this would matter, re: lobste.rs; it's two sites with the same remit and different groups of users, of course they're going to share stories.

It feels like this site can always use strong incentives for better, more informed, and more civil conversation in threads. But it doesn't need much incentive to get good stories posted; that happens organically.

The simplest solution here would be to eliminate "karma" outright.

Tomte•7mo ago

I see a difference between the same set of stories being submitted organically, and one VC-driven site basically scraping the smaller site.

Eliminating karma is probably not a bad idea, I don‘t think it will stifle submissions, but it may improve commenting dynamics.

tptacek•7mo ago

I don't see that distinction at all, for what it's worth, and while there are things I have liked about Lobste.rs, I don't think it's a more authentic community than this one. Either way, so long as there is karma, there's an arb to run between Lobste.rs and HN, and people are going to run it. Fighting it seems like a waste of energy.

j_w•7mo ago

My takeaway from this is that when you have a system or feature that "can't be tested" that you should try to isolate the "untestable" portions to increase what you can test.

The "untestable" portions of a code base often gobble up perfectly testable functionality, growing the problem. Write interfaces for those portions so you can mock them.

gblargg•7mo ago

This often comes naturally for portable code, where the OS-specific things are separated from the core logic that just processes data. His keyboard handling example illustrates this.

testthetest•7mo ago

The ROI on unit tests, as well as the answer to "Can we test it?" is changing fast in the age of AI.

1. AI is making unit tests nearly free. It's a no-brainer to ask Copilot/Cursor/insert-your-tool-here to include tests with your code. The bonus is that it forces better habits like dependency injection just to make the AI's job possible. This craters the "cost" side of the equation for basic coverage.

2. At the same time, software is increasingly complex: a system of a frontend, backend, 3rd-party APIs, mobile clients, etc. A million passing unit tests and 100% test coverage mean nothing in a world when a tiny contract change breaks the whole app. In our experience the thing that gives us the most confidence is black-box, end-to-end testing that tests things exactly as a real user would see them.

necovek•7mo ago

I've talked about this in a few places, but there is a simple way to get to higher testability and trust in the automated tests being done with little to no extra effort for complex systems — this slightly tunes how you define the testing pyramid, but it's important to always test exactly enough, or you hit maintenance burden that kills you down the line.

All it takes is for a component developer to build a fully functioning component by having proper fakes for all the external dependencies. If this is religiously applied through the org, then all components in the org will have all of their real business logic and integration points sufficiently tested.

Then, your end-to-end tests can work against a very complex system that is extremely fast, but which uses the actual business logic, other than the very extreme boundaries to external systems not under your control.

However, AI is usually not much of a help here, because most software is not written in the way to easily construct either a fake or real production component from the same component — and AI is trained on existing source code.

YZF•7mo ago

I haven't looked at much at the quality of unit tests generated by AI but I would imagine it would suffer from the same general problems as humans. That is that most of these unit tests would be bad tests. They would include assumptions about implementation and make changes to the code harder.

Yes- black box and end-to-end testing. Better forced by the nature of the black box to test the behavior of the system. It's still possible to test the wrong things (behaviors you don't care about) but that tends to stand out more. Also still difficult to make these tests reliable.

devjab•7mo ago

I don't think anyone ever regretted doing a test but runtime assertions are so much better because they deal with issues when they happen, rather than trying to predict potential failures. This was probably forgotten as interpreted languages became more popular, but hopefully we're going to see a return to less "developer focused" ways of dealing with errors, so that your pacemaker doesn't stop because someone forgot to catch the exception they didn't think could happen with their 100% coverage.

throwaway314155•7mo ago

> This was probably forgotten as interpreted languages became more popular

I think the notion that all interpreted languages instantly became test-driven in response to lacking good type systems is overblown. In practice both tests and runtime assertions are performed in e.g. Python. Usually more of the latter and less of the former in my experience.

devjab•7mo ago

I'm not sure they became instantly test-driven, but I do think that the exception handling model as default error handling in Java and C# was a big drive toward test-driven development. Eventhough both have the ability to do runtime assertions and even work with contracts to some degree. It's also why I think Go's panic is far preferable, but that's a different story, and Go doesn't have runtime assertions so there is that.

But it was probably more than just type systems and error handling. What it has meant, at least in my part of the world, is that a lot of programmers (and I mean the vast majority) are very bad at deal with errors exactly when they happen. This is anecdotal, but I think I've met one or two developers outside the telecom and energy sectors that knew you could do runtime assertions.

YZF•7mo ago

Deal how? Do you want run time assertions in your ABS firmware or your aircraft landing software?

devjab•7mo ago

Well yes, considering NASA wanted it in theirs. There are a lot of advantages in immediately failing when your software does something you don't want it to do rather than trying to continue in a corrupted or invalid state. As for how you deal with it, it depends. Erlang has a lot of "self-healing" supervisor/actor models which are widely used, but I don't have a lot of experience with those.

We use a form of recovery oriented computing in our solar plant components, though to be fair, we only use a very tiny part of it since we basically reboot to the last known valid state. Which isn't exactly 100% safety, but it's not like anyone dies if an inverter or datalogger goes offline for an hour.

I'm curious to know how you would deal with errors in these sort of things without runtime assertions or similar. Maybe my knolwedge is outdated?

YZF•7mo ago

I haven't done firmware in more than a decade but back when I did asserting was a no-no. You can't crash. Let's say I have a machine and I'm in the process of moving some actuators crashing might mean I destroyed the machine. That can't be better then letting things that can keep going going even if there's some unexpected condition. Physical safety is typically there by design and doesn't rely on software (there are some exceptions/processes for safety critical software).

So what to do:

- Extensive use of things like state machines where you can convince yourself that you are handling all possible inputs at every possible state.

- Clean/clear/simple code.

- Code reviews.

- Test the heck of your software before you ship it. Automated testing and manual testing.

Never fails isn't possible. Mechanical systems also fail. Electronics also fail. Software can be pretty reliable and either way you can't really tell what to do if there's an unexpected condition. Expected errors should be handled (let's say you try to move an actuator and it doesn't move, or you have a position error on your servo). Firmware can be designed to not run out of memory (so that failure mode can be eliminated by design).

What's to guarantee you won't be just rebooting in a loop if you reboot? That said, I guess if you have analyzed things well enough and that's your preferred recovery then fair enough. But how much state does software for an inverter have that can't just be handled by code?

devjab•7mo ago

> You can't crash.

That's the exact point of runtime assertions. You can't crash, so you fail exactly at the moment something is corrupted. One of the reasons Go didn't devided to include runtime assertions (and one of the only choises for Go I really dislike) is that they aren't exactly safe because you still have to deal with that failure, and I suppose it's very easy to fuck it up.

What you're describing in your post is essentially what I think of when I say runtime assertion. You use them to revert to the previous valid state and retry, you can micro-reboot, you can go into a "factory setting" where your software can continue until an engineer can actually work on it. Things like that. The primary difference is that with runtime assertions, you stop exactly when the corrupted states occours instead of trying to continue with that corrupted state.

It's not like this should replace testing or any of the other things you bring up. I still strongly recommend testing. Tests are for prevention of errors, however, and runtime assertions are for dealing with errors when they happen at runtime. Exception handling is the other way around it, but with exceptions you continue with the corrupted state and try to deal with it down the line. I don't personally like that.

> Never fails isn't possible.

In the previous decade we've had 0 software failures causing shutdown of equipment in any of our many solar parks. This is not to say we haven't had failures. Last I checked the data we've had around 700. incidents which required human intervention, but in every case, the software was capable of running at "factory settings" until the component could safely be repaired or replaced. By contrast we've had quite a lot of hardware failures. Now... it's not exactly life threatening if parts of a solar plant fails, at least not on most plants. In almost any case it'll only cost money, and the reason we're so tight on not failing is actually exactly that. The contracts around downtime responsibility are extremely rigid and my organisation cares quite a lot about placing that responsibility outside of "us". So somewhat ironically we're doing software "right" in this one place because of money, and not because it's the right thing to do. But hey, the work is fun.

> Clean/clear/simple code.

I would replace "Clean" in this part, but it depends on what you mean. YAGNI > Uncle Bob!

borg16•7mo ago

shout out to "A tribe called Quest" for the title (my guess)

Feedback on a client-side, privacy-first PDF editor I built

Clay Christensen's Milkshake Marketing (2011)

Show HN: WeaveMind – AI Workflows with human-in-the-loop

Show HN: Seedream 5.0: free AI image generator that claims strong text rendering

A contributor trust management system based on explicit vouches

Show HN: Analyzing 9 years of HN side projects that reached $500/month

The Floating Dock for Developers

Arcan Explained – A browser for different webs

We are not scared of AI, we are scared of irrelevance

Quartz Crystals

Show HN: I built a free dictionary API to avoid API keys

Show HN: Kybera – Agentic Smart Wallet with AI Osint and Reputation Tracking

Show HN: brew changelog – find upstream changelogs for Homebrew packages

Any chess position with 8 pieces on board and one pair of pawns has been solved

LLMs as Language Compilers: Lessons from Fortran for the Future of Coding

Projecting high-dimensional tensor/matrix/vect GPT–>ML

Show HN: Free Bank Statement Analyzer to Find Spending Leaks and Save Money

Our Stolen Light

Matchlock: Linux-based sandboxing for AI agents

Show HN: A2A Protocol – Infrastructure for an Agent-to-Agent Economy

Drinking More Water Can Boost Your Energy

Proving Laderman's 3x3 Matrix Multiplication Is Locally Optimal via SMT Solvers

Fire may have altered human DNA

"Compiled" Specs

The Next Big Language (2007) by Steve Yegge

Open-Weight Models Are Getting Serious: GLM 4.7 vs. MiniMax M2.1

Using AI for Code Reviews: What Works, What Doesn't, and Why

Show HN: Solnix – an early-stage experimental programming language

DoNotNotify is now Open Source

The British Empire's Brothels

Feedback on a client-side, privacy-first PDF editor I built

Clay Christensen's Milkshake Marketing (2011)

Show HN: WeaveMind – AI Workflows with human-in-the-loop

Show HN: Seedream 5.0: free AI image generator that claims strong text rendering

A contributor trust management system based on explicit vouches

Show HN: Analyzing 9 years of HN side projects that reached $500/month

The Floating Dock for Developers

Arcan Explained – A browser for different webs

We are not scared of AI, we are scared of irrelevance

Quartz Crystals

Show HN: I built a free dictionary API to avoid API keys

Show HN: Kybera – Agentic Smart Wallet with AI Osint and Reputation Tracking

Show HN: brew changelog – find upstream changelogs for Homebrew packages

Any chess position with 8 pieces on board and one pair of pawns has been solved

LLMs as Language Compilers: Lessons from Fortran for the Future of Coding

Projecting high-dimensional tensor/matrix/vect GPT–>ML

Show HN: Free Bank Statement Analyzer to Find Spending Leaks and Save Money

Our Stolen Light

Matchlock: Linux-based sandboxing for AI agents

Show HN: A2A Protocol – Infrastructure for an Agent-to-Agent Economy

Drinking More Water Can Boost Your Energy

Proving Laderman's 3x3 Matrix Multiplication Is Locally Optimal via SMT Solvers

Fire may have altered human DNA

"Compiled" Specs

The Next Big Language (2007) by Steve Yegge

Open-Weight Models Are Getting Serious: GLM 4.7 vs. MiniMax M2.1

Using AI for Code Reviews: What Works, What Doesn't, and Why

Show HN: Solnix – an early-stage experimental programming language

DoNotNotify is now Open Source

The British Empire's Brothels

Can we test it? Yes, was can [video]

Comments