OpenAI’s Windsurf deal is off, and Windsurf’s CEO is going to Google

https://www.theverge.com/openai/705999/google-windsurf-ceo-openai
601•rcchen•9h ago•379 comments

ETH Zurich and EPFL to release a LLM developed on public infrastructure

https://ethz.ch/en/news-and-events/eth-news/news/2025/07/a-language-model-built-for-the-public-good.html
409•andy99•12h ago•59 comments

Faking a JPEG

https://www.ty-penguin.org.uk/~auj/blog/2025/03/25/fake-jpeg/
188•todsacerdoti•8h ago•37 comments

Replication of Quantum Factorisation Records with an 8-bit Home Computer [pdf]

https://eprint.iacr.org/2025/1237.pdf
64•sebgan•5h ago•7 comments

Preliminary report into Air India crash released

https://www.bbc.co.uk/news/live/cx20p2x9093t
220•cjr•11h ago•372 comments

jank is C++

https://jank-lang.org/blog/2025-07-11-jank-is-cpp/
230•Jeaye•14h ago•76 comments

Fundamentals of garbage collection (2023)

https://learn.microsoft.com/en-us/dotnet/standard/garbage-collection/fundamentals
36•b-man•3d ago•4 comments

Leveraging Elixir's hot code loading capabilities to modularize a monolithic app

https://lucassifoni.info/blog/leveraging-hot-code-loading-for-fun-and-profit/
42•ronxjansen•4d ago•1 comment

Dict Unpacking in Python

https://github.com/asottile/dict-unpacking-at-home
77•_ZeD_•3d ago•22 comments

Reverse proxy deep dive

https://medium.com/@mitendra_mahto/cross-posted-from-https-startwithawhy-com-reverseproxy-2024-01-15-reverseproxy-deep-dive-html-c3443dc3e0e5
16•miggy•3d ago•3 comments

Show HN: I built a toy music controller for my 5yo with a coding agent

https://github.com/jeffmccune/sonoserve
5•JeffMcCune•3d ago•1 comment

Andrew Ng: Building Faster with AI [video]

https://www.youtube.com/watch?v=RNJCfif1dPY
206•sandslash•1d ago•52 comments

Bill Atkinson's psychedelic user interface

https://patternproject.substack.com/p/from-the-mac-to-the-mystical-bill
402•cainxinth•20h ago•213 comments

HDD Clicker generates HDD clicking sounds, based on HDD Led activity

https://www.serdashop.com/HDDClicker
61•starkparker•7h ago•30 comments

Rice rebels: Research reveals grain's brewing benefits

https://phys.org/news/2025-06-rice-rebels-reveals-grain-brewing.html
7•PaulHoule•2d ago•2 comments

Upgrading an M4 Pro Mac mini's storage for half the price

https://www.jeffgeerling.com/blog/2025/upgrading-m4-pro-mac-minis-storage-half-price
338•speckx•17h ago•213 comments

Psilocybin shows promise as anti-aging therapy

https://neurosciencenews.com/psilocybin-longevity-aging-29425/
33•joak•2h ago•2 comments

Repaste Your MacBook

https://christianselig.com/2025/07/repaste-macbook/
205•speckx•18h ago•98 comments

Apple vs the Law

https://formularsumo.co.uk/blog/2025/apple-vs-the-law/
365•tempodox•1d ago•371 comments

OpenAI delays launch of open-weight model

https://twitter.com/sama/status/1943837550369812814
104•martinald•6h ago•72 comments

A software conference that advocates for quality

https://bettersoftwareconference.com/
78•leoncaet•9h ago•48 comments

Monorail – Turn CSS animations into interactive SVG graphs

https://muffinman.io/monorail/
66•stanko•3d ago•7 comments

Astronomers race to study interstellar interloper

https://www.science.org/content/article/astronomers-race-study-interstellar-interloper
115•bikenaga•15h ago•56 comments

Activeloop (YC S18) Is Hiring AI Search and Python Back End Engineers (Onsite, MV)

https://careers.activeloop.ai/
1•davidbuniat•10h ago

Introduction to Digital Filters (2024)

https://ccrma.stanford.edu/~jos/filters/
50•ofalkaed•12h ago•11 comments

AWS Free Tier Changes on July 15, 2025

https://freetier.co/articles/aws-free-tier-changes-july-15-2025
26•coop182•8h ago•23 comments

America's largest power grid is struggling to meet demand from AI

https://www.reuters.com/sustainability/boards-policy-regulation/americas-largest-power-grid-is-struggling-meet-demand-ai-2025-07-09/
40•riffraff•3h ago•14 comments

Measuring power network frequency using junk you have in your closet

https://halcy.de/blog/2025/02/09/measuring-power-network-frequency-using-junk-you-have-in-your-closet/
26•zdw•9h ago•5 comments

Show HN: RULER – Easily apply RL to any agent

https://openpipe.ai/blog/ruler
61•kcorbitt•13h ago•11 comments

At Least 13 People Died by Suicide Amid U.K. Post Office Scandal, Report Says

https://www.nytimes.com/2025/07/10/world/europe/uk-post-office-scandal-report.html
602•xbryanx•19h ago•496 comments

How to avoid P hacking

https://www.nature.com/articles/d41586-025-01246-1
116•benocodes•2mo ago

Comments

p4ul•2mo ago
If the conclusion is "be transparent", I'm strongly supportive.

And moreover, I would be even more supportive if we found a way to change the incentives for tenure and promotion such that reproducibility was an important factor in how we make decisions about grants, tenure, and promotion.

analog31•2mo ago
Just make it even more cutthroat than it already is. Replacing one hackable incentive system with another will just produce a new set of hacks.

Disclosure: I left academia before I had to worry about any of this.

neilv•2mo ago
> As any gambler knows, if you roll the dice often enough, eventually you’ll get the result you want by chance alone

You never count your results, when you're sitting at the lab bench, there will be time enough for counting, when the experiments are done.

boulos•2mo ago
Nicely done. Since many folks may not know the original song: https://en.m.wikipedia.org/wiki/The_Gambler_(song)

(And TIL, this wasn't original to Kenny Rogers!)

neilv•1mo ago
I almost did this verbatim quote of the lyrics, which paralleled the article's sentence, and is relevant to P-hacking, but it's the wrong advice:

    Every gambler knows
    That the secret to survivin'
    Is knowin' what to throw away
    And knowin' what to keep
saghm•1mo ago
I don't know, maybe knowing when to "hold them" versus "fold them" and "walk away" would be a valuable skill here. The phrasing sounds off in the part you quoted because in poker you can only play a given hand once, and after you've lost, you need to draw an entirely new dataset and start fresh.
TorKlingberg•1mo ago
It depends on what Poker variant you're playing. These days Texas hold 'em is dominant, but Five-card draw used to be very common, especially for informal games.
saghm•1mo ago
I'm not sure I understand the relevance. My point is that you can only play a hand once, regardless of the variant, and after that you deal a new hand rather than getting to go back and make different decisions on the same hand.
smallmancontrov•2mo ago
It might be below the fold, but it looks like they're missing the most important p-hacking strategy of all: the dogshit null hypothesis. It's very reliable and it's the most common type of p-hacking that I see.

It's easy to create a dogshit null hypothesis by negligence or by "negligence" and it's easy to reject a dogshit null hypothesis by simply collecting enough data as it automatically crumbles on contact with the real world -- that's what makes it dogshit. One might hope that this would be caught by peer review (insist on controls!) but I see enough dogshit null hypotheses roaming around the literature that these hopes are about as realistic as fairy dust. In practice, the dogshit null hypothesis reigns supreme, or more precisely it quietly scoots out of the way so that its partner in crime, the dogshit alternative hypothesis, can have an unwarranted moment in the spotlight.

aw1621107•2mo ago
> looks like they're missing the most important p-hacking strategy of all: the dogshit null hypothesis

Would you mind giving an example(s) of such and how it differs from a "good" null hypothesis?

gms7777•1mo ago
Null hypotheses are often idealized distributions that are mathematically convenient and are often over-simplifications of the distributions we'd expect if there were truly no effect (because the expected distributions are either intractable to work with, or irregular and unknown).

So for example, suppose you want to detect whether there are unusual patterns in website traffic -- a bot attack or an unexpected popularity spike. You look at page views per hour over several days, with the null hypothesis that page views are normally distributed, with constant mean and variance over time.

You run a test, and unsurprisingly, you get a really low p-value, because web traffic has natural fluctuations, it's heavier during the day, it might be heavier on weekends, etc.

The test isn't wrong -- it's telling you that this data is definitely not normally distributed with constant mean and variance. But it's also not meaningful because it's not actually answering the question you're asking.
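
A minimal Python sketch of that scenario, with all numbers invented: plain seasonal traffic, no attack and no popularity spike, tested against a constant-mean null. The test rejects decisively, but it is answering a question nobody asked.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    hours = np.arange(24 * 14)                    # two weeks of hourly buckets
    daily_cycle = 500 + 300 * np.sin(2 * np.pi * (hours - 8) / 24)
    views = rng.poisson(daily_cycle)              # ordinary day/night traffic

    day = views[(hours % 24 >= 8) & (hours % 24 < 20)]
    night = views[(hours % 24 < 8) | (hours % 24 >= 20)]

    # Null: page views have the same mean at every hour.
    t, p = stats.ttest_ind(day, night, equal_var=False)
    print(f"p = {p:.2e}")   # vanishingly small, yet there is no anomaly at all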

nmca•2mo ago
This would be much better with an example
smallmancontrov•2mo ago
"I ran a t-test on the untreated / treated samples and the difference is significant! The treatment worked!"

...but the data table shows a clear trend over time across both groups because the samples were being irradiated by intense sunlight from a nearby window. The null model didn't account for this possibility, so it was rejected, just not because the treatment worked.

That's a relatively trivial example and you can already imagine ways in which it could have occurred innocently and not-so-innocently. Most of the time it isn't so straightforward. The #1 culprit I see is failure to account for some kind of obvious correlation, but the ways in which a null hypothesis can be dogshit are as numerous and subtle as the number of possible statistical modeling mistakes in the universe because they are the same thing.
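
A minimal sketch of that example with made-up numbers: the treatment does nothing, but the untreated samples were measured early and the treated ones late, while a shared drift (the "sunlight") pushes every measurement up over time.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    t_untreated = np.arange(0, 30)     # measured first
    t_treated = np.arange(30, 60)      # measured later
    drift = 0.1                        # shared drift per time step (the sunlight)

    untreated = drift * t_untreated + rng.normal(0, 1, 30)
    treated = drift * t_treated + rng.normal(0, 1, 30)   # no treatment effect added

    print(stats.ttest_ind(untreated, treated).pvalue)    # "significant", but the
    # rejected null never included the drift, so it says nothing about the treatment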

somenameforme•2mo ago
I think you're observing an issue with experimental designs that don't actually challenge the null hypothesis, rather than with poor null hypotheses themselves. In other words, papers creating experiments that don't actually test the hypothesis. There was a major example of this with COVID. A typical way observational studies assessed the efficacy of the vaccines was by comparing outcomes between normalized samples of unvaccinated and vaccinated individuals who came to the hospital. Unvaccinated individuals generally had worse outcomes, therefore the vaccines must be effective.

This logic was used repeatedly, but it fails to account for numerous obvious biases. For instance unvaccinated people are generally going to be less proactive in seeking medical treatment, and so the average severity of a case that causes them to go to the hospital is going to be substantially greater than for a vaccinated individual, with an expectation of correspondingly worse overall outcomes. It's not like this is some big secret - most papers mentioned this issue (among many others) in the discussion, but ultimately made no effort to control for it.

skribb•1mo ago
I'm not great at stats so I don't understand this example. Wouldn't the sunlight affect both groups equally? How can an equal exposure to sunlight create a significant difference?
vharuck•1mo ago
If I understand the parent commenter, here's a common example from population-level statistics like public health:

"State X saw a mortality rate last year that was statistically significantly higher than the national rate. We should focus our intervention there."

The null hypothesis is that the risks of death are exactly the same in the state vs the nation. That may work with experimental sample sizes, but at the population level you'll often have massive sample sizes. A statistically significant difference is not interesting by itself. It's just the first hurdle to jump before even discussing the importance of the difference. But I've seen publications (especially data reports with sprinklings of discussion) focus entirely on statistically significant differences in the narrative next to tables.

This isn't P-hacking an experiment, but it is abusing and misunderstanding statistical significance to make decisions.
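
A sketch of that pattern with invented numbers, using a two-proportion z-test from statsmodels: a state rate barely above the national rate comes out "statistically significant" purely because the populations are in the millions, which says nothing about whether the difference matters.

    from statsmodels.stats.proportion import proportions_ztest

    deaths = [91_800, 2_700_000]            # hypothetical: state, rest of nation
    population = [10_000_000, 300_000_000]
    # Rates: 0.918% vs 0.900%, a gap of roughly 18 deaths per 100,000 people.
    z, p = proportions_ztest(deaths, population)
    print(f"z = {z:.1f}, p = {p:.1e}")      # strongly "significant" on sample size alone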

shoo•2mo ago
see also: Andrew Gelman's blog

> The problem with p-hacking is not the "hacking," it’s the "p." Or, more precisely, the problem is null hypothesis significance testing, the practice of finding data which reject straw-man hypothesis B, and taking this as evidence in support of preferred model A.

https://statmodeling.stat.columbia.edu/2021/09/30/the-proble...

See also this post from 2014 with a discussion of Confirmationist and falsificationist approaches to reasoning in science: https://statmodeling.stat.columbia.edu/2014/09/05/confirmati...

> I understand falsificationism to be that you take the hypothesis you love, try to understand its implications as deeply as possible, and use these implications to test your model, to make falsifiable predictions. The key is that you’re setting up your own favorite model to be falsified.

> In contrast, the standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But, rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify and thus represent as evidence in favor of A.

> As I said above, this has little to do with p-values or Bayes; rather, it’s about the attitude of trying to falsify the null hypothesis B rather than trying to falsify the researcher’s hypothesis A.

> Take Daryl Bem, for example. His hypothesis A is that ESP exists. But does he try to make falsifiable predictions, predictions for which, if they happen, his hypothesis A is falsified? No, he gathers data in order to falsify hypothesis B, which is someone else’s hypothesis. To me, a research program is confirmationalist, not falsificationist, if the researchers are never trying to set up their own hypotheses for falsification.

> That might be ok—maybe a confirmationalist approach is fine, I’m sure that lots of important things have been learned in this way. But I think we should label it for what it is.

See also: Andrew Gelman and Eric Loken's 2014 "garden of forking paths" paper: https://sites.stat.columbia.edu/gelman/research/unpublished/...

gregwebs•2mo ago
This is one of the most disturbing articles I have seen related to reproducibility because it seems to imply that scientists don’t already know this.
a_bonobo•2mo ago
As a biologist, all the field wants is p < 0.05. Understanding what it actually means is seen as unnecessary. It's a hurdle to pass to have another paper on your CV.
pizlonator•2mo ago
The worst part about this:

> Running experiments until you get a hit

Is that it's literally what us software optimization engineers do. We keep writing optimizations until we find one that is a statistically significant speed-up.

Hence we are running experiments until we get a hit.

The only defense I know against this is to have a good perf CI. If your patch seemed like a speed-up before committing, but perf CI doesn't see the speed-up, then you just p-hacked yourself. But even that isn't foolproof.

You just have to accept that statistics lie and that you will fool yourself. Prepare accordingly.

throwanem•2mo ago
Why is this bad for you? You're optimizing software, not trying to describe reality. Monte Carlo and Drunkard's Walk are fine.
analog31•2mo ago
You're churning the user experience for no reason. Maybe constant optimization churn is one of the reasons why UIs are so bad.
throwanem•2mo ago
Perf, though? If a perf optimization changes the UI noticeably other than by making it smoother or otherwise less janky, someone is lying to someone about what "performance" means. Likely though that be, we needn't embarrass ourselves by following the sad example.

No, UIs churn because when they get good and stay that way, PMs start worrying no one will remember what they're for. Cf. 90% of UI changes in iOS since about version 12.

babuloseo•2mo ago
I thought languages such as Rust, plus flamegraphs and so on, were supposed to help us avoid doing all this testing and optimization, right? I use the built-in analysis tools that come with cargo, plus what I have on my OS, tools like cutter or other reverse engineering tools. Even in Python I use the default or standard profiling and optimization tools. I sometimes wonder whether I'm not doing enough, but the recommended default tools should cover most edge cases and performance cases, right?
pizlonator•2mo ago
Yeah!

And software ultimately fails at perfect composability. So if you add code that purports to be an optimization then that code most likely makes it harder to add other optimizations.

Not to mention bugs. Security bugs even

babuloseo•2mo ago
Heck, even the AI doesn't start with security by default, from the models I have tested. It's really, really weird.
cortesoft•2mo ago
Well, what is the test you are using to measure performance? Maybe the optimizations help performance in some cases and hurt performance in others... your test might not fully match all real world workloads.
jean_lannes•2mo ago
These seem like two different things. Testing many different optimizations is not the same experiment; it's many different experiments. The SE equivalent of the practice being described would be repeatedly benchmarking code without making any changes and reporting results only from the favorable runs.
pizlonator•2mo ago
Doesn’t matter if it’s the same experiment or not.

Say I’m after p<0.05. That means that if I try 40 different purported optimizations that are all actually neutral duds, one of them will seem like a speedup and one of them will seem like a slowdown, on average.
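
A quick simulation of that arithmetic, with made-up benchmark numbers: 40 patches that change nothing, each compared to baseline with a t-test at p < 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    false_hits = 0
    for _ in range(40):                          # 40 neutral "optimizations"
        baseline = rng.normal(100, 5, size=30)   # 30 benchmark runs each
        patched = rng.normal(100, 5, size=30)    # identical distribution: a dud
        if stats.ttest_ind(baseline, patched).pvalue < 0.05:
            false_hits += 1
    print(false_hits)   # ~2 on average: one apparent speedup, one apparent slowdown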

daveFNbuck•2mo ago
That's not p hacking. That's just the nature of p values. P hacking is when you do things to make a particular experiment more likely to show as a success.
doubletwoyou•2mo ago
what they’re referring to might be better put as applying a patch once and then running it 500 times until you get a benchmark that's better than baseline for some reason

which is understandably a bit more loony

pizlonator•2mo ago
Nah it could be 20 different patches.
starspangled•2mo ago
> Is that it's literally what us software optimization engineers do. We keep writing optimizations until we find one that is a statistically significant speed-up.

I don't think that is what it is saying. It is saying you would write one particular optimization (your hypothesis), and then you would run the experiment (measuring speed-up) multiple times until you see a good number.

It's fine to keep trying more optimizations and use the ones that have a genuine speedup.

Of course the real world is a lot more nuanced -- oftentimes measuring the performance speed-up involves a hypothesis as well ("Does this change to the allocator improve network packet transmission performance?"). You might find that it does not, but you might run the same change on disk IO tests to see if it helps that case. That is presumably okay too if you're careful.

LegionMammal978•2mo ago
"Multiple times" doesn't have to mean "no modifications". Suppose the software is currently on version A. You think that changing it to a version B might make it more performant, so you implement and profile it. You find no difference, so you figure that your B implementation isn't good enough, and write a slight variation B', perhaps moving around some loops or function calls. If that makes no difference, you keep writing variations B'', B''', B'''', etc., until one of them finally comes out faster than version A. You finally declare that version B (when properly implemented) is better than version A, when you've really just tried a lot more samples.
starspangled•2mo ago
Well it does mean "no modifications" to the hypothesis, the hypothesis being about the performance of code A and B. Code B' would be a change.

It's just semantics, but the point is that the article wasn't saying the same thing OP was worried about. There's nothing wrong with testing B, B', B'', etc. until you find a significant performance improvement. You just wouldn't test B several times and take the last set of data when it looks good. Almost goes without saying really.

LegionMammal978•1mo ago
Sure, it may not be precise repetition, but my idea here is that none of B', B'', etc. are really different than B (they may even compile down to the exact same bytecode), they're just the same thing but written differently. And in fact, none of these are really faster than A, even if they're all "changes". But it's the same issue as any other form of p-hacking, where you keep trying more and more trivial B-variations until you eventually get the result that you're looking for, by random chance. (Cf. the example in xkcd 882, which does change the experimental protocol each time, but only trivially.)

There is, in fact, "something wrong" with this, which is what GP was pointing out. It's literally covered under "Playing with multiple comparisons" in TFA.

(Personally, to combat this, I've ignored the fancy p-values and resorted to the eyeball test of whether it very consistently produces a noticeable speedup.)

babuloseo•2mo ago
How can I do this in Python? What modules?
bbertelsen•2mo ago
There's another cheeky example of this where you select a pseudo-random seed that makes your result significant. I have a personal seed, I use it in every piece of research that uses random number generation. It keeps me honest!
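
For illustration, a sketch of the cheeky version (don't do this): scan seeds until a pure-noise comparison comes out "significant". A single pre-committed seed removes the temptation.

    import numpy as np
    from scipy import stats

    for seed in range(1000):
        rng = np.random.default_rng(seed)
        a = rng.normal(0, 1, 20)
        b = rng.normal(0, 1, 20)        # same distribution: no real effect
        if stats.ttest_ind(a, b).pvalue < 0.05:
            print(f"'significant' at seed {seed}")   # turns up within a few dozen tries
            break
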
cypherpunks01•2mo ago
Like the old saying goes,

"It is difficult to get a researcher to stop P hacking, when his career depends on his not stopping P hacking."

bjornsing•2mo ago
Yeah that was kind of my feeling too while skimming through this: ”Good luck with that…”

It’s not a knowledge problem. It’s a values and incentives problem.

WhitneyLand•1mo ago
It is an old saying, and I’m not sure there’s much use to it as it feels like a mitigation.

No doubt the system needs to change, but lots of careers benefit from cheating or unethical behavior. It doesn’t rationalize it or force a choice on anyone.

eviks•2mo ago
The irony of the article appearing in the "career" section when following its advice means you'll not have a career
gwerbret•2mo ago
> Stopping an experiment once you find a significant effect but before you reach your predetermined sample size is classic P hacking.

Although much of the article is basic common sense, and although I'm not a statistician, I had to seriously question the author's understanding of statistics at this point. The predetermined sample size (which sets the statistical power) is usually based on an assumption made about the effect size; if the effect size turns out to be much larger than you assumed, then a smaller sample size can be statistically sound.

Clinical trials very frequently do exactly this -- stop before they reach a predetermined sample size -- by design, once certain pre-defined thresholds have been passed. Other than not having to spend extra time and effort, the reasons are at least twofold: first, significant early evidence of futility means you no longer have to waste patients' time; second, early evidence of utility means you can move an effective treatment into practice that much sooner.

A classic example of this was with clinical trials evaluating the effect of circumcision on susceptibility to HIV infection; two separate trials were stopped early when interim analyses showed massive benefits of circumcision [0, 1].

In experimental studies, early evidence of efficacy doesn't mean you stop there, report your results, and go home; the typical approach, if the experiment is adequately powered, is to repeat it (three independent replicates is the informal gold standard).

[0]: https://pubmed.ncbi.nlm.nih.gov/17321310/

[1]: https://pubmed.ncbi.nlm.nih.gov/16231970/

coolcase•2mo ago
Sounds like a variable-cost experiment. Each observation costs $x, like an A/B split on Google ads. Why keep paying for A when you know B is better already?
rrr_oh_man•2mo ago
Google Optimize used to tell you to let an experiment run for one to two weeks (?), exactly because early strong results tend not to hold up in the long run.

-> https://en.wikipedia.org/wiki/Regression_toward_the_mean

dr_dshiv•1mo ago
Seasonality effects, too
nialse•2mo ago
Small samples have more variability than large samples and thus more often show spurious large effects.
coolcase•1mo ago
So you end up with a higher threshold for confidence at p<0.05 or whatever you want p to be under. Comes out in the maths!

Toss a coin 10 times and it comes up heads 10 times. There is a 1 in 2^10 (approx 1 in 1000) chance of that happening by chance for an unbiased coin.

I'm convinced it is biased.

20 times I am freaking convinced.

I don't need another 1000 tosses.
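
The arithmetic, spelled out; scipy's binomial test gives the same number.

    from scipy.stats import binomtest

    print(0.5 ** 10)   # 0.0009765625, i.e. about 1 in 1024
    print(binomtest(10, 10, p=0.5, alternative="greater").pvalue)   # same value
    print(0.5 ** 20)   # roughly 1 in a million for 20 heads in a row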

azan_•1mo ago
It’s more like you are supposed to toss 1000 times, and after 500 tosses you get a lucky streak of 5 heads in a row and then decide to end the experiment and conclude that the coin is biased.
coolcase•1mo ago
Oh yeah. Don't do that! Look at all 500 tosses.
ekianjo•2mo ago
There is another reason to keep clinical trials running as long as designed: to understand the safety and side-effect implications.
parpfish•2mo ago
In lots of human studies, you can’t just stop at an arbitrary number of participants because you’ve counterbalanced manipulations to decorrelate potential confounders (e.g., which color stimulus is paired with reward, the order of trials).
hiddencost•2mo ago
https://commons.m.wikimedia.org/wiki/File:P-hacking_by_early...

The author is absolutely correct. Early stopping is a classic form of p hacking. See attached image for an illustration.

If you want to be rigorous, you can define criteria for early stopping such that it's not, but you require relatively stronger evidence.

Clinical trials that stop early do so typically at predefined times with higher significance thresholds.
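
A sketch of why naive peeking inflates false positives, with illustrative numbers only: null data, check the p-value after every 10 new observations per group, and declare victory the first time it dips below 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    false_positives, n_experiments = 0, 200
    for _ in range(n_experiments):
        a, b = rng.normal(0, 1, 500), rng.normal(0, 1, 500)   # no true effect
        for n in range(10, 501, 10):                          # peek as data accrue
            if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
                false_positives += 1
                break
    print(false_positives / n_experiments)   # well above the nominal 0.05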

mjburgess•1mo ago
The region where `p` hits the red line should be called "publish or perish".
bjornsing•2mo ago
There are of course statistical methods designed to support early stopping. But I don’t think you can use a regular p-test every day and decide to stop if p < 0.05. That’s something else.
AstralStorm•1mo ago
You use a full two-sided ANOVA F-test with multiple-comparison correction for that. Even these tests are sometimes not conservative enough, because the correction is a bit of a guess.

You will end up with a much higher number of trials required to hit the p-value than the version with a predetermined number of trials and no stopping point based on p.

Say, in a single-variable, single-run ABX test, 8 is the usual number needed according to Fisher's frequentist approach. If you do multiple comparisons to hit 0.05 you need, I believe, 21 trials instead. (Don't quote me on that, compute your own Bayesian beta prior probability.)

The number of trials to differentiate from a fair coin is the typical comparison prior, giving a beta distribution. You're trying to set up a ratio between the two of them, one fitted to your data, the other null.

thelamest•1mo ago
The general topic and some specific ways to estimate a correction are described under this term: https://en.wikipedia.org/wiki/Sequential_analysis
jpeloquin•1mo ago
Multiple comparisons and sequential hypothesis testing / early stopping aren't the same problem. There might be a way to wrangle an F test into a sequential hypothesis testing approach, but it's not obvious (to me anyway) how one would do so. In multiple comparisons each additional comparison introduces a new group with independent data; in sequential hypothesis testing each successive test adds a small amount of additional data to each group so all results are conditional. Could you elaborate or provide a link?
pcrh•1mo ago
The distinction is between ‘data peeking’, i.e. repeatedly checking the p-value you've obtained and stopping if it falls below 0.05, and repeating assays in the light of new information. Such new information can relate to the distribution of the values, the expected effect size, or any other parameter that you did not know at the outset of the study.

In ‘data peeking’, the flaw is that if an assay is repeated often enough, one will eventually get a result that deviates far from the mean result. This is a natural consequence of the data having a normal distribution, i.e. not all results will be identical. It's the equivalent of getting six heads or tails in a row (which should happen at least once if you flip a coin 200 times), and then reporting your coin as biased.

Repeating an assay because the distribution of the data is not what you thought, or because the likely difference between means is smaller than you thought is a valid approach.

Source: "Big little lies: a compendium and simulation of p-hacking strategies" by Angelika M. Stefan and Felix D. Schönbrodt

https://royalsocietypublishing.org/doi/10.1098/rsos.220346
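
A quick Monte Carlo check of that coin analogy: how often does a run of six identical outcomes show up somewhere in 200 fair flips?

    import numpy as np

    rng = np.random.default_rng(0)
    trials, hits = 10_000, 0
    for _ in range(trials):
        flips = rng.integers(0, 2, 200)
        run = longest = 1
        for prev, cur in zip(flips[:-1], flips[1:]):
            run = run + 1 if cur == prev else 1
            longest = max(longest, run)
        if longest >= 6:
            hits += 1
    print(hits / trials)   # close to 1: a streak of six is nearly guaranteed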

dccsillag•1mo ago
No, it's generally not valid -- it will depend on the specifics of the test (especially if the test is valid only asymptotically). You need some method that supports sequential inference. Nowadays your best bet is probably some sort of anytime-valid method from the e-value literature https://en.wikipedia.org/wiki/E-values https://projecteuclid.org/journals/statistical-science/volum...
srean•1mo ago
> I had to seriously question the author's understanding of statistics at this point.

I think you may want to start the questioning closer to home.

Early stopping is fine as long as the test has been designed with the possibility of early stopping in mind and this possibility has been factored into the p-value formulation.

parpfish•2mo ago
I was heavily encouraged to do what would later be called “p-hacking”, but it looked different from what they describe here. This article describes p-hacks for people that aren’t into math/stats. I always ended up p hacking because I was into stats methods.

Somebody would say “here’s an old dataset that didn’t work out, I bet you can use one of those new stats methods you’re always reading about to find a cool effect!”, and then the fishing expedition takes off.

A couple weeks later you show off some cool effects that your new cutting edge results were able to extract from an old, useless dataset.

But instead of saying “that’s good pilot data, let’s see if it holds up with a new experiment”, you’re told “you can publish that! Keep this up and maybe you’ll be lucky enough to get a job someday!”

AstralStorm•1mo ago
The practice you describe is called data dredging, though. The thing about it is that you do not know enough experimental design details to make sure it was all above board, and it gets worse the older the dataset is.

Normally when doing that you need multiple-comparison corrections and conservative stats. That won't get you published though, or if you do get published you won't get noticed except by someone running a meta-analysis. Perhaps not even then. Usually you end up with negative results from reanalysis, evidence of tampering, or small effect sizes.

And this does not reliably detect dataset manipulation, p-hacking on the part of the experimenters, or accidental violations of the protocol, not even necessarily if the data collection included measures to prevent them.

In short: you cannot 100% trust any dataset you did not make. Not even as part of the team that makes it.

nlitened•1mo ago
If you "dredge" any data set (even the one you can 100% trust) over and over with random hypotheses until p-value is <0.05, you will eventually (actually, pretty quickly) support some false hypothesis. That's why "data dredging" is also p-hacking.
karma_fountain•1mo ago
Yes, as I understand it there is bias inherent in any dataset due to the fact it is a sample. Data dredging is just looking for that bias. You could do that, but then you'd have to confirm with a new experiment.
TeeMassive•1mo ago
The bias towards positive hypotheses is a consequence of the lack of fundamental discoveries. Most scientific research at this point consists of publicly funded engineering projects with no expected ROI. This is not a bad thing per se, but the culture of research built around making an impression in some noble's court is no longer viable. The incentives need to be shifted to good research and good methodology, and need to be results-agnostic.
andrewla•1mo ago
As long as there is transparency about the process, I think this sort of thing is basically fine. It's roughly at the level of observational science rather than experimental science, and it can help lead to new research to validate the effect discovered.

Where this gets dangerous is when it is taken at face value, either in scientific circles or, more commonly, journalistic circles.

godelski•1mo ago
I got my undergrad in physics and data hacking was discussed at length in every lab class. I don't know if this is a common experience but it was really one of the most beneficial lessons.

In the beginning it always felt obvious what hacking was or wasn't, but towards the end it really felt hard to distinguish. I think that was the point. It created a lot of self doubt which led to high levels of scrutiny.

Later I worked as an engineer and saw frequent examples of the errors you describe. One time another engineer asked if we could extrapolate data in a certain way; I said no, and that it would likely lead to catastrophic failure. The lead engineer said I was being a perfectionist. Well, the rocket engine exploded during the second test fire, costing the company millions and years of work. The perfectionist label never stopped despite several such instances (not to that scale). Any extra time and money to satisfy my "perfectionism" was greatly offset by the preventable failures.

Later I went to grad school for CS and it doesn't feel much different. Academia, big tech, small tech, whatever. People think you plug data into algorithms and the result you get is all there is. But honestly, that's where the real work starts.

Algorithms aren't oracles and you need to deeply study them to understand their limits and flaws. If you don't, you get burned. But worse, often the flame is invisible. A lot of time and money is wasted trying to treat those fires and it's frequent for people to believe the only flames that exist are the obvious and highly visible ones.

thatcat•1mo ago
Any books on experiment design and analysis you'd recommend?
godelski•1mo ago
I'm not sure if there's a great universal book. Generally you learn this through formal education and as parts of textbooks. I mean, there are dedicated topics like Bayesian (Optimal) Experimental Design and similar subjects, but I'm not sure that's what you're looking for. One point of contention I had while in grad school (CS) was the lack of this training for CS students, especially in data analysis classes and ML. I'm not surprised students end up believing "output = correct".

These are topics you can generally learn on your own (maybe why there's no consolidated class?). The real key is to ask a bunch of questions about your metrics. Remember: all metrics are guides, as they aren't perfectly aligned with the thing you actually want to measure. You need to understand that divergence to understand when a metric works and when it doesn't. This can be tricky, but to get into the habit, constantly ask yourself "what is assumed". There are always a lot of assumptions. Definitely not something that's usually communicated well...

notpushkin•2mo ago
> You have full access to this article via your institution.

Huh. I’m not on a university connection or anything. Is it just open access?

spinf97•2mo ago
> Ending the experiment too early

> Running experiments until you get a hit

But if I'm running an experiment, how do I know how many times to run it?

remus•2mo ago
Before you start your experiment, you calculate how many samples you need based on the estimated effect size you're looking for and how small you want your confidence interval to be.

Small effect with high confidence => more samples

Big effect with low confidence => fewer samples
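
A minimal sketch of that calculation with statsmodels' power module; effect sizes are Cohen's d and the numbers are only illustrative.

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    small = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
    large = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
    print(round(small), round(large))   # roughly 394 vs 26 samples per group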

analog31•1mo ago
In the physical sciences you can often estimate the noise level in a null measurement -- or even measure it. You often do this just to get your setup working before doing something like wasting a precious specimen on a "this time for real" measurement.
zipy124•1mo ago
The Bonferroni correction part of this article is the most important. The number of papers that don't account for this is shocking. Comparing 20 variables at a 0.05 significance threshold is extremely annoying, as you end up having to redo the analysis on each paper's data yourself to correct for it and see whether the result is still significant or not.
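
A sketch of the correction being described, using statsmodels; the p-values below are made up. With 20 comparisons at alpha = 0.05, roughly one will clear the bar by chance, so Bonferroni tests each against 0.05 / 20 instead.

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    pvals = np.array([0.002, 0.04] + [0.3] * 18)    # 20 comparisons from one paper
    reject, corrected, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
    print(reject[:2])      # [ True False]: 0.04 no longer clears 0.05 / 20 = 0.0025
    print(corrected[:2])   # [0.04 0.8 ]: raw p-values multiplied by 20, capped at 1
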
pcrh•1mo ago
>If you need statistics, you did the wrong experiment.

~Ernest Rutherford.

biofox•1mo ago
>If you don't need statistics, you did the wrong experiment.

~Psychologists

>What are statistics?

~Computer scientists

nlitened•1mo ago
Psychologists are notoriously bad at statistics though
perrygeo•1mo ago
It's not that they suck at statistics. It's that their statistics and experimental designs are artificially stuck in the dark ages. This is forced on the world by the academic publishing industry - you publish this way, or you perish. The completely unsurprising result is a reproducibility crisis that undermines the entire field. Check out "Bernoulli's Fallacy" for a good overview.

My theory isn't that Psychologists are bad at statistics. It's that the remaining problems involve lots of messy interactions and messy data that all but require statistical techniques. We just don't have the tools to extract obvious causality amidst such complexity.

BeetleB•1mo ago
Not really - it just shows up so much in psychology because they need statistics much more than, say, physics. Most physics programs in the US do not even teach statistics as a subject.
bossyTeacher•1mo ago
Medicine, biomedical science, economics, and cancer biology have similar issues, hence the reproducibility crisis in those fields
WhitneyLand•1mo ago
Reading this article tbh causes secondhand embarrassment. Ostensibly it's targeting professional scientists using the brand of a prestigious journal, yet it has the vibe of explaining ethics and common sense to school kids. We've come to the point of having to explain to PhDs why cherry picking data is bad.

I’m not criticizing the article, rather bemoaning the fact that it’s needed. Of course the problem is not just with the much maligned social sciences, it’s physics and computer science too. The controversy around Microsoft’s topological qubits, a super complex topic, in part involved the most basic kind of this nonsense, something like including 4 samples of 20 measured in the paper iirc.

The community needs to get its shit together. The world we're living in now, the post-truth era, is the result of many factors, but this is one of them. The loss of faith in science is partially a self-inflicted wound.

andrewla•1mo ago
The article cuts off for me so I do not know if they talk about this, but preregistration has to be part of the conversation moving forward.

And it has to have teeth -- withdrawn studies have to have a reputational risk that affects the credibility of future studies, even if it means publishing a retrospective or a null result in a minor journal.

some_random•1mo ago
Imagine if Nature simply didn't publish obviously p-hacked papers. Perhaps that would do more than a blog post.
dimal•1mo ago
Won’t these just make it less likely that you can publish your work, and end up damaging your career in the short term? As opposed to getting published, having a career, with a long tail risk of being found out later?

And you could mitigate that risk by publishing research that doesn’t really matter, so no one ever checks.

freehorse•1mo ago
There are many more or less obvious ways that people do p-hacking without even realising it.

A classic one is looking at, e.g., an EEG topographic plot, noticing which areas or channels within an area seem to be more promising, and running stats and follow-ups on these. There are of course degrees of this: people may have preregistered which area (let's say prefrontal cortex, for example) but leave open which channels (because it is a bit hard to make such exact guesses anyway). There are methods to deal with this (e.g. cluster permutation analysis), but often people seem to think they have to choose between averaging over too many channels, thus risking smoothing out and weakening an existing effect, or cherry-picking channels based on visual inspection of the data, which means artificially inflating an existing effect or even creating an artifactual one. Because people do not actually run a test to pick the channels, they just visually inspect the data, they do not realise this is p-hacking. The problem is that determining the researcher's degrees of freedom is not an easy task, and not one that can just be formalised in a p-adjustment technique.

There is a huge spectrum of practices around these degrees of freedom, which may arise during any stage of the data processing, ranging from obviously to subtly sketchy and problematic. And believe me, people who do this often think that they are the ones with good practices and that it's others who p-hack.

Imo the main way to actually avoid this issue is being transparent about all the decisions one makes, even if this can reduce the faith in one's results (which should be the point, if that's the case!). A lot of the time shit happens, and it is often hard to predict everything in advance in a preregistration. If the incentive were just to play it safe, then not much innovation and method experimentation would occur. It is easy to talk about preregistration as a panacea in fields with long-established practices, but much harder when the state of the art, with respect to both methods and theory, may change wildly even in the two years it may take to run a study.

I believe we need better frameworks for rigorous exploratory research. The only paper I have seen to actually take this idea seriously is this one [0], but I believe a lot of research would more honestly fit in such a framework, and not everything should be conceptualised within a hypothesis testing framework.

Method-wise, closed testing procedures also seem very interesting for such research (and can work both inferentially and for extracting hypotheses for further testing), such as [1].

[0] https://pmc.ncbi.nlm.nih.gov/articles/PMC7098547/

[1] https://openpharma.github.io/CTP/articles/closed_testing_pro...

ivan_ah•1mo ago
Non-paywall link: https://archive.is/IJcOI