
How to avoid P hacking

https://www.nature.com/articles/d41586-025-01246-1
48•benocodes•3d ago

Comments

p4ul•4h ago
If the conclusion is "be transparent", I'm strongly supportive.

And moreover, I would be even more supportive if we found a way to change the incentives for tenure and promotion such that reproducibility was an important factor in how we make decisions about grants, tenure, and promotion.

analog31•3h ago
Just make it even more cutthroat than it already is. Replacing one hackable incentive system with another will just produce a new set of hacks.

Disclosure: I left academia before I had to worry about any of this.

neilv•4h ago
> As any gambler knows, if you roll the dice often enough, eventually you’ll get the result you want by chance alone

You never count your results, when you're sitting at the lab bench, there will be time enough for counting, when the experiments are done.

boulos•2h ago
Nicely done. Since many folks may not know the original song: https://en.m.wikipedia.org/wiki/The_Gambler_(song)

(And TIL, this wasn't original to Kenny Rogers!)

smallmancontrov•4h ago
It might be below the fold, but it looks like they're missing the most important p-hacking strategy of all: the dogshit null hypothesis. It's very reliable and it's the most common type of p-hacking that I see.

It's easy to create a dogshit null hypothesis by negligence or by "negligence", and it's easy to reject a dogshit null hypothesis by simply collecting enough data, as it automatically crumbles on contact with the real world -- that's what makes it dogshit. One might hope that this would be caught by peer review (insist on controls!), but I see enough dogshit null hypotheses roaming around the literature that these hopes are about as realistic as fairy dust. In practice, the dogshit null hypothesis reigns supreme, or more precisely it quietly scoots out of the way so that its partner in crime, the dogshit alternative hypothesis, can have an unwarranted moment in the spotlight.

aw1621107•3h ago
> looks like they're missing the most important p-hacking strategy of all: the dogshit null hypothesis

Would you mind giving an example (or a few) of such a hypothesis, and explaining how it differs from a "good" null hypothesis?

nmca•3h ago
This would be much better with an example
smallmancontrov•3h ago
"I ran a t-test on the untreated / treated samples and the difference is significant! The treatment worked!"

...but the data table shows a clear trend over time across both groups because the samples were being irradiated by intense sunlight from a nearby window. The null model didn't account for this possibility, so it was rejected, just not because the treatment worked.

That's a relatively trivial example and you can already imagine ways in which it could have occurred innocently and not-so-innocently. Most of the time it isn't so straightforward. The #1 culprit I see is failure to account for some kind of obvious correlation, but the ways in which a null hypothesis can be dogshit are as numerous and subtle as the number of possible statistical modeling mistakes in the universe because they are the same thing.
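
A minimal simulation of that failure mode (made-up numbers; assumes numpy and scipy are available): both groups drift upward over time, the untreated samples happen to be measured earlier, and a plain t-test "detects" a treatment effect that isn't there.

    # Sketch: a t-test rejected by an unmodeled time trend, not by the treatment.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 30
    t_untreated = np.arange(n)          # untreated samples measured first (times 0..29)
    t_treated = np.arange(n) + n        # treated samples measured later (times 30..59)

    drift = 0.05                        # sunlight warms everything up over time
    untreated = 10 + drift * t_untreated + rng.normal(0, 1, n)
    treated = 10 + drift * t_treated + rng.normal(0, 1, n)   # true treatment effect is zero

    t_stat, p = stats.ttest_ind(treated, untreated)
    print(f"p = {p:.4f}")               # small p-value, driven by the drift, not the treatment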

somenameforme•2h ago
I think you're observing an issue with experimental designs not actually challenging the null hypothesis, rather than with poor null hypotheses themselves. In other words, papers create experiments that don't actually challenge the hypothesis. There was a major example of this with COVID. A typical way observational studies assessed the efficacy of the vaccines was by comparing overall outcomes between normalized samples of unvaccinated and vaccinated individuals who came to the hospital. Unvaccinated individuals generally had worse outcomes, and therefore the vaccines must be effective.

This logic was used repeatedly, but it fails to account for numerous obvious biases. For instance unvaccinated people are generally going to be less proactive in seeking medical treatment, and so the average severity of a case that causes them to go to the hospital is going to be substantially greater than for a vaccinated individual, with an expectation of correspondingly worse overall outcomes. It's not like this is some big secret - most papers mentioned this issue (among many others) in the discussion, but ultimately made no effort to control for it.

shoo•4h ago
see also: Andrew Gelman's blog

> The problem with p-hacking is not the "hacking," it’s the "p." Or, more precisely, the problem is null hypothesis significance testing, the practice of finding data which reject straw-man hypothesis B, and taking this as evidence in support of preferred model A.

https://statmodeling.stat.columbia.edu/2021/09/30/the-proble...

See also this post from 2014 with a discussion of Confirmationist and falsificationist approaches to reasoning in science: https://statmodeling.stat.columbia.edu/2014/09/05/confirmati...

> I understand falsificationism to be that you take the hypothesis you love, try to understand its implications as deeply as possible, and use these implications to test your model, to make falsifiable predictions. The key is that you’re setting up your own favorite model to be falsified.

> In contrast, the standard research paradigm in social psychology (and elsewhere) seems to be that the researcher has a favorite hypothesis A. But, rather than trying to set up hypothesis A for falsification, the researcher picks a null hypothesis B to falsify and thus represent as evidence in favor of A.

> As I said above, this has little to do with p-values or Bayes; rather, it’s about the attitude of trying to falsify the null hypothesis B rather than trying to falsify the researcher’s hypothesis A.

> Take Daryl Bem, for example. His hypothesis A is that ESP exists. But does he try to make falsifiable predictions, predictions for which, if they happen, his hypothesis A is falsified? No, he gathers data in order to falsify hypothesis B, which is someone else’s hypothesis. To me, a research program is confirmationalist, not falsificationist, if the researchers are never trying to set up their own hypotheses for falsification.

> That might be ok—maybe a confirmationalist approach is fine, I’m sure that lots of important things have been learned in this way. But I think we should label it for what it is.

See also: Andrew Gelman and Eric Loken's 2014 "garden of forking paths" paper: https://sites.stat.columbia.edu/gelman/research/unpublished/...

gregwebs•4h ago
This is one of the most disturbing articles I have seen related to reproducibility because it seems to imply that scientists don’t already know this.
a_bonobo•3h ago
As a biologist, all the field wants is p < 0.05. What it actually means doesn't matter. It's a hurdle to pass to have another paper on your CV.
pizlonator•3h ago
The worst part about this:

> Running experiments until you get a hit

Is that it's literally what we software optimization engineers do. We keep writing optimizations until we find one that is a statistically significant speed-up.

Hence we are running experiments until we get a hit.

The only defense I know against this is to have a good perf CI. If your patch seemed like a speed-up before committing, but perf CI doesn't see the speed-up, then you just p-hacked yourself. But even that isn't foolproof.

You just have to accept that statistics lie and that you will fool yourself. Prepare accordingly.
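
A rough sketch of that "perf CI as a second opinion" idea, with hypothetical timing arrays and a plain two-sample t-test standing in for whatever your benchmark harness actually does: treat the local numbers as exploratory and only believe a speedup that replicates on independent runs.

    # Sketch: only trust a speedup that replicates on an independent set of runs.
    import numpy as np
    from scipy import stats

    def significant_speedup(old_times, new_times, alpha=0.05):
        # Two-sample t-test; "speedup" means the new build's mean runtime is lower.
        t, p = stats.ttest_ind(new_times, old_times)
        return p < alpha and t < 0

    # Hypothetical runtimes: pre-commit numbers from a laptop, then numbers from perf CI.
    local_old = np.array([101.2, 99.8, 100.5, 102.0, 98.9])
    local_new = np.array([97.1, 96.8, 98.2, 97.5, 96.0])
    ci_old = np.array([100.4, 100.9, 99.7, 101.1, 100.2])
    ci_new = np.array([100.1, 100.6, 99.9, 100.8, 100.3])

    if significant_speedup(local_old, local_new) and not significant_speedup(ci_old, ci_new):
        print("the local 'win' didn't replicate: you may have p-hacked yourself")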

throwanem•3h ago
Why is this bad for you? You're optimizing software, not trying to describe reality. Monte Carlo and Drunkard's Walk are fine.
analog31•3h ago
You're churning the user experience for no reason. Maybe constant optimization churn is one of the reasons why UIs are so bad.
throwanem•2h ago
Perf, though? If a perf optimization changes the UI noticeably other than by making it smoother or otherwise less janky, someone is lying to someone about what "performance" means. Likely though that be, we needn't embarrass ourselves by following the sad example.

No, UIs churn because when they get good and stay that way, PMs start worrying no one will remember what they're for. Cf. 90% of UI changes in iOS since about version 12.

babuloseo•1h ago
I thought languages such as Rust, plus flamegraphs and so on, were supposed to help us avoid all this testing and optimization, right? I use the built-in analysis tools that come with cargo and whatever is on my OS, tools like cutter or other reverse-engineering tools. Even in Python I use the default or standard profiling and optimization tools. I sometimes wonder whether I'm not doing enough, or whether the recommended default tools should cover most edge cases and performance cases.
pizlonator•2h ago
Yeah!

And software ultimately fails at perfect composability. So if you add code that purports to be an optimization then that code most likely makes it harder to add other optimizations.

Not to mention bugs. Security bugs, even.

babuloseo•1h ago
Heck, even the AI doesn't start with security by default, from the models I have tested. It's really, really weird.
cortesoft•2h ago
Well, what is the test you are using to measure performance? Maybe the optimizations help performance in some cases and hurt it in others... your test might not fully match all real-world workloads.
jean_lannes•3h ago
These seem like two different things. Testing many different optimizations is not the same experiment; it's many different experiments. The SE equivalent of the practice being described would be repeatedly benchmarking code without making any changes and reporting results only from the favorable runs.
pizlonator•2h ago
Doesn’t matter if it’s the same experiment or not.

Say I’m after p<0.05. That means that if I try 40 different purported optimizations that are all actually neutral duds, one of them will seem like a speedup and one of them will seem like a slowdown, on average.
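
A quick simulation of that arithmetic (made-up benchmark noise; assumes numpy and scipy): benchmark 40 patches that change nothing against a baseline and count which ones clear p < 0.05.

    # Sketch: 40 dud "optimizations", each benchmarked against a baseline at p < 0.05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    runs, duds = 30, 40
    faster = slower = 0
    for _ in range(duds):
        baseline = rng.normal(100.0, 5.0, runs)   # runtimes in arbitrary units
        patched = rng.normal(100.0, 5.0, runs)    # the patch changes nothing
        t, p = stats.ttest_ind(patched, baseline)
        if p < 0.05:
            faster += int(t < 0)                  # lower mean runtime: looks like a speedup
            slower += int(t > 0)
    print(faster, slower)   # expect about one of each, on average over many reruns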

daveFNbuck•1h ago
That's not p hacking. That's just the nature of p values. P hacking is when you do things to make a particular experiment more likely to show as a success.
doubletwoyou•2h ago
what they’re referring to might be better put as applying a patch once and then running it 500 times until you get a benchmark that's better than baseline for some reason

which is understandably a bit more loony

pizlonator•2h ago
Nah it could be 20 different patches.
starspangled•2h ago
> Is that it's literally what us software optimization engineers do. We keep writing optimizations until we find one that is a statistically significant speed-up.

I don't think that is what it is saying. It is saying you would write one particular optimization (your hypothesis), and then you would run the experiment (measuring speed-up) multiple times until you see a good number.

It's fine to keep trying more optimizations and use the ones that have a genuine speedup.

Of course the real world is a lot more nuanced -- oftentimes measuring the performance speed-up involves a hypothesis as well ("Does this change to the allocator improve network packet transmission performance?"). You might find that it does not, but you might run the same change on disk IO tests to see if it helps that case. That is presumably okay too if you're careful.

LegionMammal978•1h ago
"Multiple times" doesn't have to mean "no modifications". Suppose the software is currently on version A. You think that changing it to a version B might make it more performant, so you implement and profile it. You find no difference, so you figure that your B implementation isn't good enough, and write a slight variation B', perhaps moving around some loops or function calls. If that makes no difference, you keep writing variations B'', B''', B'''', etc., until one of them finally comes out faster than version A. You finally declare that version B (when properly implemented) is better than version A, when you've really just tried a lot more samples.
babuloseo•1h ago
How can I do this in Python? What modules?
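
Purely as an illustration of the trap described above (hypothetical numbers; just numpy and scipy), here is what "keep trying variants until one wins" looks like when every variant is a no-op:

    # Sketch: benchmarking "variants" B, B', B'', ... of a change that does nothing,
    # and stopping at the first one that beats version A at p < 0.05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    baseline = rng.normal(100.0, 5.0, 30)         # runtimes of version A (arbitrary units)

    for variant in range(1, 1001):
        candidate = rng.normal(100.0, 5.0, 30)    # every variant is secretly identical to A
        t, p = stats.ttest_ind(candidate, baseline)
        if p < 0.05 and t < 0:                    # "significantly faster" than A
            print(f"variant #{variant} finally 'won'")   # typically within a few dozen tries
            break
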
bbertelsen•1h ago
There's another cheeky example of this where you select a pseudo-random seed that makes your result significant. I have a personal seed, I use it in every piece of research that uses random number generation. It keeps me honest!
cypherpunks01•3h ago
Like the old saying goes,

"It is difficult to get a researcher to stop P hacking, when his career depends on his not stopping P hacking."

bjornsing•1h ago
Yeah that was kind of my feeling too while skimming through this: “Good luck with that…”

It’s not a knowledge problem. It’s a values and incentives problem.

eviks•2h ago
The irony of the article appearing in the "career" section, when following its advice means you won't have a career.
gwerbret•2h ago
> Stopping an experiment once you find a significant effect but before you reach your predetermined sample size is classic P hacking.

Although much of the article is basic common sense, and although I'm not a statistician, I had to seriously question the author's understanding of statistics at this point. The predetermined sample size (set by the power calculation) is usually based on an assumption about the effect size; if the effect size turns out to be much larger than you assumed, then a smaller sample size can be statistically sound.

Clinical trials very frequently do exactly this -- stop before they reach a predetermined sample size -- by design, once certain pre-defined thresholds have been passed. Other than not having to spend extra time and effort, the reasons are at least twofold: first, significant early evidence of futility means you no longer have to waste patients' time; second, early evidence of utility means you can move an effective treatment into practice that much sooner.

A classic example of this was with clinical trials evaluating the effect of circumcision on susceptibility to HIV infection; two separate trials were stopped early when interim analyses showed massive benefits of circumcision [0, 1].

In experimental studies, early evidence of efficacy doesn't mean you stop there, report your results, and go home; the typical approach, if the experiment is adequately powered, is to repeat it (three independent replicates is the informal gold standard).

[0]: https://pubmed.ncbi.nlm.nih.gov/17321310/

[1]: https://pubmed.ncbi.nlm.nih.gov/16231970/

coolcase•2h ago
Sounds like a variable-cost experiment. Each observation costs $x. Like an A/B split on Google ads. Why keep paying for A when you know B is better already?
rrr_oh_man•1h ago
Google Optimize used to tell you to let an experiment run for one to two weeks (?), exactly because early strong results tend not to hold up in the long run.

-> https://en.wikipedia.org/wiki/Regression_toward_the_mean

nialse•33m ago
Small samples have more variability than large samples and thus more often show spurious large effects.
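
A small demonstration of that point, assuming only numpy and using made-up data: draw null "effects" with 10 vs. 1000 observations per group and compare how often a large effect appears by chance.

    # Sketch: under a true null effect, small samples produce large spurious effects
    # far more often than large samples.
    import numpy as np

    rng = np.random.default_rng(3)
    for n in (10, 1000):
        a = rng.normal(0, 1, (5000, n))            # 5000 simulated experiments per sample size
        b = rng.normal(0, 1, (5000, n))
        effect = a.mean(axis=1) - b.mean(axis=1)   # true difference is zero
        big = np.mean(np.abs(effect) > 0.5)        # fraction showing a "large" effect
        print(n, f"{big:.1%}")                     # common at n = 10, essentially never at n = 1000
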
ekianjo•2h ago
There is another reason to keep clinical trials running as long as designed: to understand the safety and side-effect implications.
parpfish•1h ago
In lots of human studies, you can’t just stop at an arbitrary number of participants because you’ve counterbalanced manipulations to decorrelate potential confounders (e.g., which color stimulus is paired with reward, the order of trials).
hiddencost•1h ago
https://commons.m.wikimedia.org/wiki/File:P-hacking_by_early...

The author is absolutely correct. Early stopping is a classic form of p hacking. See attached image for an illustration.

If you want to be rigorous, you can define criteria for early stopping such that it's not, but they require relatively stronger evidence.

Clinical trials that stop early do so typically at predefined times with higher significance thresholds.

bjornsing•1h ago
There are of course statistical methods designed to support early stopping. But I don’t think you can use a regular p-test every day and decide to stop if p < 0.05. That’s something else.
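
A sketch of why naive peeking breaks the nominal error rate (hypothetical setup; numpy and scipy): check p after every new batch of data under a true null and stop at the first p < 0.05, then count how often an "effect" is declared.

    # Sketch: run a t-test after every new batch and stop at the first p < 0.05,
    # with no true effect present.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    experiments, batches, batch_size = 2000, 20, 10
    false_positives = 0
    for _ in range(experiments):
        a, b = [], []
        for _ in range(batches):
            a.extend(rng.normal(0, 1, batch_size))
            b.extend(rng.normal(0, 1, batch_size))
            if stats.ttest_ind(a, b).pvalue < 0.05:   # peek, then stop early on a "hit"
                false_positives += 1
                break
    print(f"{false_positives / experiments:.1%}")     # well above the nominal 5%
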
parpfish•2h ago
I was heavily encouraged to do what would later be called “p-hacking”, but it looked different from what they describe here. This article describes p-hacks for people that aren’t into math/stats. I always ended up p hacking because I was into stats methods.

Somebody would say “here’s an old dataset that didn’t work out, I bet you can use one of those new stats methods you’re always reading about to find a cool effect!”, and then the fishing expedition takes off.

A couple weeks later you show off some cool effects that your new cutting edge results were able to extract from an old, useless dataset.

But instead of saying “that’s good pilot data, let’s see if it holds up with a new experiment”, you’re told “you can publish that! Keep this up and maybe you’ll be lucky enough to get a job someday!”

notpushkin•1h ago
> You have full access to this article via your institution.

Huh. I’m not on a university connection or anything. Is it just open access?

spinf97•39m ago
> Ending the experiment too early

> Running experiments until you get a hit

But if I'm running an experiment, how do I know how many times to run it?

remus•21m ago
Before you start your experiment, you calculate how many samples you need based on the estimated effect size you're looking for and how small you want your confidence interval to be.

Small effect with high confidence => more samples

Big effect with low confidence => fewer samples
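
For concreteness, a back-of-the-envelope version of that sample-size calculation, using the standard normal approximation for a two-sample comparison (the effect sizes and error rates below are illustrative inputs, not recommendations):

    # Sketch: per-group sample size for a two-sample comparison (normal approximation).
    import math
    from scipy.stats import norm

    def n_per_group(effect_size, alpha=0.05, power=0.8):
        # effect_size is Cohen's d: the mean difference in units of the standard deviation.
        z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
        z_beta = norm.ppf(power)            # desired power
        return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

    print(n_per_group(0.2))   # small effect -> ~393 samples per group
    print(n_per_group(0.8))   # large effect -> ~25 samples per group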

Firefox Moves to GitHub

https://github.com/mozilla-firefox/firefox
55•thefilmore•38m ago•7 comments

Fastvlm: Efficient vision encoding for vision language models

https://github.com/apple/ml-fastvlm
192•nhod•4h ago•37 comments

Open Hardware Ethernet Switch project, part 1

https://serd.es/2025/05/08/Switch-project-pt1.html
79•luu•3d ago•10 comments

TransMLA: Multi-head latent attention is all you need

https://arxiv.org/abs/2502.07864
25•ocean_moist•2h ago•1 comment

Air Traffic Control

https://computer.rip/2025-05-11-air-traffic-control.html
132•1317•1d ago•34 comments

15 Years of Shader Minification

https://www.ctrl-alt-test.fr/2025/15-years-of-shader-minification/
28•laurentlb•2d ago•3 comments

The Barbican

https://arslan.io/2025/05/12/barbican-estate/
475•farslan•14h ago•162 comments

A conversation about AI for science with Jason Pruet

https://www.lanl.gov/media/publications/1663/0125-qa-jason-pruet
140•LAsteNERD•10h ago•113 comments

Can you trust that permission pop-up on macOS?

https://wts.dev/posts/tcc-who/
239•nmgycombinator•11h ago•166 comments

RIP Usenix ATC

https://bcantrill.dtrace.org/2025/05/11/rip-usenix-atc/
156•joecobb•13h ago•34 comments

Understanding LucasArts' iMUSE System

https://github.com/meshula/LabMidi/blob/main/LabMuse/imuse-technical.md
95•todsacerdoti•7h ago•18 comments

HealthBench – An evaluation for AI systems and human health

https://openai.com/index/healthbench/
140•mfiguiere•12h ago•123 comments

NASA study reveals Venus crust surprise

https://science.nasa.gov/science-research/astromaterials/nasa-study-reveals-venus-crust-surprise/
62•mnem•3d ago•67 comments

Launch HN: ParaQuery (YC X25) – GPU Accelerated Spark/SQL

111•winwang•14h ago•66 comments

A community-led fork of Organic Maps

https://www.comaps.app/news/2025-05-12/3/
289•maelito•18h ago•190 comments

University of Texas-led team solves a big problem for fusion energy

https://news.utexas.edu/2025/05/05/university-of-texas-led-team-solves-a-big-problem-for-fusion-energy/
231•signa11•17h ago•157 comments

Reviving a modular cargo bike design from the 1930s

https://www.core77.com/posts/136773/Reviving-a-Modular-Cargo-Bike-Design-from-the-1930s
158•surprisetalk•15h ago•130 comments

Ruby 3.5 Feature: Namespace on read

https://bugs.ruby-lang.org/issues/21311
192•ksec•16h ago•91 comments

Wtfis: Passive hostname, domain and IP lookup tool for non-robots

https://github.com/pirxthepilot/wtfis
75•todsacerdoti•7h ago•4 comments

Policy of Transience

https://www.chiark.greenend.org.uk/~sgtatham/quasiblog/transience/
22•pekim•2d ago•0 comments

FedRAMP 20x – One Month in and Moving Fast

https://www.fedramp.gov/2025-04-24-fedramp-20x-one-month-in-and-moving-fast/
72•transpute•5h ago•50 comments

Writing N-body gravity simulations code in Python

https://alvinng4.github.io/grav_sim/5_steps_to_n_body_simulation/
96•dargscisyhp•2d ago•21 comments

Show HN: Lumoar – Free SOC 2 tool for SaaS startups

https://www.lumoar.com
70•asdxrfx•10h ago•28 comments

The Beam

https://www.erlang-solutions.com/blog/the-beam-erlangs-virtual-machine/
56•Alupis•3d ago•10 comments

Why the 737 MAX has been such a headache for Boeing

https://www.jalopnik.com/1853477/boeing-737-max-incidents-aircraft-problems/
22•cebert•2h ago•34 comments

Demonstrably Secure Software Supply Chains with Nix

https://nixcademy.com/posts/secure-supply-chain-with-nix/
98•todsacerdoti•15h ago•58 comments

Universe expected to decay in 10⁷⁸ years, much sooner than previously thought

https://phys.org/news/2025-05-universe-decay-years-sooner-previously.html
193•pseudolus•20h ago•251 comments

Continuous glucose monitors reveal variable glucose responses to the same meals

https://examine.com/research-feed/study/1jjKq1/
174•Matrixik•2d ago•99 comments

System lets robots identify an object's properties through handling

https://news.mit.edu/2025/system-lets-robots-identify-objects-properties-through-handling-0508
4•mikhael•3d ago•1 comment

Build your own Siri locally and on-device

https://thehyperplane.substack.com/p/build-your-own-siri-locally-on-device
137•andreeamiclaus•10h ago•31 comments