But how does a simple act such as "pre-registration" change anything? It's not as if observing another metric that already existed changes anything about what you experimented with.
It's basically a variation on the multiple comparisons problem, but sneakier: it's easy to spend an hour going through data and, over that time, test dozens of different hypotheses. At that point, whatever p-value you'd compute for a single comparison isn't relevant, because after that many comparisons you'd expect at least one to come in under an uncorrected p = 0.05 by random chance alone.
The TLDR as I understand it is:
All data has patterns. If you look hard enough, you will find something.
How do you tell the difference between random variance and an actual pattern?
It’s simple and rigorously correct to only search the data for a single metric; other methods exist, e.g. the Bonferroni correction (divide your significance threshold by the number of comparisons, k; quick sketch after this list), but they are controversial (1).
Basically, are you a statistician? If not, sticking to the best practices in experimentation means your results are going to be meaningful.
If you see a pattern in another metric, run another experiment.
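To put rough numbers on the "dozens of hypotheses" point, here's a back-of-the-envelope sketch in Python (the 12 metrics are just an illustrative count, and this assumes the metrics are independent):

    alpha, k = 0.05, 12          # per-test threshold, number of metrics you eyeballed
    fwer = 1 - (1 - alpha) ** k  # chance of at least one false "win" across all of them
    bonferroni = alpha / k       # corrected per-test threshold
    print(f"family-wise error rate: {fwer:.2f}")               # ~0.46, nearly a coin flip
    print(f"Bonferroni per-test threshold: {bonferroni:.4f}")  # ~0.0042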
This would be Series B or later, right? I don't really feel like it's a core startup behavior.
It's worth emphasizing, though, that if your startup hasn't achieved product-market fit yet, this kind of thing is a huge waste of time! Build features, see if people use them.
There’s no reason to run A/B or multivariate (MVT) tests at all if you’re not doing them properly.
But does it, really? A lot of companies sell... well, let's say "not important" stuff. For most companies, getting it wrong doesn't cost people their lives. If you A/B test user signups for a startup that sells widgets, people aren't living or dying based on the results. The consequences of getting it wrong are... you sell fewer widgets?
While I understand the overall point of the post -- and agree with it! -- I do take issue with this particular point. A lot of companies are, arguably, _too rigorous_ when it comes to testing.
At my last company, we spent 6 weeks waiting for stat sig. But within 48 hours, we had a positive signal. Conversion was up! Not statistically significant, but trending in the direction we wanted. But to "maintain rigor," we waited 6 weeks before turning it on... and the final numbers were virtually the same as the 48-hour numbers.
Note: I'm not advocating stopping tests as soon as something trends in the right direction. The third scenario in the post points this out as a flaw! I do like their proposal for "peeking" and subsequent testing.
But, really, let's just be realistic about what level of "rigor" is required to make decisions. We aren't shooting rockets into space. We're shipping software. We can change things if we get them wrong. It's okay. The world won't end.
IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals. If its goals are "stat sig on every test," then sure, treat it like someone might die if you're wrong. (I would argue that you have the wrong goals, in this case, but I digress...)
But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.
The degree of care can be different in less critical contexts, but then you shouldn’t lie to yourself about how much you care.
But continue a percentage of A/B/n testing as well.
This allows you to balance speed vs. certainty.
This is especially useful for something where the value of the choice is front-loaded, like headlines.
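My reading of this, which may not be exactly what was meant: route a small exploration slice through a randomized A/B/n test while most traffic gets the current best variant. A minimal sketch, with the 10% slice and variant names invented:

    import random

    EXPLORE_FRACTION = 0.10        # tune to trade speed for certainty
    variants = ["A", "B", "C"]
    current_best = "B"             # whatever looks best so far

    def assign_variant() -> str:
        # A small slice keeps gathering evidence; the rest exploits the leader.
        if random.random() < EXPLORE_FRACTION:
            return random.choice(variants)
        return current_best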
I think we can realize another reason to just ship it. Startups need to be always moving. You need to keep turning the wheel to help keep everyone busy and keep them from fretting about your slow growth or high churn metrics. Startups need lots of fighting spirit. So it's still probably better to ship it rather than admit defeat and suffer bad vibes.
I could write pages on this (I’ve certainly spoken for hours) but the adoption of a scientific research mindset is very limiting for A/B testing. You don’t need all the status quo bias of null hypothesis testing.
At the same time, it’s quite impressive how people are able to adapt. An organization experienced with A/B testing will start doing things like multiple-comparison corrections in their heads.
For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there whether you realize it or not (people will end up interpreting p-values in light of prior evidence anyway); there’s a minimal sketch at the end of this comment.
0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in replication crisis) use much stricter values by default.
Adjust the confidence required to the cost of the change and the risk of harm. If you’re at the point of testing, the cost of change may be zero (content). It may be really high, it may be net negative!
But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05, anyway. This is easy to say, but don’t waste your time coming up with methods to squeeze out more signal. Just (just lol) make better changes to your product so that the methods don’t matter. If p=0.00001, that’s going to be a better signal than p=0.05 with every correction in this article.
If you’re going to pick any fanciness from the start (besides Bayes), make it anytime-valid methods. You’re certainly already going to be peeking (as you should), so have your data reflect that.
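For anyone who hasn't seen what "go Bayesian" looks like mechanically, here's a minimal Beta-Binomial sketch for a two-variant conversion test; the counts and the flat Beta(1, 1) priors are placeholders, swap in real priors if you have them:

    import numpy as np

    rng = np.random.default_rng(0)
    conv_a, n_a = 480, 10_000      # conversions / visitors for A (made up)
    conv_b, n_b = 530, 10_000      # conversions / visitors for B (made up)

    # Posterior draws under a Beta(1, 1) prior on each conversion rate
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

    print("P(B beats A):", (post_b > post_a).mean())
    print("Expected lift:", (post_b - post_a).mean())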
I'll add one from my experience as a PM dealing with very "testy" peers in early stage startups: don't do any of this if you don't have {enough} users -- rely on intuition and focus on the core product.
When choosing one of several A/B test options, a hypothesis test is not needed to validate the choice.
There are several approaches you can take to reduce that source of error:
Quarterly alpha ledger
Decide how much total risk you want this quarter (say 10%). Divide the remaining α by the number of experiments left and make that the threshold for the next launch. Forces the “is this button-color test worth 3% of our credibility?” conversation. More info: “Sequential Testing in Practice: Why Peeking Is a Problem and How to Fix It” (https://medium.com/@aisagescribe/sequential-testing-in-pract...).
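A toy version of the ledger bookkeeping, with made-up numbers:

    quarterly_alpha = 0.10    # total false-positive budget for the quarter
    spent = [0.02, 0.01]      # alpha already committed to launched experiments
    experiments_left = 7

    threshold = (quarterly_alpha - sum(spent)) / experiments_left
    print(f"p-value threshold for the next launch: {threshold:.3f}")  # 0.010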
Benjamini–Hochberg (BH) for metric sprawl
Once you watch a dozen KPIs, Bonferroni buries real lifts. BH ranks all the p-values at the end, then sets the cut so that, say, only 5% of declared winners are false positives. You keep power, and you can run the same BH step on the primary metric from every experiment each quarter to catch lucky launches. More info: “Controlling False Discoveries: A Guide to BH Correction in Experimentation” (https://www.statsig.com/perspectives/controlling-false-disco...).
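The BH step itself is only a few lines; here's a sketch of the standard procedure (not any particular vendor's implementation):

    import numpy as np

    def benjamini_hochberg(p_values, q=0.05):
        """Boolean mask of discoveries at false-discovery rate q."""
        p = np.asarray(p_values, dtype=float)
        m = len(p)
        order = np.argsort(p)
        # Largest rank k with p_(k) <= (k/m) * q; everything ranked up to k is a discovery.
        below = p[order] <= (np.arange(1, m + 1) / m) * q
        keep = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()
            keep[order[: k + 1]] = True
        return keep

    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.22, 0.60]))
    # -> [ True  True False False False False]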
Bayesian shrinkage + 5% “ghost” control for big fleets
FAANG-scale labs run hundreds of tests and care about 0.1% lifts. They pool everything in a simple hierarchical model; noisy effects get pulled toward the global mean, so only sturdy gains stay above water. Before launch, they sanity-check against a small slice of traffic that never saw any test. Cuts winner’s-curse inflation by ~30%. Clear explainer: “How We Avoid A/B Testing Errors with Shrinkage” (https://eng.wealthfront.com/2015/10/29/how-we-avoid-ab-testi...) and (https://www.statsig.com/perspectives/informed-bayesian-ab-te...).
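You don't need a full hierarchical model to see the idea. A crude empirical-Bayes version of the shrinkage, with all numbers invented:

    import numpy as np

    lifts = np.array([0.021, -0.004, 0.013, 0.045, 0.002])  # observed lift per test
    ses = np.array([0.010, 0.008, 0.012, 0.015, 0.009])     # standard error of each

    grand_mean = np.average(lifts, weights=1 / ses**2)
    # Crude between-test variance estimate, floored at zero
    tau2 = max(lifts.var(ddof=1) - np.mean(ses**2), 0.0)

    # Noisier estimates get pulled harder toward the global mean
    weight = tau2 / (tau2 + ses**2)
    shrunk = grand_mean + weight * (lifts - grand_mean)
    print(np.round(shrunk, 4))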
<10 tests a quarter: alpha ledger or yolo; dozens of tests and KPIs: BH; hundreds of live tests: shrinkage + ghost control.
Yes, but that's not really the big deal that you're making it out to be, since it's (usually) not an all-or-nothing thing. Usually, the wins are additive. The chance of each winner being genuine is still 95% (assuming no p-hacking), and so the expected number of wins out of those 5 will be 0.95 * 5 = 4.75 wins (by linearity of expectation), which is a solid win rate.
> After 9 peeks, the probability that at least one p-value dips below 0.05 is: 1 − (1 − 0.05)^9 = 37.
There should be a percent sign after that 37. (Probabilities cannot be greater than one.)
Did they generate this blog post with AI? That math be hallucinating. Don’t need a calculator to see that.
Hmm, (1 - 0.05)^9 also = 63%. No idea where the 64 comes from; the closest I can see is 1 - (0.95)^20 = 1 - (1 - 0.05)^20 = 64%.
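Quick check of the numbers being argued about here (this is the naive formula that assumes independent looks, which repeated peeks at the same accumulating data are not):

    for looks in (9, 20):
        print(looks, "looks:", round(1 - (1 - 0.05) ** looks, 3))
    # 9 looks: 0.37    (the 37% figure in the quote)
    # 20 looks: 0.642  (where a ~64% figure would come from)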
https://en.wikipedia.org/wiki/Sequential_probability_ratio_t...
That said, I agree with the other poster here about how important this really is for startups. It's critical to know if the drug really improves lung function; it's probably not critical to know whether the accent colour on your landing page should be mauve or aqua blue.