But how does a simple act such as "pre-registration" change anything? It's not as if observing another metric that already existed changes anything about what you experimented with.
It's basically a variation on the multiple comparisons problem, but sneakier: it's easy to spend an hour going through data and, over that time, test dozens of different hypotheses. At that point, whatever p-value you'd compute for a single comparison isn't relevant, because after that many comparisons you'd expect at least one to come in under an uncorrected p = 0.05 by random chance alone.
The TLDR as I understand it is:
All data has patterns. If you look hard enough, you will find something.
How do you tell the difference between random variance and an actual pattern?
It’s simple and rigorously correct to only search the data for a single metric; other methods exist, e.g. the Bonferroni correction (divide your significance threshold by the number of comparisons, k; quick sketch after this list), but they are controversial (1).
Basically, are you a statistician? If not, sticking to the best practices in experimentation means your results are going to be meaningful.
If you see a pattern in another metric, run another experiment.
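To put rough numbers on the "dozens of hypotheses" point, here's a back-of-the-envelope sketch in Python (the 12 metrics are just an illustrative count, and this assumes the metrics are independent):

    alpha, k = 0.05, 12          # per-test threshold, number of metrics you eyeballed
    fwer = 1 - (1 - alpha) ** k  # chance of at least one false "win" across all of them
    bonferroni = alpha / k       # corrected per-test threshold
    print(f"family-wise error rate: {fwer:.2f}")               # ~0.46, nearly a coin flip
    print(f"Bonferroni per-test threshold: {bonferroni:.4f}")  # ~0.0042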
This would be Series B or later, right? I don't really feel like it's a core startup behavior.
It's worth emphasizing, though, that if your startup hasn't achieved product-market fit yet, this kind of thing is a huge waste of time! Build features, see if people use them.
There’s no reason to run A/B or multivariate (MVT) tests at all if you’re not doing them properly.
But does it, really? A lot of companies sell... well, let's say "not important" stuff. For most companies, getting it wrong doesn't cost people their lives. If you A/B test user signups for a startup that sells widgets, people aren't living or dying based on the results. The consequences of getting it wrong are... you sell fewer widgets?
While I understand the overall point of the post -- and agree with it! -- I do take issue with this particular point. A lot of companies are, arguably, _too rigorous_ when it comes to testing.
At my last company, we spent 6 weeks waiting for stat sig. But within 48 hours, we had a positive signal. Conversion was up! Not statistically significant, but trending in the direction we wanted. But to "maintain rigor," we waited 6 weeks before turning it on... and the final numbers were virtually the same as the 48-hour numbers.
Note: I'm not advocating stopping tests as soon as something trends in the right direction. The third scenario in the post points this out as a flaw! I do like their proposal for "peeking" and subsequent testing.
But, really, let's just be realistic about what level of "rigor" is required to make decisions. We aren't shooting rockets into space. We're shipping software. We can change things if we get them wrong. It's okay. The world won't end.
IMO, the right framing here is: your startup deserves to be as rigorous as is necessary to achieve its goals. If its goals are "stat sig on every test," then sure, treat it like someone might die if you're wrong. (I would argue that you have the wrong goals, in this case, but I digress...)
But if your goals are "do no harm, see if we're heading in the right vector, and trust that you can pivot if it turns out you got a false positive," then you kind of explicitly don't need to treat it with the same rigor as a medical test.
The degree of care can be different in less critical contexts, but then you shouldn’t lie to yourself about how much you care.
But continue a percentage of A/B/n testing as well.
This allows you to balance speed vs. certainty.
This is especially useful for something where the value of the choice is front-loaded, like headlines.
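My reading of this, which may not be exactly what was meant: route a small exploration slice through a randomized A/B/n test while most traffic gets the current best variant. A minimal sketch, with the 10% slice and variant names invented:

    import random

    EXPLORE_FRACTION = 0.10        # tune to trade speed for certainty
    variants = ["A", "B", "C"]
    current_best = "B"             # whatever looks best so far

    def assign_variant() -> str:
        # A small slice keeps gathering evidence; the rest exploits the leader.
        if random.random() < EXPLORE_FRACTION:
            return random.choice(variants)
        return current_best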
I think we can realize another reason to just ship it. Startups need to be always moving. You need to keep turning the wheel to help keep everyone busy and keep them from fretting about your slow growth or high churn metrics. Startups need lots of fighting spirit. So it's still probably better to ship it rather than admit defeat and suffer bad vibes.
I could write pages on this (I’ve certainly spoken for hours) but the adoption of a scientific research mindset is very limiting for A/B testing. You don’t need all the status quo bias of null hypothesis testing.
At the same time, it’s quite impressive how people are able to adapt. An organization experienced with A/B testing will start doing things like multiple-comparison corrections in their heads.
For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there whether you realize it or not (people will end up interpreting p-values in light of prior evidence anyway); there’s a minimal sketch at the end of this comment.
0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in replication crisis) use much stricter values by default.
Adjust the confidence required to the cost of the change and the risk of harm. If you’re at the point of testing, the cost of change may be zero (content). It may be really high, it may be net negative!
But in most cases, at a startup, you should be going after wins that are way more impactful and end up having p-values lower than 0.05, anyway. This is easy to say, but don’t waste your time coming up with methods to squeeze out more signal. Just (just lol) make better changes to your product so that the methods don’t matter. If p=0.00001, that’s going to be a better signal than p=0.05 with every correction in this article.
If you’re going to pick any fanciness from the start (besides Bayes), make it anytime-valid methods. You’re certainly already going to be peeking (as you should), so have your data reflect that.
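For anyone who hasn't seen what "go Bayesian" looks like mechanically, here's a minimal Beta-Binomial sketch for a two-variant conversion test; the counts and the flat Beta(1, 1) priors are placeholders, swap in real priors if you have them:

    import numpy as np

    rng = np.random.default_rng(0)
    conv_a, n_a = 480, 10_000      # conversions / visitors for A (made up)
    conv_b, n_b = 530, 10_000      # conversions / visitors for B (made up)

    # Posterior draws under a Beta(1, 1) prior on each conversion rate
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

    print("P(B beats A):", (post_b > post_a).mean())
    print("Expected lift:", (post_b - post_a).mean())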
I'll add one from my experience as a PM dealing with very "testy" peers in early stage startups: don't do any of this if you don't have {enough} users -- rely on intuition and focus on the core product.
When choosing one of several A/B test options, a hypothesis test is not needed to validate the choice.
There are several approaches you can take to reduce that source of error:
Quarterly alpha ledger
Decide how much total risk you want this quarter (say 10%). Divide the remaining α by the number of experiments left and make that the threshold for the next launch. Forces the “is this button-color test worth 3% of our credibility?” conversation. More info: “Sequential Testing in Practice: Why Peeking Is a Problem and How to Fix It” (https://medium.com/@aisagescribe/sequential-testing-in-pract...).
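A toy version of the ledger bookkeeping, with made-up numbers:

    quarterly_alpha = 0.10    # total false-positive budget for the quarter
    spent = [0.02, 0.01]      # alpha already committed to launched experiments
    experiments_left = 7

    threshold = (quarterly_alpha - sum(spent)) / experiments_left
    print(f"p-value threshold for the next launch: {threshold:.3f}")  # 0.010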
Benjamini–Hochberg (BH) for metric sprawl
Once you watch a dozen KPIs, Bonferroni buries real lifts. BH ranks all the p-values at the end, then sets the cut so that, say, only 5% of declared winners are false positives. You keep power, and you can run the same BH step on the primary metric from every experiment each quarter to catch lucky launches. More info: “Controlling False Discoveries: A Guide to BH Correction in Experimentation” (https://www.statsig.com/perspectives/controlling-false-disco...).
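The BH step itself is only a few lines; here's a sketch of the standard procedure (not any particular vendor's implementation):

    import numpy as np

    def benjamini_hochberg(p_values, q=0.05):
        """Boolean mask of discoveries at false-discovery rate q."""
        p = np.asarray(p_values, dtype=float)
        m = len(p)
        order = np.argsort(p)
        # Largest rank k with p_(k) <= (k/m) * q; everything ranked up to k is a discovery.
        below = p[order] <= (np.arange(1, m + 1) / m) * q
        keep = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()
            keep[order[: k + 1]] = True
        return keep

    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.22, 0.60]))
    # -> [ True  True False False False False]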
Bayesian shrinkage + 5% “ghost” control for big fleets
FAANG-scale labs run hundreds of tests and care about 0.1% lifts. They pool everything in a simple hierarchical model; noisy effects get pulled toward the global mean, so only sturdy gains stay above water. Before launch, they sanity-check against a small slice of traffic that never saw any test. Cuts winner’s-curse inflation by ~30%. Clear explainer: “How We Avoid A/B Testing Errors with Shrinkage” (https://eng.wealthfront.com/2015/10/29/how-we-avoid-ab-testi...) and (https://www.statsig.com/perspectives/informed-bayesian-ab-te...).
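You don't need a full hierarchical model to see the idea. A crude empirical-Bayes version of the shrinkage, with all numbers invented:

    import numpy as np

    lifts = np.array([0.021, -0.004, 0.013, 0.045, 0.002])  # observed lift per test
    ses = np.array([0.010, 0.008, 0.012, 0.015, 0.009])     # standard error of each

    grand_mean = np.average(lifts, weights=1 / ses**2)
    # Crude between-test variance estimate, floored at zero
    tau2 = max(lifts.var(ddof=1) - np.mean(ses**2), 0.0)

    # Noisier estimates get pulled harder toward the global mean
    weight = tau2 / (tau2 + ses**2)
    shrunk = grand_mean + weight * (lifts - grand_mean)
    print(np.round(shrunk, 4))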
<10 tests a quarter: alpha ledger or yolo; dozens of tests and KPIs: BH; hundreds of live tests: shrinkage + ghost control.
Yes, but that's not really the big deal that you're making it out to be, since it's (usually) not an all-or-nothing thing. Usually, the wins are additive. The chance of each winner being genuine is still 95% (assuming no p-hacking), and so the expected number of wins out of those 5 will be 0.95 * 5 = 4.75 wins (by linearity of expectation), which is a solid win rate.
> After 9 peeks, the probability that at least one p-value dips below 0.05 is: 1 − (1 − 0.05)^9 = 37.
There should be a percent sign after that 37. (Probabilities cannot be greater than one.)
Did they generate this blog post with AI? That math be hallucinating. Don’t need a calculator to see that.
Hmm, (1 - 0.05)^9 also = 63%. No idea where the 64 comes from; the closest I can see is 1 - (0.95)^20 = 1 - (1 - 0.05)^20 = 64%.
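Quick check of the numbers being argued about here (this is the naive formula that assumes independent looks, which repeated peeks at the same accumulating data are not):

    for looks in (9, 20):
        print(looks, "looks:", round(1 - (1 - 0.05) ** looks, 3))
    # 9 looks: 0.37    (the 37% figure in the quote)
    # 20 looks: 0.642  (where a ~64% figure would come from)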
https://en.wikipedia.org/wiki/Sequential_probability_ratio_t...
That said, I agree with the other poster here about how important this really is for startups. It's critical to know if the drug really improves lung function; it's probably not critical to know whether the accent colour on your landing page should be mauve or aqua blue.