The reason I'm not putting % signs on there is that, until we normalize, those are measures and not probabilities. What that means is that an event which has a 16% chance of happening in the entire universe of possibilities has an "area" or "volume" (the strictly correct term being measure) of 0.16. Once we zoom in to a smaller subset of events, it no longer has a probability of 16%, but the measure remains unchanged.
In this previous comment I gave a longer explanation of the intuition behind measure theory and linked to some resources on YouTube.
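To make that concrete with numbers from this puzzle (my own worked example, not from the linked comment): the block where Alice is truthful and Bob lies has measure 0.8 * 0.2 = 0.16 in the full square. Zoom in to the sub-universe where the two disagree and that block keeps its measure, but its probability gets renormalized by the measure of the new universe:

  measure_alice_truth_bob_lie = 0.8 * 0.2           # 0.16, unchanged by conditioning
  measure_disagree = 0.8 * 0.2 + 0.2 * 0.8          # 0.32, the sub-universe we zoom in on
  print(measure_alice_truth_bob_lie / measure_disagree)  # 0.5, the conditional probability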
I think there's an annoying thing where by saying "hey, here's this neat problem, what's the answer" I've made you much more likely to actually get the answer!
What I really wanted to do was transfer the experience of writing a simulation for a related problem, observing this result, assuming I had a bug in my code, and then being delighted when I did the math. But unfortunately I don't know how to transfer that experience over the internet :(
(to be clear, I'm totally happy you wrote out the probabilities and got it right! Just expressing something I was thinking about back when I wrote this blog)
As I was writing the simulation I realized my error. I finished the simulation anyway, just because, and it has the expected 80% result on both of them.
My error: when we trust "both" we're also trusting Alice, which means that my case was exactly the same as just trusting Alice.
PS: as I was writing the simulation I did a small sanity test of 9 flips: I got heads 9 times in a row (so I tried it again with 100 million and it was a ~50-50 split). There goes my chance of winning the lottery!
I wrote a quick colab to help visualize this, adds a little intuition for what's happening: https://colab.research.google.com/drive/1EytLeBfAoOAanVNFnWQ...
        F  T (Alice)
    F   xx ????????
        xx ????????
    T   ?? vvvvvvvv
        ?? vvvvvvvv
 ^      ?? vvvvvvvv
 B      ?? vvvvvvvv
 o      ?? vvvvvvvv
 b      ?? vvvvvvvv
 v      ?? vvvvvvvv
        ?? vvvvvvvv
(where "F" describes cases where the specified person tells you a Falsehood, and "T" labels the cases of that person telling you the Truth)In the check-mark (v) region, you get the right answer regardless; they are both being truthful, and of course you trust them when they agree. Similarly you get the wrong answer regardless in the x region.
In the ? region you are no better than a coin flip, regardless of your strategy. If you unconditionally trust Alice then you win on the right-hand side, and lose on the left-hand side; and whatever Bob says is irrelevant. The situation for unconditionally trusting Bob is symmetrical (of course it is; they both act according to the same rules, on the same information). If you choose any other strategy, you still have a 50-50 chance, since Alice and Bob disagree and there is no reason to choose one over the other.
Since your odds don't change with your strategy in any of those regions of the probability space, they don't change overall.
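A brute-force check of this argument (my own sketch, exact enumeration rather than simulation): sum over the coin and over who lies, and compare a few strategies; each one lands on 80%.

  from itertools import product

  P_LIE = 0.2
  FLIP = {'H': 'T', 'T': 'H'}

  def win_probability(strategy):
      # Exact win probability: sum over the coin and over which of the two lies.
      total = 0.0
      for truth, alice_lies, bob_lies in product('HT', (False, True), (False, True)):
          prob = 0.5 * (P_LIE if alice_lies else 1 - P_LIE) * (P_LIE if bob_lies else 1 - P_LIE)
          alice_says = FLIP[truth] if alice_lies else truth
          bob_says = FLIP[truth] if bob_lies else truth
          total += prob * (strategy(alice_says, bob_says) == truth)
      return total

  print(win_probability(lambda a, b: a))                     # always trust Alice -> 0.8
  print(win_probability(lambda a, b: b))                     # always trust Bob -> 0.8
  print(win_probability(lambda a, b: a if a == b else 'H'))  # trust agreement, else guess heads -> 0.8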
When it comes to the wisdom of crowds, see https://egtheory.wordpress.com/2014/01/30/two-heads-are-bett...
In the general case of n intermediate occasional liars, the odds of the final report being accurate go to 50% as n grows large, which makes sense: it no longer has any correlation to the initial input.
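A quick closed-form check of that limit (my own sketch): with n relays each independently flipping the message with probability 0.2, the final report is correct exactly when an even number of them lied, which works out to (1 + 0.6^n) / 2. For n = 2 that matches the 68% the simulation just below produces, and it decays to 50% as n grows.

  def chain_accuracy(n, p_lie=0.2):
      # P(even number of the n relays lied) = (1 + (1 - 2*p_lie)**n) / 2
      return (1 + (1 - 2 * p_lie) ** n) / 2

  for n in (1, 2, 5, 10, 20, 50):
      print(n, round(chain_accuracy(n), 4))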
import random

def lying_flippers(num_flips=1_000_000):
    """
    - Bob flips a coin and tells Alice the result but lies 20% of the time.
    - Alice tells me Bob's result but also lies 20% of the time.
    - If I trust Bob, I know I'll be correct 80% of the time.
    - If I trust Alice, how often will I be correct (assuming I don't
      know Bob's result)?
    """
    # Invert flip 20% of the time.
    def maybe_flip_flip(flip: bool):
        if random.random() < 0.2:
            return not flip
        return flip

    def sum_correct(actual, altered):
        return sum(1 if a == b else 0 for (b, a) in zip(actual, altered))

    actual_flips = [random.choice((True, False)) for _ in range(num_flips)]
    num_heads = sum(actual_flips)
    num_tails = num_flips - num_heads
    print(f"Heads = {num_heads} Tails = {num_tails}")

    bob_flips = [maybe_flip_flip(flip) for flip in actual_flips]
    alice_flips = [maybe_flip_flip(flip) for flip in bob_flips]

    bob_num_correct = sum_correct(actual_flips, bob_flips)
    bob_percent_correct = bob_num_correct / num_flips
    alice_num_correct = sum_correct(actual_flips, alice_flips)
    alice_percent_correct = alice_num_correct / num_flips

    # Trusting Bob should lead to being correct ~80% of the time.
    # This is just a verification of the model since we already know the answer.
    print(f"Trust Bob -> {bob_percent_correct:.1%}")
    # Trusting Alice should lead to being correct ?% of the time.
    # This model produces 68% (= 0.8 * 0.8 + 0.2 * 0.2: Alice's report is only
    # right when neither of them lies or both of them lie).
    print(f"Trust Alice -> {alice_percent_correct:.1%}")
    print()

lying_flippers()

Slide Alice's accuracy down to 99% and, again, if you don't trust Alice, you're no better off trusting Bob.
Interestingly, this result also depends on their lies being independent. If Bob told the truth 20% of the time that Alice told a lie, or if Bob simply copied Alice's response 20% of the time and otherwise told the truth, then the maths would be different.
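A rough sketch of that second variant as I read it (Alice watches the coin and lies 20% of the time; Bob copies Alice's response 20% of the time and otherwise reports the truth); under that correlation the strategies no longer all tie at 80%:

  import random

  def correlated_variant(trials=1_000_000, p_copy=0.2, p_alice_lie=0.2):
      wins = {'trust Alice': 0, 'trust Bob': 0, 'agreement, coin flip otherwise': 0}
      for _ in range(trials):
          truth = random.choice((True, False))
          alice = (not truth) if random.random() < p_alice_lie else truth
          bob = alice if random.random() < p_copy else truth  # Bob's errors track Alice's
          wins['trust Alice'] += alice == truth
          wins['trust Bob'] += bob == truth
          guess = alice if alice == bob else random.choice((True, False))
          wins['agreement, coin flip otherwise'] += guess == truth
      for name, w in wins.items():
          print(f"{name}: {w / trials:.1%}")

  correlated_variant()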
But the chronometers will sync with each other if you don't store them apart, which would result in correlated noise that an average won't fix.
In a perfect world they drift less than a minute per day, and you're relatively close to the correct time either by averaging or just by picking one, knowing that you don't have massive time skew.
I believe this saying was first made about compasses, which also had mechanical failures. Having three lets you know which one failed. The same goes for mechanical watches, which can fail in inconsistent ways: slow one day and fast the next is problematic. The same goes for a compass that is wildly off; how do you know which one of the two is off?
A minute per day would be far too much drift for navigation, wouldn't it?
From Wikipedia [1]:
> For every four seconds that the time source is in error, the east–west position may be off by up to just over one nautical mile as the angular speed of Earth is latitude dependent.
That makes me think a minute might be your budget for an entire voyage? But I don't know much about navigation. And it is beside the point: your argument isn't changed if we put in a different constant, so I only mention it out of interest.
> Having three lets you know which one failed.
I guess I hadn't considered the case where it stops for a minute and then continues ticking steadily; then you would want to discard the measurement from the faulty watch.
But if I just bring one watch, as the expression counsels, isn't that even worse? I don't even know if it malfunctioned, and if it failed entirely I don't have any reference for the time at the port.
My interpretation had been that you look back and forth between the watches unable to make a decision, which doesn't matter if you always split the difference, but I see your point.
A well-serviced Rolex in 2026 with laser-cut gears drifts +/- 15 sec per day.
One with hand-filed gears is going to be +/- a minute on a good day, and that's what early navigation was using. I have watches with hand-filed gears and they can be a bit rough.
Prior to that, it was dead reckoning, dragging a string every now and again to calculate speed and heading and the current and then guesstimating your location on a twice daily basis.
Those two wildly inaccurate systems mapped most of the world for us.
Though not without significant errors, the most amusing to me being that islands had a tendency to multiply because different maps would be combined and the cartographer would mistake the same island on two maps as being separate islands due to errors. A weird case of aliasing I suppose.
Modern Rolex (and Omega et al) are more like +/-2s.
"The precision of a COSC-certified chronometer must be between -4 and +6 seconds per day."
15s per day is not unusual for a good mechanical watch.
Even that was much better than the dead-reckoning they had to do in bluewater before working chronometers were invented. Your ship's "position" would be a triangle that might have sides ten miles long at lower latitudes.
If the chronometer error rate is 1%, averaging two will give you a 2% error rate.
You wouldn't be well served by averaging a measurement with a 1% error and a measurement with a 90% error, but you will still have less than or equal to 90% error in the result.
If the errors are correlated, you could end up with a 1% error still. The degenerate case of this is averaging a measurement with itself. This is something clocks are especially prone to; if you do not inertially isolate them, they will sync up [1]. But that still doesn't result in a greater error.
You could introduce more error if you encountered precision issues. Eg, you used `(A+B)/2` instead of `A/2 + B/2`; because floating point has less precision for higher numbers, the former will introduce more rounding error. But that's not a function of the clocks, that's a numerics bug. (And this is normally encountered when averaging many measurements rather than two.)
There are different ways to define error but this is true whether you consider it to be MSE or variance.
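For the numerics point above, a tiny generic illustration (not clock-specific, and my own example): averaging many values with a naive running sum accumulates rounding error that a compensated sum avoids.

  import math

  readings = [0.1] * 10_000_000

  naive_mean = sum(readings) / len(readings)        # running sum rounds at every step
  exact_mean = math.fsum(readings) / len(readings)  # fsum tracks the lost low-order bits
  print(naive_mean, exact_mean, naive_mean - exact_mean)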
The average of a right and a wrong clock is wrong. Half as wrong as the wrong one, but still wrong.
Whether this is a good mental model for dealing with clock malfunctions depends on the failure modes of the clocks.
The result in the original article only applies when there are discrete choices. For stuff you can actually average, more is always better.
Oh, and even with discrete choices (like heads vs tails), if you had to give a distribution and not just a single highest-likelihood outcome, and we judged you by the cross-entropy, then going from one to two is an improvement. And going from odd n to the next even n is an improvement in general in this setting.
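A quick check of the one-vs-two claim under log-loss (my own numbers, assuming the 80% truth-tellers from the article): with one friend you report 0.8 on whatever they said; with two, you report 16/17 when they agree and 0.5 when they disagree. The expected cross-entropy drops.

  from math import log2

  p = 0.8

  # One friend: report 0.8 on their answer.
  one = p * -log2(p) + (1 - p) * -log2(1 - p)

  # Two friends: report the posterior 16/17 on agreement, 0.5 on disagreement.
  p_agree = p * p + (1 - p) * (1 - p)
  post = (p * p) / p_agree
  two = p_agree * (post * -log2(post) + (1 - post) * -log2(1 - post)) + (1 - p_agree) * 1.0

  print(one, two)  # ~0.72 bits vs ~0.54 bits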
Or bring only two, but step on one immediately, to get rid of the cursed pair situation, and also to get the clumsiness out of the way early. Old sailor's trick.
It's possible to navigate without being able to measure your longitude. Like if you're looking for an island, you should first navigate to the correct latitude and then sail along that latitude until you hit the island. The route is longer, obviously. But that's what you should do if your chronometers disagree.
This saying must originate with a landlubber...
I didn't do the math during the thinking pause, but my intuition was that a second liar makes it worse (more likely to end up in a 50-50 situation) and additional liars make it better, as you get to reduce noise.
Is there a scenario where the extra liar makes it worse, where you would be better off yelling lalalallala as they tell you the answer?
So much of this breaks down when the variables involved become continuous, or at least non-binary, rather than binary.
It's an example of a more general interest of mine, how structural characteristics of an inferential scenario affect the value of information that is received.
I could also see this being relevant to diagnostic scenarios hypothetically.
P(AAAA | A) = p^4
P(BBBB | A) = (1-p)^4
Anyway, the apparent strangeness of the tie case comes from the fact that the binomial PMF is symmetric under swapping k with n-k (where n is the number of participants) together with swapping p and 1-p: PMF = (n choose k) * p^k * (1-p)^(n-k)
So when k = n/2, the symmetry means that the likelihood is identical under p and 1-p, so we're not gaining any information. This is a really good illustration of that; interesting post! (edit: apparently i suck at formatting)

Instead of three independent signals, you'd evaluate: given how Alice and Bob usually interact, does their agreement/disagreement pattern here tell you something? (E.g., if they're habitual contrarians, their agreement is the signal, not their disagreement.)
Take it further: human + LLM collaboration, where you measure the ongoing conversational dynamics—tone shifts, productive vs. circular disagreement, what gets bypassed, how contradictions are handled. The quality of the collaborative process itself becomes your truth signal.
You're not just aggregating independent observations anymore; you're reading the substrate of the interaction. The conversational structure as diagnostic.
At 4 heads, just randomly select a jury of 3, and you're back on track.
At a million heads, just sum up all their guesses, divide by one million, and then check the over/under of 0.50.
He wrote:
> If our number N of friends is odd, our chances of guessing correctly don’t improve when we move to N+1 friends.
Either they agree, or they disagree.
If they agree, they're either both telling the truth or both lying. All you can do is go with what they agreed on. In this case, picking what they agreed on is the same as picking what one of them said (say, Alice).
If they disagree, then one is telling the truth and one is lying and you have no way to tell which. So just pick one, and it makes no difference if you pick the same one every time (say, Alice).
So you end up just listening to Alice all the time anyway.
Now replace "fail" with "lying" and you have the exact same problem.
Anyway: if a single observer who lies 20% of the time gives you 4 out of 5 bits correct, but you don't know which ones...
...and N such observers, where N > 2, give you a very good way of getting more information (best-of-3 voting etc.), to the limit, at infinite observers, of a perfect channel...
...then, interpolating for N = 2, there is more information here than for N = 1. It just needs more advanced coding to exploit.
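A small sketch of the best-of-N voting point (odd N only, to avoid ties; my own code, not from the article):

  from math import comb

  def majority_accuracy(n, p=0.8):
      # Probability that more than half of n independent 80%-accurate observers are right.
      return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

  for n in (1, 3, 5, 7, 9):
      print(n, round(majority_accuracy(n), 4))  # climbs from 0.8 toward 1.0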
One example of this is in airplanes.
Time critical scenarios are one possibility.
In a safety critical scenario intermittent sensor failure might be possible but keep in mind that consistent failure is too.
A jury scenario is presumably one of consistent failure. There's no reason to expect that an intentional liar would change his answer upon being asked again.
I think why it feels odd is that most people intuitively answer a different question. If you had to bet on an outcome, then Alice and Bob agreeing gives you more information. But here you're not dealing with that question: you're either right or wrong, and whether or not Alice & Bob agree, you're effectively "wagering the same" in both cases (where your wager is 0.8, the probability [or expectation] that one is correct).
Basically what you're doing is breaking down p(correct) = p(correct & agree) + p(correct & disagree) where former is 0.8*0.8 and latter is 0.8*0.2. Explicitly computing the conditional probability however makes calculating more difficult: p(correct | agree)*p(agree) + p(correct | disagree)*p(disagree). This is something like (16/17) * (0.8*0.8 + 0.2*0.2) + 0.5 * (0.8*0.2*2) which is not easy to arrive at intuitively unless you grind through the calculation.
So _conditioned_ on them agreeing you are right ~94%, while conditioned on them disagreeing it's a coin-toss (because when they disagree exactly one is right, and it's equally likely to be Alice or Bob). Interesting case where the unconditional probability is actually more intuitive and easier than the conditional.
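Grinding through those numbers (my own arithmetic, same assumptions as the post):

  P = 0.8                                    # each friend reports the truth 80% of the time
  p_agree = P * P + (1 - P) * (1 - P)        # 0.68
  p_correct_given_agree = (P * P) / p_agree  # 16/17 ~= 0.94: only "both truthful" is a win
  p_correct_given_disagree = 0.5             # exactly one is right, no way to tell which

  overall = p_correct_given_agree * p_agree + p_correct_given_disagree * (1 - p_agree)
  print(p_correct_given_agree, overall)      # ~0.94 conditional, 0.80 unconditional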
"A:T, B:T - chances - H 6.0% | T 94.0% | occurs 34.0% of the time"
By the simplest of math for unrelated events, the chance of both A & B lying about the coin is 20% of 20%, or .2 * .2 = 0.04, or 4.0% ...
The "Let's prove it" section contains the correct analysis, including that our chance of being correct is 80% with two friends.
The code output for three players is similarly flawed, and the analysis slightly misstates our chance of being correct as 90.0% (correctly: 89.6%).
Or am I missing something about the intent or output of the Python simulation?
But no, the Python output is correct (although I do round the values). It's counterintuitive, but these are two different questions:
1. What are the odds that both players lie? (4%)
2. Given that both players say tails, what are the odds that the coin is heads? (~6%)
Trivially, the answer for question (1) is 0.2 * 0.2 = 4%. The answer for question (2) is 0.02 / 0.34 ~= 6%.
One way of expressing this is Bayes' Rule: we want P(coin is heads | both say tails):
* we can compute this as (P(both say tails | coin is heads) * P(coin is heads)) / P(both say tails)
* P(both say tails | coin is heads) = 0.04 (both must lie)
* P(coin is heads) = 0.5
* P(both say tails) = 0.04 * 0.5 + 0.64 * 0.5 = 0.34
This gives us (0.04 * 0.5) / 0.34 = 0.02 / 0.34 ~= 6%.

I think that might not be convincing to you, so we can also just look at the results for a hypothetical simulation with 2000 flips:
* of those 2000 flips, 1000 are tails
* of those 1000 tails flips, 640 times both players tell the truth
* 40 times both players lie
* 680 times (640 + 40) both players *agree*
* 320 times the players disagree
We're talking about "the number of times they both lie divided by the number of times that they agree": 40 / 680 ~= 6%.
We go from 4% to 6% because the denominator changes. For the "how often do they both lie" case, our denominator is "all of our coin flips." For the "given that they both said tails, what are the odds that the coin is heads" case, our denominator is "all of the cases where they agreed" - a substantially smaller denominator!
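A quick way to see the two denominators in code (a rough sketch in the spirit of the article's simulation; note the coin's value doesn't even matter for this count):

  import random

  def lie_rate_vs_agreement(trials=1_000_000, p_lie=0.2):
      both_lied = agreed = 0
      for _ in range(trials):
          a_lies = random.random() < p_lie
          b_lies = random.random() < p_lie
          if a_lies == b_lies:        # they agree iff both lie or both tell the truth
              agreed += 1
              both_lied += a_lies
      print("both lied / all flips:  ", both_lied / trials)   # ~4%
      print("both lied / agreements: ", both_lied / agreed)   # ~6%

  lie_rate_vs_agreement()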
The three players example is just me rounding 89.6% to 90% to make the output shorter (all examples are rounded to two digits, otherwise I found that the output was too large to fit on many screens without horizontal scrolling).