`XOR[0...n] = 0 ^ 1 .... ^ n = [n, 1, n + 1, 0][n % 4]`
XOR[0...x] = (x&1^(x&2)>>1)+x*(~x&1)
Actually I found something through Gemini based on the mod-4 table idea in the previous post. Thanks.
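A quick brute-force check (Python) that both closed forms match a running XOR:

# check the closed forms against a running XOR
acc = 0
for x in range(1000):
    acc ^= x                                         # acc == 0 ^ 1 ^ ... ^ x
    table = [x, 1, x + 1, 0][x % 4]                  # the mod-4 table
    branchless = (x & 1 ^ (x & 2) >> 1) + x * (~x & 1)
    assert acc == table == branchless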
If you’re working on an architecture where a single multiplication and a bit shift are cheaper than N XORs, and where XOR, add, and sub all cost the same, then you can get a performance win by computing the sum as N(N+1)/2 instead; and you don’t need a blog post to understand why that works.
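i.e. something like this minimal sketch in Python (the example data is made up, assuming A holds 1..n with one value missing):

# sum-based version of the same trick
n = 10
A = [1, 2, 3, 4, 5, 7, 8, 9, 10]           # 6 is missing
missing = n * (n + 1) // 2 - sum(A)         # expected total minus actual total
assert missing == 6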
[n, 1, n+1, 0][n%4]
(https://en.wikipedia.org/wiki/Abelian_group -- I'll use ⋆ as the Abelian group's operation, and ~ for inversion, below)
I believe you are implying:
(g(1) ⋆ ... ⋆ g(n)) ⋆ ~(g(i(1)) ⋆ g(i(2)) ⋆ ... ⋆ g(i(n-1))) = g(m)
where "m" is the group element index that is missing from "i".
However, for this to work, it is required that you can distribute the inversion ~ over the group operation ⋆, like this:
~(g(i(1)) ⋆ g(i(2)) ⋆ ... ⋆ g(i(n-1))) = ~g(i(1)) ⋆ ~g(i(2)) ⋆ ... ⋆ ~g(i(n-1))
because it is only after this step (i.e., after the distribution) that you can exploit the associativity and commutativity of operation ⋆, and reorder the elements in
g(1) ⋆ ... ⋆ g(n) ⋆ ~g(i(1)) ⋆ ~g(i(2)) ⋆ ... ⋆ ~g(i(n-1))
such that they pairwise cancel out, and leave only the "unmatched" (missing) element -- g(m)
However, where is it stated that inversion ~ can be distributed over group operation ⋆? The above wikipedia article does not spell that out as an axiom.
Wikipedia does mention "antidistributivity":
https://en.wikipedia.org/wiki/Distributive_property#Antidist...
(which does imply the distributivity in question here, once we restore commutativity); however, WP says this property is indeed used as an axiom ("in the more general context of a semigroup with involution"). So why is it not spelled out as one for Abelian groups?
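(For what it's worth, I think the property follows from the listed axioms anyway; a sketch, writing e for the identity:

(a ⋆ b) ⋆ (~b ⋆ ~a) = a ⋆ (b ⋆ ~b) ⋆ ~a = a ⋆ ~a = e
so ~(a ⋆ b) = ~b ⋆ ~a, and with commutativity ~(a ⋆ b) = ~a ⋆ ~b;
applying this element by element gives the n-term version above.)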
I probably would have written it with a single loop, using the `enumerate` iterator adapter. But in Python, two loops is almost certainly more efficient.
Having only one loop gives you a better memory access pattern, because it's 2 XOR operations in between each memory access. Two loops is the same number of instructions, but it spends one loop ignoring memory and then another loop doing rapid-fire memory accesses. In python there's enough overhead that it's unlikely to matter. But in a faster language running on a busy machine that could make a real difference.
Whether it's a win in Python to use one or two loops isn't so clear, as a lot is hidden behind complex opcodes and opaque iterator implementations. Empirical testing might help, but a new interpreter version could change your results.
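For example, a rough sketch (absolute numbers will vary by CPython version and machine; the full 1..n list is just to compare loop shapes, not the actual missing-number setup):

import timeit

A = list(range(1, 100_001))
n = len(A)

def two_loops():
    result = 0
    for value in range(1, n + 1):
        result ^= value
    for value in A:
        result ^= value
    return result

def one_loop():
    result = 0
    for i, value in enumerate(A, start=1):
        result ^= i
        result ^= value
    return result

print(timeit.timeit(two_loops, number=100))
print(timeit.timeit(one_loop, number=100))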
In any case, if we want to nitpick over performance we should be insisting on a parallel implementation to take advantage of the gobs of cores CPUs now have, but now we're on a micro-optimisation crusade and are ignoring the whole point of the article.
You seem to believe that "O(2n)"
for value in range(1, n + 1):
    result ^= value
for value in A:
    result ^= value
is slower than "O(n*2)"
for value in range(1, n + 1):
    result ^= value
    result ^= A[value-1]
simply because the latter has one "for" loop fewer. Am I misunderstanding you, or if not, why would this matter for speed?

A "for" loop in Python isn't particularly cheap. It compiles to some static overhead to set up the iterator; then each loop iteration compiles to the "FOR_ITER" opcode, a "STORE_FAST" opcode to assign the iteration value to a variable, and the body of the loop.
"FOR_ITER" calls the "__next__()" method of the iterator (which is on top of the interpreter object stack), catches the StopIteration exception to know when to terminate the loop (by jumping past the loop body), and stores the iterator value to the top of the stack. What "__next__()" does is totally opaque - we don't know what kind of object A is - but since we've added the overhead of a function call already it wouldn't matter if it was a super tight bit of machine code, we're already paying a (relatively) hefty runtime cost.
A particularly bad implementation of "__next__()" for some custom iterable collection might be so stupid as to walk through the collection until it reaches the current index's item and returns that, so "for value in A" could in fact be O(n^2).
Plus, "result ^= A[value-1]" is substantially more work than "result ^= value", so even just on the loop bodies the two examples aren't very similar at all. Evaluating "A[value-1]" may wind up calling a "__getitem__()" method on A.
If A is, say, a linked list or binary tree, iterating it is very cheap but indexing it is O(n), so the second loop might be O(n^2) where the first is O(n).
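A toy illustration (a hypothetical class, just to make the point):

# toy singly linked list: iterating is cheap, indexing walks from the head
class LinkedList:
    def __init__(self, values):
        self.head = None
        for v in reversed(values):
            self.head = (v, self.head)    # each node is a (value, next) pair

    def __iter__(self):                   # O(1) work per element
        node = self.head
        while node is not None:
            yield node[0]
            node = node[1]

    def __getitem__(self, index):         # O(index) work on every single call
        node = self.head
        for _ in range(index):
            node = node[1]
        return node[0]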
So maybe we should be a bit more Pythonic, and do:
for i, value in enumerate(A):
    result ^= i
    result ^= value
One loop, no indexing of A! But we've not actually saved anything: the __next__() method of enumerate's iterator will increment its index then call the __next__() method of A's iterator, (approximately) the same work as if we'd done two FOR_ITER, one for an index and one for A.

Why would this matter for speed? I don't know. Unless 'n' is pretty big a human won't even notice the execution time of any of this code.
Each iteration of a for loop performs one index update and one termination comparison. For a simple body that is just an XOR, that's the difference between performing 5 operations per element (update, exit check, read array, XOR with value, XOR with index) in the one-loop case versus 7 operations (update, exit, read array, XOR with value, then update, exit, XOR with index) in the two-loop case. So we're looking at a roughly 29% saving in operations.
It gets worse if the looping structure does not optimize to a raw, most basic for loop and instead constructs some kind of lazy collection iterator generalized for all kinds of collections it could iterate over.
The smaller the loop body, the higher the gains from optimizing the looping construct itself.
So you don't actually need the first loop (at least for the set of integers 1..n example), but bringing that up is probably out of scope for this article.
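Something like this sketch, using the mod-4 closed form from upthread:

from functools import reduce
from operator import xor

def missing_number(A, n):
    return [n, 1, n + 1, 0][n % 4] ^ reduce(xor, A, 0)   # XOR(0..n) ^ XOR(A)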
It's all theoretical though. On real world data sets that aren't small I don't see why you wouldn't hand these tasks off to C/Zig/Rust unless you're only running them once or twice.
The epitome of turning technical interviews into a trivia contest to make them feel smart. Because isn't that the point of a tech interview?
I haven't even read the article so I don't know what this is about really but if an interviewer seriously asked me about some obscure xor trick I'd laugh at them.
I believe you underestimate what a good interviewer is trying to do with questions such as these:
Either you've seen the trick before and you get an opportunity to show the interviewer that you're an honest person by telling him you have. Huge plus and the interview can move on to other topics.
Or you haven't, and you can demonstrate to the interviewer your analytical skills by dissecting the problem step by step and understanding what the code actually does and how.
Bonus if you can see the potential aliasing problem when used to swap two variables.
Not a trivia question at all.
And sometimes even faster than a load immediate, hence XOR AX, AX instead of MOV AX, 0.
Shorter usually means faster, even if the instruction itself isn't faster.
> Shorter usually means faster
It depends, so spouting generalities doesn't mean anything. Instruction cache line filling vs. cycle reduction vs. reservation station ordering is typically a constrained-optimization problem for the compiler.
Because Arm64 has a zero register, and Arm32 has small immediates, and all instructions are uniformly long.
Shorter basically means you can fit more in instruction cache, which should in theory improve performance marginally.
IIRC, Intel said a mov was the way to go for some now ancient x86 CPUs, though.
That is, one should also prove a ^ (b ^ c) = (a ^ b) ^ c. Instinctive, but non-trivial.
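(The single-bit case can at least be checked exhaustively, and XOR on integers acts bit by bit:)

# exhaustive check of associativity for single bits
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert a ^ (b ^ c) == (a ^ b) ^ c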
> If more than two elements are missing (or duplicated), then analyzing the individual bits fails because there are several combinations possible for both 0 and 1 as results. The problem then seems to require more complex solutions, which are not based on XOR anymore.
If you consider XOR to be a little bit more general, I think you can still use something like the partitioning algorithm. That is to say, considering that XOR on a bit level behaves like XOR_bit(a,b) = (a+b) % 2, you might consider a generalized XOR_bit(a,b,k) = (a+b) % k. With this I think you can decide partitions with up to k missing numbers, but I'm too tired to verify/implement this right now.
Basically xor swapping a[i] with a[j] triggered the evil logic when i was equal to j.
The state of RC4 consists of a random permutation of bytes. Whenever it outputs a value, it further permutes the state by swapping some bytes of the state. The xor swap trick sets a value to zero whenever RC4 attempts to swap an item of the permutation with itself. This gradually zeros out the state, until RC4 outputs the plaintext.
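In Python terms, the failure mode looks like this:

# the aliasing trap: xor-swapping a[i] with itself zeroes it out
def xor_swap(a, i, j):
    a[i] ^= a[j]
    a[j] ^= a[i]
    a[i] ^= a[j]

a = [5, 7, 9]
xor_swap(a, 1, 1)     # i == j: the first a[1] ^= a[1] makes it 0, and it stays 0
print(a)              # [5, 0, 9]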
> We can thus search for u by applying this idea to one of the partitions and finding the missing element, and then find v by applying it to the other partition.
Since you already have u^v, you need only search for u, which immediately gives you v.
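A sketch of that shortcut in Python:

# find two missing numbers u, v from 1..n
def two_missing(A, n):
    uv = 0
    for x in range(1, n + 1):
        uv ^= x
    for x in A:
        uv ^= x                  # uv == u ^ v
    bit = uv & -uv               # a bit where u and v differ
    u = 0
    for x in range(1, n + 1):
        if x & bit:
            u ^= x
    for x in A:
        if x & bit:
            u ^= x               # u = the missing number that has this bit set
    return u, u ^ uv             # v follows for free from u ^ (u ^ v)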
They're really evil on modern CPUs.
Example 1: Find the missing number
xor =: (16 + 2b0110) b.
iota1000 =: (i. 1000)
missingNumber =: (xor/ iota1000) xor (xor/ iota1000 -. 129)
echo 'The missing number is ' , ": missingNumber
This prints 'The missing number is 129'.

Example 2: Using a random permutation, find the missing number.
permuted =: (1000 ? 1000)
missingNumber =: (xor/ permuted) xor (xor/ permuted -. ? 1000)
Example 3: find the missing number in this matrix.

   _ (< 2 2) } 5 5 $ (25 ? 25)
12 9 1 20 19
6 18 3 4 8
24 7 _ 15 23
11 21 10 2 5
0 16 17 22 14
Final test: repeat example 3 ten times (random matrices) and collect the time it takes you to solve each one in a list of times, then compute the linear-regression best fit with times %. (1 ,. i. 10)
Did you get better at solving it by playing more times?

I am not affiliated with J, but in case you want to try some J code there is a playground: https://jsoftware.github.io/j-playground/bin/html2/
Edited: It seems I am procrastinating a lot about something I have to do but don't want to.
This is nonsensical; where does the second truth table come from? Instead you just observe that, by definition, 1^0 == 0^1.
The XOR solution was a valid answer, but not the only answer we would have happily accepted.
The interview question was chosen such that it's very easy to understand and quick to solve, meaning it would indicate the candidate knew at least the basics of programming in C#. Almost surprisingly, we actually had candidates applying for "senior" level positions who struggled with this.
It could be solved in a multitude of ways, e.g:
- XOR as above
- Use of a HashSet<int>
- Use a for loop and a List which contains each number and its count.
- Use LINQ to group the numbers or something and then find the one with the count.
As long as what they did worked, it was a "valid" answer, we could then often discuss the chosen solution with the candidate and see how they reacted when we let them know of other valid solutions.
It was really great for not being a "one clever trick" question and could act as a springboard to slightly deeper discussions into their technical thought processes and understanding.
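For instance, the set-based version in Python terms (assuming the question was the usual "one value missing from 1..n" variant) is just:

def missing_number(A, n):
    return (set(range(1, n + 1)) - set(A)).pop()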
(n & ((n & 1) - 1)) + ((n ^ (n >> 1)) & 1)
Or a much more readable version [ n, 1, n + 1, 0 ][n % 4]
which makes it clear that this function cycles through a pattern of length four.

Why this works can be seen if we start with some n that is divisible by four, i.e. has the two least significant bits clear, and then keep XORing it with its successors. We start with xxxxxx00, which is our n. Then we XOR it with n + 1, which is xxxxxx01; that clears all the x's and leaves us with 00000001. Now we XOR it with n + 2, which is xxxxxx10; that yields xxxxxx11, which is n + 3. The cycle finishes when we now XOR it with n + 3, which yields 00000000.
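(A quick sanity check of the branch-free formula against a running XOR, in Python:)

acc = 0
for n in range(1, 1000):
    acc ^= n      # acc == 1 ^ 2 ^ ... ^ n, which equals 0 ^ 1 ^ ... ^ n
    assert acc == (n & ((n & 1) - 1)) + ((n ^ (n >> 1)) & 1)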