DoctorOetker•19h ago
Is it not much simpler to parallelize by having different "readers" (using the same model parameters/weights) process different parts of the corpus in parallel? Reader A reads book A, while reader B reads book B, etc.?
Is there a deeper reason why more complicated parallelization as in the OP or the article it references is more desirable?
TimorousBestie•57m ago
The single-thread performance of the parallel prefix sum that they use is O(N log N), so the improvement from that to O(log N) on N threads is not as surprising.
The way the headline is written, it sounds like Amdahl’s law was violated. It wasn’t, of course.
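A minimal single-threaded Python sketch of this trade-off (my own illustration; the Hillis-Steele-style formulation and the names are assumptions, not necessarily what the article uses): the plain loop does O(N) work but has a dependency chain of length N, while the log-stepped scan does O(N log N) work in ceil(log2 N) rounds, each of which can be done fully in parallel, giving O(log N) depth with N processors.

```python
def sequential_scan(xs):
    """Ordinary inclusive prefix sum: N-1 additions, but each step
    depends on the previous one, so the dependency chain has length N."""
    out = []
    acc = 0
    for x in xs:
        acc += x          # out[i] depends on out[i-1]
        out.append(acc)
    return out

def hillis_steele_scan(xs):
    """Hillis-Steele-style inclusive scan: ceil(log2 N) rounds. Within a
    round, every element can be updated independently, so with N processors
    each round is one parallel step -> O(log N) depth, O(N log N) total work."""
    out = list(xs)
    n = len(out)
    offset = 1
    while offset < n:
        prev = list(out)                       # snapshot of the previous round
        for i in range(offset, n):             # this loop is the parallelizable part
            out[i] = prev[i] + prev[i - offset]
        offset *= 2
    return out

xs = [3, 1, 4, 1, 5, 9, 2, 6]
assert sequential_scan(xs) == hillis_steele_scan(xs)
```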
casta•40m ago
How's the prefix sum on a single thread O(N log(N))? Isn't it trivially O(N)? It's just a for loop.
TimorousBestie•13m ago
Yes, but the for loop comes with data dependencies that prevent it from being parallelized trivially: each partial sum depends on the previous one.
The algorithm with fewer data dependencies does O(N log N) work in total.
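To make the work difference concrete, here is a toy count of the additions each version performs (again just an illustration, under the same assumptions as the sketch above, not the article's code): the plain loop does N - 1 additions, the log-stepped scan roughly N*log2(N), and that extra work is the price of shrinking the dependency chain from length N to length ceil(log2 N).

```python
def count_adds(n):
    """Count additions performed by each scan on an input of length n."""
    seq_adds = n - 1                 # one add per element after the first
    par_adds = 0
    offset = 1
    while offset < n:                # ceil(log2 n) rounds
        par_adds += n - offset       # adds done in this round
        offset *= 2
    return seq_adds, par_adds

for n in (8, 1024, 1_000_000):
    seq, par = count_adds(n)
    print(f"n={n}: sequential adds={seq}, log-step scan adds={par}")
```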