The issues presented in this article mostly relate to tensors of rank > 2. NumPy was originally just matrices, so it is not surprising that it has problems in this domain. A dedicated library like Torch is certainly better, but Torch is difficult in its own ways. IDK, I guess the author's conclusion that NumPy is “the worst array language other than all the other array languages” feels correct. Maybe a lack of imagination on my part.
I've done only very simple ML stuff using R and it never feels like exactly the right tool.
R is used mostly as a fancy calculator, which is fine in itself, but it makes the comparisons to general purpose languages like Python quite apples-to-oranges.
Doing the lower level stuff that NumPy is used for is really mindfuck territory in R.
Also: TIL you can apparently use multi-dimensional NumPy arrays as NumPy array indexers, and they don't just collapse into 1-dimensional iterables. I expected `A[:,i,j,:]` not to work, or to be the same as if `j` were just `(0, 1)`. But instead, it apparently causes transposition with the previous dimension... ?
Even code you write now, you may need to GPU accelerate later, as your simulations grow.
Falling back on loops defeats the entire reason for using NumPy in the first place.
But the point is still that the main purpose in building it was to be performant. To be accelerated. Even if that's not why you're personally using it.
I mean, I use my M4 Mac's Spotlight to do simple arithmetic. That's not the main point in building the M4 chip though.
(And that's before you even consider GPUs.)
It's not just Python adding interpreter overhead; you also risk creating a lot of temporary arrays, i.e. costly mallocs and memcpys.
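A minimal illustration of the temporary-array point (array sizes made up):

import numpy as np

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = np.random.rand(10_000_000)

# Each step below allocates a fresh ~80 MB array: one for (a * b),
# and another for (a * b) + c.
d = a * b + c

# Reusing a buffer via the ufunc `out=` argument avoids the extra
# allocations and copies:
d = np.empty_like(a)
np.multiply(a, b, out=d)
np.add(d, c, out=d)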
* Optimized GPU code
* CPU vectorized code
* Static CPU unvectorized code
* Dynamic CPU code
where the last one refers to the fact that a language like Python, in order to add two numbers together in its native, pure-Python mode, does a lot of boxing, unboxing, resolving of class types, checking for overrides, etc. Each of those levels is at least an order of magnitude slower than the next one up the hierarchy, and most of them appreciably more than one. As a back-of-the-envelope understanding, you're probably better off thinking of each gap as more like 1.5 orders of magnitude.
Using NumPy incorrectly can accidentally take you from the top one, all the way to the bottom one, in one fell swoop. That can be a big deal, real quick. Or real slow, as the case may be.
In more complicated scenarios, it matters how much computation is going how far down that hierarchy. If by "processing a video frame by frame" you mean something like "I wrote a for loop on the frames but all the math is still in NumPy", you've taken "iterating on frames" from the top to the bottom, but who cares, Python can iterate on even a million things plenty quickly, especially with everything else that is going on. If, by contrast, you mean that at some point you're iterating over each pixel in pure Python, you just fell all the way down that hierarchy for each pixel and you're in bigger trouble.
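To make that concrete, a toy sketch (frame counts and sizes made up):

import numpy as np

# Toy numbers: 10 frames of a hypothetical 720x1280 video.
frames = np.random.rand(10, 720, 1280)

# Python loops over frames, NumPy does the per-frame math:
# only 10 trips through the interpreter -- negligible.
out = np.empty_like(frames)
for t in range(frames.shape[0]):
    out[t] = np.clip(frames[t] * 1.2, 0.0, 1.0)

# Pure-Python loop over every pixel: ~9 million trips through the
# interpreter, each with boxing, dispatch, bounds checks, etc.
out2 = np.empty_like(frames)
for t in range(frames.shape[0]):
    for y in range(frames.shape[1]):
        for x in range(frames.shape[2]):
            out2[t, y, x] = min(frames[t, y, x] * 1.2, 1.0)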
In my opinionated opinion, the trouble isn't so much that it's possible to fall down that stack. That is arguably a feature, after all; surely we should have the capability of doing that sort of thing if we want. The problem is how easy it is to do without realizing it, just by using Python in what looks like perfectly sensible ways. If you aren't a systems engineer it can be hard to tell you've fallen, and even if you are, honestly, the docs don't make it particularly easy to figure out.
Though what does "static CPU" vs "dynamic CPU" mean? It's one thing to be pointer-chasing and missing the cache like OCaml can, it's another to be running a full interpreter loop to add two numbers like Python does.
It could be a 12-hour run vs. a 12,000,000-hour run.
NumPy is the lingua franca for storing and passing arrays in memory in Python.
Thank you NumPy!
Given what an inconsistent mess Matlab is outside of matrix manipulation, I'll take NumPy any day.
I have several gripes with NumPy, or more broadly the notion of using Python to call a C/asm library that vectorizes math operations. A lot of people speak of NumPy as if it's the solution to all your high-performance math needs in Python, which I think is disingenuous. The more custom logic you do, the less suitable the tools get. Pure Python numeric code is incredibly slow - something like 30× slower than Java - and as you find parts that can't be vectorized, you have to drop back down to pure Python.
I would like to give the simple example of the sieve of Eratosthenes:
def sieve_primeness(limit):
    result = [False] + [True] * limit
    for i in range(2, len(result)):
        if result[i]:
            for j in range(i * i, len(result), i):
                result[j] = False
    return result
This code is scalar, and porting it to a language like C/C++/Rust/Java gives decent performance straight out of the box. Python runs it at about 1/30 the speed, which is not acceptable.

People pretend that you can hand-wave the performance problem away by using NumPy. Please tell me how to vectorize this Python code. Go on, I'm waiting.
You can't vectorize the `if result[i]` check because it controls whether the inner for-loop executes, so it must execute in pure Python. For the inner loop, you can vectorize that by creating a huge temporary array and then ANDing it with the result array, but that is extremely memory-inefficient compared to flipping bits of the result array in place, and probably messes up the cache too.
Alternatively, you can run the code in PyPy, but that usually gives a speed-up of maybe 3×, which is nowhere near enough to recover the 30× speed penalty.
Another example is that NumPy is not going to help you implement your own bigint library, because that also requires a lot of custom logic that executes between loop iterations. Meanwhile, I've implemented bigint in pure Java with acceptable performance and without issues.
Again, my point is that anytime you venture into custom numerical/logical code that is not the simple `C = A + B`, you enter a world of pain when it comes to Python performance or vectorization.
import numpy as np

def sieve_primes(limit):
    nums = np.arange(limit + 1)
    nums[1] = 0
    primes = []
    for i in range(2, limit):
        if nums[i] != 0:  # could just be `if nums[i]:` since 0 is false-y
            primes.append(i)
            nums = nums * (np.mod(nums, i) != 0)
    print(primes)

sieve_primes(1000)
This takes advantage of the fact that True and False are treated as 1 and 0 in Python for the multiplication.

EDIT: The other solutions people have shown are closer to the sieve itself than my solution. I was having trouble with the slicing notation; now that I've seen how that should be done, I'd redo the above as:
def sieve_primes(limit):
    nums = np.arange(limit + 1)
    nums[1] = 0
    for i in range(2, limit):
        if nums[i] != 0:
            nums[i+i::i] = 0
    print(np.nonzero(nums))
However, the first version is faster by my testing, so I'd end up just going back to it if I really cared about performance.

You can't entirely get rid of the loops, but at least the inner one can be vectorized, I think. That would reduce the "pythonic runtime" from squared to linear.
Maybe something like this?
result = np.full((limit+1,), True, dtype=bool)
result[0] = False
indices = np.arange(len(result))
for i in range(2, len(result)):
    multiples = indices[i:] * i
    result[multiples[multiples <= limit]] = False
This is just out of my head, so may contain bugs. Also, the multiples need to be clamped, since NumPy rejects out-of-bounds integer indices.

def primes(n):
    """Input: A natural number n.
    Output: An array of primes, p < n."""
    assert n >= 2
    # sieve[j] represents the odd number 2*j + 1
    sieve = np.ones(n // 2, dtype=bool)
    for i in range(3, int(n ** 0.5) + 1, 2):
        if sieve[i // 2]:
            sieve[i * i // 2::i] = False
    # indices of the surviving odd numbers, mapped back to 2*j + 1
    # (skipping j = 0, which represents 1), with 2 prepended
    return np.r_[2, 2 * np.nonzero(sieve)[0][1::] + 1]
It still relies on looping and I have no idea if you can fully vectorize everything, but it does at least get quite a bit of speedup thanks to the ability to index one array using another, so I can avoid the inner loop you have to use in pure Python. I have absolutely no clue what that final line is doing and I'm not sure I understood it 12 years ago, either.
Here's a start:
import numpy as np

def sieve(limit):
    result = np.ones(limit + 2, dtype=bool)
    result[0] = False
    result[1] = False
    for i in range(2, limit + 1):
        if result[i]:
            yield i
            result[::i] = False

for prime in sieve(100):
    print(prime)
As you said, you can't vectorize the outer loop because each iteration depends on the result of previous iterations, so I think you can't do too much better than that with this algorithm. (There are a few easy algorithm optimizations, but I think that's kind of orthogonal to your point, and anyway you can do them in any language.)

I would expect there's still a significant performance gap between that and, say, a simple C implementation, but that shouldn't be surprising, and personally I haven't encountered these supposed droves of people claiming that NumPy fully solves the performance gap. From what I've seen there was a general consensus that vectorizing with NumPy can significantly tighten the gap but can rarely fully close it.
> Some functions have axes arguments. Some have different versions with different names. Some have Conventions. Some have Conventions and axes arguments. And some don’t provide any vectorized version at all.
> But the biggest flaw of NumPy is this: Say you create a function that solves some problem with arrays of some given shape. Now, how do you apply it to particular dimensions of some larger arrays? The answer is: You re-write your function from scratch in a much more complex way. The basic principle of programming is abstraction—solving simple problems and then using the solutions as building blocks for more complex problems. NumPy doesn’t let you do that.
Usually when I write Matlab code, the vectorized version just works, and if there are any changes needed, they're pretty minor and intuitive. With numpy I feel like I have to look up the documentation for every single function, transposing and reshaping the array into whatever shape that particular function expects. It's not very consistent.
There are also lots of gotchas related to types returned by various functions and operations.
A particularly egregious example: for a long time, the class for univariate polynomial objects was np.poly1d. It had lots of conveniences for doing the usual operations on polynomials.
If you multiply a poly1d object P on the right by a complex scalar z0, you get what you probably expect: a poly1d with coefficients scaled by z0.
If however you multiply P on the left by z0, you get back an array of scaled coefficients -- there's a silent type conversion happening.
So
P*z0 # gives a poly1d
z0*P # gives an array
I know that this is due to Python's operator-dispatch rules and NumPy's laissez-faire approach to datatypes, but it's fairly ugly to debug something like this!

Another fun gotcha with poly1d: if you want to access the leading coefficient for a quadratic, you can do so with either P.coef[0] or P[2]. No one will ever get these confused, right?
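A minimal reproduction of both gotchas (assuming z0 is a NumPy complex scalar, which is typically what falls out of earlier NumPy computations; a plain Python complex may dispatch differently):

import numpy as np

P = np.poly1d([1, 2, 3])      # x^2 + 2x + 3
z0 = np.complex128(2j)        # a NumPy scalar, e.g. pulled out of another array

print(type(P * z0))           # numpy.poly1d -- coefficients scaled, as expected
print(type(z0 * P))           # numpy.ndarray -- silently converted to the coefficient array

# Two ways to read the leading coefficient of the quadratic:
print(P.coef[0])              # 1
print(P[2])                   # also 1 -- indexing is by power of x, not by position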
These particular examples aren't really fair because the numpy documentation describes poly1d as a part of the "old" polynomial API, advising new code to be written with the `Polynomial` API -- although it doesn't seem it's officially deprecated and no warnings are emitted when you use poly1d.
Anyway, there are similar warts throughout the library: lots of gotchas in the shape of silent type conversions, and of the same method returning inconsistent datatypes depending on its inputs, that are downright nightmarish to debug.
But yeah, I learned Fortran to use with R lol. And it is nice. Such easy interop.
Compared to NumPy, Xarray is a little thin in certain areas like linear algebra, but since it's very easy to drop back to NumPy from Xarray, what I've done in the past is add little helper functions for any specific NumPy stuff I need that isn't already included, so I only need to understand the NumPy version of the API well enough one time to write that helper function and its tests. (To be clear, though, the majority of NumPy ufuncs are supported out of the box.)
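For illustration, one of those helpers might look like this (a sketch using xr.apply_ufunc; the function name and dimension names are made up):

import numpy as np
import xarray as xr

# Wrap a NumPy reduction so it can be called with a dimension name
# instead of an axis number.
def vector_norm(da, dim):
    return xr.apply_ufunc(
        np.linalg.norm, da,
        input_core_dims=[[dim]],
        kwargs={"axis": -1},
    )

da = xr.DataArray(np.random.rand(5, 3), dims=("time", "xyz"))
print(vector_norm(da, "xyz"))   # one norm per "time" entry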
I'll finish by saying, to contrast with the author, I don't dislike NumPy, but I do find its API and data model to be insufficient for truly multidimensional data. For me three dimensions is the threshold where using Xarray pays off.
The docs say that it's a prototype feature, and I think it has been that way for a few years now, so no idea how production-ready it is.
For a while I had a feeling that I was perhaps a little crazy for seeming to be the only person to really dislike the use of things like ‘array[:, :, None]’ and so forth.
Indexing like `da.sel(x=some_x).isel(t=-1).mean(["y", "z"])` makes code so easy to write and understand.
Broadcasting is never ambiguous because dimension names are respected.
It's very good for geospatial data, allowing you to work in multiple CRSs with the same underlying data.
We also use it a lot for Bayesian modeling via Arviz [1], since it makes the extra dimensions you get from sampling your posterior easy to handle.
Finally, you can wrap many arrays into datasets, with common coordinates shared across the arrays. This allows you to select `ds.isel(t=-1)` across every array that has a time dimension.
[0] most of the above is fiction
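For concreteness, a toy sketch of the shared-coordinate selection described above (variable names and data made up):

import numpy as np
import xarray as xr

# Small dataset with a shared time coordinate.
ds = xr.Dataset(
    {
        "temp": (("t", "y", "x"), np.random.rand(4, 3, 5)),
        "pressure": (("t", "x"), np.random.rand(4, 5)),
    },
    coords={"t": np.arange(4), "x": np.linspace(0.0, 1.0, 5)},
)

last = ds.isel(t=-1)                         # last time step of every variable with a "t" dim
profile = ds["temp"].sel(x=0.5).mean(["y"])  # select by coordinate value, reduce by name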
Also worth noting the Array API standard exists now. This is generally also trying to straighten out the sharp edges.
In my view, python+numpy is not actually an array language. Numpy as a library adds vectorization operations to python in order to help with speed. This is different. It does not (intend to) bring the advantages that array language syntax has, even if it was a bit more consistent.
Numpy is what it is. Seems more disappointing that he had trouble with the more flexible libraries like Jax.
Anyway there’s a clear split between the sort of functionality that Numpy specializes in and the sort that Jax does, and they don’t benefit much from stepping on each other’s toes, right?
Also, treating things like `np.linalg.solve` as a black box that is the fastest thing in the world and that you could never do any better, so please mangle your code to call it correctly... that's just wrong. There are many reasons to build problem-specific linear algebra kernels, and that's something that is inaccessible without going deeper. But that's a different story.
This is true of modern Fortran also.
[2] "in recent years" because that's when I became aware of this stuff, not to say there wasn't effort before then
Without readily vectorizable operations, Matlab is about as slow as Python.
Slowness of Python is a huge pain point, and Julia has a clear advantage here. Sadly Julia is practically just unusable beyond quite niche purposes.
Python does now have quite serviceable JIT hacks that let one escape vectorization tricks, but they are still hacks, and a performant Python alternative would be very welcome. Sadly there aren't any.
Modern AIs are literal translation engines. That's their origin. The context that lets them do what they do was originally there to allow good translations. It doesn't matter if the translation is programming language output, or actual language output. I can today ask an AI to rewrite a chunk of Matlab code into Rust and it works! There's even code generators that will utilize the GPU where sensible.
We're really not that far off that we can write code in any language and transparently behind the scenes have it actually running on a more performant backend when needed. Ideally you keep the Matlab/Python frontend and the translation will be on the intermediate layers in a two way fashion so step through/debug works as expected.
Based on playing with the above steps manually with good results we're almost at the stage of just needing some glue to make it all work. Write in Matlab/Python, and run as fast as any LLVM backed language.
As in, the language is regularly benchmarked at 10-100x slower than other languages. It's not a language for caring about subtle performance characteristics beyond "am I calling the external library correctly or not?" (as in this article).
Why? Just the ecosystem?
We can only hope that Julia somehow wins and those of us forced to work in Python because of network effects can be freed.
This kind of “clever” API seems to be both a benefit and curse of the Python ecosystem in general. It makes getting started very easy, but also makes mastery maddeningly difficult.
Maybe the explicitness gets in the way and becomes inelegant once you get some experience.
The first time I came across Einsums was via the Tullio.jl package, and it seemed like magic to me. I believe the equivalent of this would be:
@tullio D[k, n] = 1/(L*M) * A[k, l, m] * B[l, n] * C[k, m]
which is really close to the mathematical notation.

To my understanding, template strings from PEP 750 will allow for something like:
D = 1/(L*M) * np.einsum(t'{A}[k,l,m] * {B}[l,n] * {C}[k,m]')
right? If so, that'd be pretty neat to have.

You have ten different libraries that try to behave like 4 other languages, and the only point of standardization is that there is usually something like a .to_numpy() function. This means that most of the time I was not solving any specific problem related to data processing, but just converting data from a format one library would understand into something another library would understand.
In Julia (a language with its own problems, of course) things mostly just work. The library for calculating uncertainties interacts well with a library handling units, and all this works fine with the plotting library, or with libraries solving differential equations, etc. In Python, this required quite a lot of boilerplate.
Of course, it would be nice if scipy had a block diagonal solver (maybe it does?). But yea, I mean, of course it would be nice if my problem was always built-in functionality of the library, lol.
Maybe a bsr_matrix could work.
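For what it's worth, scipy.sparse can be pushed in that direction; a rough sketch (block sizes and counts made up, and untested against simply batching np.linalg.solve):

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

# Hypothetical setup: 1000 independent 4x4 blocks along the diagonal.
blocks = [np.random.rand(4, 4) + 4 * np.eye(4) for _ in range(1000)]
rhs = np.random.rand(4000)

# Assemble a block-diagonal sparse matrix and solve in one shot.
A = sparse.block_diag(blocks, format="csr")
x = spsolve(A, rhs)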
TBH, if you would've asked me yesterday if I'm the sort of person who might get sucked in by a cliffhanger story about a numpy replacement, I'm pretty sure I would've been an emphatic no.
But I have, in fact, just tried random things in numpy until something worked...so, you know...tell me more...
I was at a conference the other day, and my friend (a very smart professor) was asking me if it would be possible to move away from Xarray to Pandas or Polars...
Perhaps using Numba or Cython (with loops) might make it fast but less prone to confusion.
Luckily for me, I mostly stay in tabular data (< 3 dimensions).
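For what it's worth, the Numba route for the sieve discussed above might look roughly like this (a sketch, assuming Numba is installed; the loops stay plain loops and the JIT strips out the interpreter overhead):

import numpy as np
from numba import njit

@njit
def sieve_mask(limit):
    is_prime = np.ones(limit + 1, dtype=np.bool_)
    is_prime[0] = False
    is_prime[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, limit + 1, i):
                is_prime[j] = False
    return is_prime

primes = np.flatnonzero(sieve_mask(1_000_000))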
import numpy as np
from scipy.special import softmax  # assumed here; the original snippet relied on a softmax defined elsewhere

def multi_head_attention_golfed(X, W_q, W_k, W_v, W_o, optimize="optimal"):
    scores = np.einsum('si,hij,tm,hmj->hst', X, W_q, X, W_k, optimize=optimize)
    weights = softmax(W_k.shape[-1]**-0.5 * scores, axis=-1)
    projected = np.einsum('hst,ti,hiv,hvd->shd', weights, X, W_v, W_o, optimize=optimize)
    return projected.reshape(X.shape[0], W_v.shape[2])
This is indeed twice as fast as the vectorized implementation, but, disappointingly, the naive implementation with loops is even faster. Here is the code if someone wants to figure out why the performance is like that: https://pastebin.com/raw/peptFyCw

My guess is that einsum could do a better job of considering cache coherency when evaluating the sum.
*This is ML, not Deep Learning or Transformers.
Last time I checked Matlab (that was surely a decade ago) it actually filled memory with copied data.
y = linalg.solve(A,x[:,:,None])
The documentation given says that A should be (..., M, M), and x should be (..., M, K). So if A is 100x5x5 and x is 100x5, then all you need to do is convert x to 100x5x1. Is that right? It doesn't seem that bad.
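A quick sanity check of that reading (data made up, shapes as in the example):

import numpy as np

A = np.random.rand(100, 5, 5) + 5 * np.eye(5)   # 100 independent, well-conditioned 5x5 systems
x = np.random.rand(100, 5)

y = np.linalg.solve(A, x[:, :, None])   # x becomes (100, 5, 1), i.e. (..., M, K) with K=1
print(y.shape)                          # (100, 5, 1)
y = y[..., 0]                           # drop the trailing K axis to get back to (100, 5)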
For example, in the linalg.solve problem, based on reading the documentation for <60 seconds I had two guesses for what would work, and from experimenting it looks like both work equally. If you don't want to take the time to understand the documentation or experiment to see what works, then just write the for loop.
For indexing problems, how about just don't use weird indexing techniques if you don't understand them? I have literally never needed to use a list of lists to index an array in my years of using numpy. And if you do need to use it for some reason, spend 1 minute to test out what happens. "What shape does B have?" is a question that can be answered with `print(B.shape)`. If the concern is about reading code written by other people, then context should make it clear what's happening, and if it's not clear then that's the fault of the person who wrote the code for not using sensible variable names/comments.
Create an array like this np.zeros((10, 20, 30)), with shape (10, 20, 30).
Clearly if you ask for array[0][:, [0,1]].shape you will get an output of shape (20, 2), because you first remove the first dimension, then `:` keeps all of the second dimension, and you get the first two elements of the third dimension.
Also, obviously we know that in numpy array[a][b][c] == array[a,b,c]. Therefore, what do you think array[0, :, [0,1]].shape returns? (20, 2)? Nope. (2, 20).
Why? Nobody knows.
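For the record, NumPy does document the rule, even if it's surprising; a minimal demonstration, with the documented explanation in the comment:

import numpy as np

A = np.zeros((10, 20, 30))

print(A[0][:, [0, 1]].shape)    # (20, 2)
print(A[0, :, [0, 1]].shape)    # (2, 20)

# In A[0, :, [0, 1]] both 0 and [0, 1] count as "advanced" indices, and because
# a slice sits between them, NumPy has no unambiguous place to put their
# broadcast dimension, so it moves it to the front of the result.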
einsum only being able to do multiplication makes it quite limited. If we leaned into the Einstein notation (e.g. [1]), we could make something that is both quite nice and quite performant.
And I learned from someone more senior than me that you should instead label your variables with single-letter axis names. This way, the reader reads regular non-einsum operations and still has the axis information in their mental model. And when you use numpy.einsum, this axis-labeling information becomes duplicated.
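A tiny sketch of that convention (names and shapes made up for illustration):

import numpy as np

logits_btv = np.random.rand(4, 10, 50)     # (batch, time, vocab)
embedding_vs = np.random.rand(50, 8)       # (vocab, state)

scores_bts = logits_btv @ embedding_vs     # the axis letters stay visible without einsum
mean_bs = scores_bts.mean(axis=1)          # and it's obvious that the "t" axis was reduced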
import torch
from torchdim import dims, softmax  # `dims` import added; torchdim's README examples use it this way

def multiheadattention(q, k, v, num_attention_heads, dropout_prob, use_positional_embedding):
    batch, query_sequence, key_sequence, heads, features = dims(5)
    heads.size = num_attention_heads

    # binding dimensions, and unflattening the heads from the feature dimension
    q = q[batch, query_sequence, [heads, features]]
    k = k[batch, key_sequence, [heads, features]]
    v = v[batch, key_sequence, [heads, features]]

    # einsum-style operators to calculate scores
    attention_scores = (q * k).sum(features) * (features.size ** -0.5)

    # use first-class dim to specify dimension for softmax
    attention_probs = softmax(attention_scores, dim=key_sequence)

    # dropout works pointwise, following Rule #1
    attention_probs = torch.nn.functional.dropout(attention_probs, p=dropout_prob)

    # another matrix product
    context_layer = (attention_probs * v).sum(key_sequence)

    # flatten heads back into features
    return context_layer.order(batch, query_sequence, [heads, features])
However, my impression from trying to get a wider userbase is that while numpy-style APIs are maybe not as good as some better array language, they might not be the bottleneck for getting things done in PyTorch. However, other domains might suffer more, and I am very excited to see a better array language catch on.

The inconsistency in argument naming is also annoying (even more if we include scipy), e.g. why is it np.fft.fft(x, axis=1) but np.fft.fftshift(x, axes=1)?!