If you've never done it, I recommend using the `dir` function in a REPL: find interesting things inside your objects, do `dir` on those, and keep the recursion going. It is a very eye-opening experience as to just how deep the objects in Python go.
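A minimal sketch of that exploration, if you want to automate it (the `explore` helper is hypothetical; the depth limit keeps the recursion from running away):

    def explore(obj, depth=2, indent=""):
        # Walk dir() output, recursing into each attribute's own dir().
        for name in dir(obj):
            if name.startswith("__"):
                continue  # skip dunders to keep the output readable
            print(indent + name)
            if depth > 1:
                explore(getattr(obj, name), depth - 1, indent + "  ")

    explore([], depth=2)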
That said, “is an object” is a bit of an overloaded term in this context. Most things in Python can be semantically addressed as objects (they have fields and methods, “dir” interrogates them, and so on), but not everything is necessarily backed by the same kind of heap-allocated memory object. Some builtins and bytecode-level optimizations, similar to (but not quite the same as) the ones discussed in TFA, result in things being less object-like (that is, PyObject*-ish structs with refcounting and/or pointer chasing on field access) in some cases.
And that’s an important difference, not a nitpick! It means that performance-interested folks can’t always use the same intuition for allocation/access behavior that they do when determining what fields are available on a syntactic object.
(And before the inevitable “if you want performance don’t write Python/use native libraries” folks pipe up: people still regularly have to care about making plain Python code fast. You might wish it weren’t so, but it is.)
Do you have an example? I thought literally everything in Python traced to a PyObject*.
> not everything is necessarily backed by the same kind of heap-allocated memory object.
`None`, `True` and `False` are; integers are; functions, bound methods, properties, classes and modules are... what sort of "everything" do you have in mind? Primitive objects don't get stored in some special way that avoids the per-object memory overhead, and Python has such objects for things that can't be used as values at all in many other languages (e.g. passed to or returned from a function).
Some use fields like the `tp_*` slots in special ways; the underlying implementation of their methods is in C, and the method attributes aren't looked up in the normal way. But there is still a PyObject* there.
... On further investigation (considering the values reported by `id`) I suppose that (at least in modern versions) the pre-created objects may be in static storage... ? Perhaps that has something to do with the implementation of https://peps.python.org/pep-0683/ ?
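For what it's worth, the effect is visible from Python itself (a quick check, not a guarantee; the sentinel value is a CPython implementation detail):

    import sys
    # PEP 683 (CPython 3.12+) pins the refcount of immortal objects like None
    # to a sentinel instead of tracking it, so this prints a huge constant
    # there, versus an ordinary, changing count on older versions.
    print(sys.getrefcount(None))
    print(id(None))  # stable for the life of the process either way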
> Some builtins and bytecode-level optimizations
String concatenation may mutate a string object that's presented to Python as immutable, so long as there aren't other references to it. But it's still a heap-allocated object. Similarly, Python pre-allocates objects for small integer values, but they're still stored on the heap. Very short strings may, along with those integers, go in a special memory pool, but that pool is still on the heap and the allocation still contains all the standard PyObject fields.
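Those caches are observable via identity checks, as long as you dodge compile-time constant interning -- hence the `int(str)` round-trips in this sketch (CPython-specific behavior, not a language guarantee):

    a, b = int("256"), int("256")
    print(a is b)  # True: both come from the preallocated small-int array
    c, d = int("10000"), int("10000")
    print(c is d)  # False on typical CPython builds: two separate heap objects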
> people still regularly have to care about making plain Python code fast.
Can you give examples of the people involved, or explain why their use cases are not satisfied by native libraries?
I’ll answer in two parts.
First, in my original claim:
> not everything is necessarily backed by the same kind of heap-allocated memory object.
the emphasized part is important. Not all PyObjects are equivalent. Most things in Python are indeed PyObjects of some stripe, but they act very, very differently in some common and performance-sensitive situations. The optimizations discussed in the article are one example: allocations can be re-used, and some more complex data structures are interned/cached (like the short-tuple/short-list optimization, which is similar to, but not the same as, how floats/numbers are reused via allocator pools: https://rushter.com/blog/python-lists-and-tuples/). As you point out, immortal objects and fully preallocated objects are also often special cases vis-a-vis the performance of creating/discarding objects.
Second, you’re wrong. PyObjects do not come into play in a wide variety of cases:
- In some (non-default even today, I believe) interpreter configurations, tagged pointers are used for some numbers, removing pointer chasing entirely: https://rushter.com/blog/python-lists-and-tuples/
- Bound method objects can be cached or bypassed, meaning that, while there’s a PyObject for the object being method-called on and the method, there’s sometimes one less PyObject for the binding: https://github.com/python/cpython/issues/70298, with additional/somewhat related info in PEP-580 and PEP-590. PyPy improves on this further: https://doc.pypy.org/en/latest/interpreter-optimizations.htm...
- None, False, and other “existential” types are, as you say, PyObjects. But they’re often not accessed as PyObjects and have much lower overhead when used in some cases. Even beyond the performance savings of “is” comparing by address (technically by id(), but that maps to address in CPython), special bytecode instructions like POP_JUMP_IF_NOT_NONE are used to not even introspect the "None" PyObject at all during comparison on some paths. Compare the “dis” output for an “x is None” check versus an “x is object()” check to see the tip of the iceberg here (a sketch follows this list).
- New/rolling out features like JIT compilation and tail-call optimization further invalidate the historical wisdom to consider everything an object: calling a function may not create PyObjects for argument stacks; accessing known-unchanged object fields may cache accessor results, and so on. But that’s all very new/not released yet; this isn’t meant as a “gotcha” to disprove your claim alone.
- While-True/While-1 loops don’t load or interact with the always-truthy constant value’s PyObject at all while they run, in many cases: https://github.com/python/cpython/blob/main/Lib/test/test_pe...
- Constant folding (which happens at a higher level than interning and allocation caching) deduplicates PyObjects, allowing identity/equality caches to be used more frequently and invalidating some of the historical advice to consider non-scalar data structure comparison arbitrarily expensive.
- Things like LOAD_ATTR (field/method accessors) retain their own caches, invalidating the wisdom that “obj.thing” always pays a pointer-chasing cost or a binding-creating cost. In many (I’d go as far as guessing it’s “most”, though I don’t have data) looping cases, attribute access on builtins/slotted classes is always returning attributes from the cache without addressing any PyObject fields deeper than a single pointer. That's very different from the overhead of deep diving through the MRO to make a lookup. Is it still a PyObject and still on the heap? Sure! But are you paying the same performance cost as accessing uncached/seven-pointers-away heap data? That answer is much less cut and dried.
- Many, many more examples I did not think of off the cuff.
The usual caveats apply: the above applies largely to CPython, and optimization behavior can and will change between interpreter versions.
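To illustrate the None-check bullet above, a sketch (exact instruction names vary by CPython version, so treat the comment as indicative rather than exact):

    import dis

    def f(x):
        if x is None:
            return 1
        return 0

    dis.dis(f)
    # On CPython 3.11+ the branch compiles to a specialized jump such as
    # POP_JUMP_IF_NOT_NONE (or POP_JUMP_FORWARD_IF_NOT_NONE), rather than a
    # generic LOAD_CONST None + IS_OP + POP_JUMP_IF_FALSE sequence.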
> Can you give examples of the people involved, or explain why their use cases are not satisfied by native libraries?
I mean, the most significant fact in support of the claim is that the above optimizations exist. They were made in response to identified need. Even if you take a highly uncharitable view of CPython maintainers’ priorities, the volume of optimization work speaks to real value delivered by making pure python fast.
Beyond that, anecdotally, everywhere I’ve worked on Python--big SaaS platforms, million+ RPS distributed job/queueing systems, scientific research projects--over the last 15 years (hello fellow veteran!) has routinely had engineers who needed to optimize pure-python ops as part of ordinary, mundane, day-job tasks.
Even once you remove optimizations they performed that would have been necessary in any language (stop copying large data objects, aggregate I/O, and so on), I’ve worked with … hell, probably hundreds of engineers at this point that occasionally needed to do work like “okay, I’m creating tons of tuples in a loop and it’s slow, how can I make sure to recycle/resize things such that the interning cache handles more of my code?”. Silly tricks like "make some of the tuples longer than they need to be so that length-based interning has cache hits for individual items" are sometimes useful and necessary! Yes, that's seriously funky code and should not be the first resort for anyone (and merits a significant comment/tests re: validating behavior on future interpreters), but sometimes those things produce point fixes which yield multiple percentage points of improvement on memory/GC pressure without introducing new libraries or more pervasive refactors. That's not nothing!
Sure, many of those cases would have been faster if numpy or another native-code library were in play. But lots of that code didn’t have or need numpy/extensions for anything else; including it would (depending on how long ago we’re talking) have required arduous installation/compatibility acrobatics; sometimes third-party modules are difficult to get approved; and, well, optimization needs were routinely met by applying a little knowledge of CPython’s optimizations anyway. So I’d say optimizing pure Python is both a valid approach and a common need.
And it also allows async functions, since state is held off the C stack, so frames can be easily switched when returning to the event loop.
The other thing made easy is C extension authoring. Compile CPython without free lists and with an address sanitizer, and getting reference counting wrong shows up immediately.
A while back I wrote this https://mohamed.computer/posts/python-internals-cpython-byte..., perhaps it's interesting for people who use `dir` and wonder what some of the weird things that show up are.
Although, there are also modern, beautiful, user friendly languages where allocation is mostly obvious. Like Fortran.
Julia is compiled, and for simple code like that example it will have performance on par with C, Rust, etc.
xxxx@xxxx:~
$ python3 -VV
Python 3.11.2 (main, Apr 28 2025, 14:11:48) [GCC 12.2.0]
xxxx@xxxx:~
$ pypy3 -VV
Python 3.9.16 (7.3.11+dfsg-2+deb12u3, Dec 30 2024, 22:36:23)
[PyPy 7.3.11 with GCC 12.2.0]
xxxx@xxxx:~
$ cat original_benchmark.py
#-------------------------------------------
import random
import time
def monte_carlo_pi(n):
    inside = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 <= 1.0:
            inside += 1
    return 4.0 * inside / n
# Benchmark
start = time.time()
result = monte_carlo_pi(100_000_000)
elapsed = time.time() - start
print(f"Time: {elapsed:.3f} seconds")
print(f"Estimated pi: {result}")
#-------------------------------------------
xxxx@xxxx:~
$ python3 original_benchmark.py
Time: 16.487 seconds
Estimated pi: 3.14177012
xxxx@xxxx:~
$ pypy3 original_benchmark.py
Time: 3.357 seconds
Estimated pi: 3.14166756
xxxx@xxxx:~
$ python3 -c "print(round(16.487/3.357, 1))"
4.9
I changed the code to take advantage of some basic performance tips that are commonly given for CPython (leaning on the standard library's itertools and math; preferring comprehensions/generator expressions to loose for loops), and was able to get the CPython numbers to improve by ~1.3x. But then the PyPy numbers took a hit:

xxxx@xxxx:~
$ cat mod_benchmark.py
#-------------------------------------------
from itertools import repeat
from math import hypot
from random import random
import time
def monte_carlo_pi(n):
    inside = sum(hypot(random(), random()) <= 1.0 for i in repeat(None, n))
    return 4.0 * inside / n
# Benchmark
start = time.time()
result = monte_carlo_pi(100_000_000)
elapsed = time.time() - start
print(f"Time: {elapsed:.3f} seconds")
print(f"Estimated pi: {result}")
#-------------------------------------------
xxxx@xxxx:~
$ python3 mod_benchmark.py
Time: 12.998 seconds
Estimated pi: 3.14149268
xxxx@xxxx:~
$ pypy3 mod_benchmark.py
Time: 12.684 seconds
Estimated pi: 3.14160844
xxxx@xxxx:~
$ python3 -c "print(round(16.487/12.684, 1))"
1.3

import random
import time
from numba import jit, int32, float64

@jit(float64(int32), nopython=True)
def monte_carlo_pi(n):
    inside = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 <= 1.0:
            inside += 1
    return 4.0 * inside / n

# Warm up (compile)
monte_carlo_pi(100)

# Benchmark
start = time.time()
result = monte_carlo_pi(100_000_000)
elapsed = time.time() - start
print(f"Time: {elapsed:.3f} seconds")
print(f"Estimated pi: {result}")
Base version (using the unmodified Python code from the slide):

$ python -m monte
Time: 13.758 seconds
Estimated pi: 3.14159524
Numba version:

$ python -m monte-numba
Time: 1.212 seconds
Estimated pi: 3.14143924

I ended up with the notation:
Initialization:

    head = ()

Push:

    head = data, head

Safe Pop:

    if head:
        data, head = head

Safe Top:

    head[0] if head else None
And for many stack-based algorithms, I've found this to be quite optimal, in part because the length-2 tuples get recycled (also due to a lack of function calls, member accesses, etc.). But I'm rather embarrassed to put it into a codebase due to others' expectations that Python should be beautiful, and this seems weird.

def cons(head, tail=()):
    return (head, tail)

def snoc(pair):
    '''
    Decompose (kind of) a pair, transforming () to (None, ()).
    So you can write slightly clearer code:
        pair = cons(head, tail)
        head, tail = snoc(pair)
    '''
    if pair: return pair
    else: return (None, ())

def car(pair):
    return pair[0] if pair else None

def cdr(pair):
    return pair[1] if pair else ()

def cons_iter(stack):
    '''
    Iterate a stack, e.g.
        for item in cons_iter(stack): ...
    '''
    while stack:
        head, stack = stack
        yield head
May you write much LISP in Python.

> Integers are likely the most used data type of any program, that means a lot of heap allocations.
I would guess strings come first, then floats, then booleans, and then integers. Are there any data available on that?
So integers definitely get used more than strings, and I'd argue way more than floats, in most programs.
But, otherwise, I'd agree that strings probably win, globally.
A char is just a machine integer with implementation-specified signedness (crazy), bools are just machine integers that aren't supposed to have values other than 0 or 1, and the floating-point types are just integers reinterpreted as binary fractions in a strange way.
Addresses are just machine integers of course, but pointers have provenance which means that it matters why you have the pointer, whereas for the machine integers their value is entirely determined by the bits making them up.
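In Python terms, the `struct` module makes the "floats are just reinterpreted integers" point concrete:

    import struct
    # Reinterpret the 8 bytes of a double as an unsigned 64-bit integer.
    bits = struct.unpack("<Q", struct.pack("<d", 1.5))[0]
    print(hex(bits))  # 0x3ff8000000000000: sign, exponent, and mantissa of 1.5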
That's defect report #260 against the C language. One option for WG14 was to say "Oops, yeah, that should work in our language" and then modern C compilers have to be modified and many C programs are significantly slower. This gets the world you (and you're far from alone among C programmers) thought you lived in, though probably your C programs are slower than you expected now 'cos your "pointers" are now just address integers and you've never thought about the full consequences.
But they didn't, instead they wrote "They may also treat pointers based on different origins as distinct even though they are bitwise identical" because by then that is in fact how C compilers work. That's them saying pointers have provenance, though they do not describe (and neither does their ISO document) how that works.
There is currently a TR (I think, maybe wrong letters) which explains PNVI-ae-udi, Provenance Not Via Integers, Addresses Exposed, User Disambiguates which is the current preferred model for how this could possibly work. Compilers don't implement that properly either, but they could in principle so that's why it is seen as a reasonable goal for the C language. That TR is not part of the ISO standard but in principle one day it could be. Until then, provenance is just a vague shrug in C. What you said is wrong, and er... yeah that is awkward for everybody.
Rust does specify how this works. But the bad news (I guess?) for you is that it too says that provenance is a thing, so you cannot just go around claiming the address you dredged up from who knows where is a valid pointer, it ain't. Or rather, in Rust you can write out explicitly that you do want to do this, but the specification is clear that you get a pointer but not necessarily a valid pointer even if you expected otherwise.
[1] As an aside, the last time I tried to talk to a committee representative about undefined behaviour optimisation pitfalls, I was told that the standard does not prescribe optimisations. Which was quite puzzling, because it obviously prescribes compiler behaviour with the express goal of allowing certain optimisations. If I took that statement at face value, it would follow that undefined behaviour is not there for optimisation's sake, but rather as a fun feature to make programming more interesting...
It also doesn’t have UB for cargo cult reasons.
A pointer derived from only an integer was supposed to alias any allocation, but good luck discussing this with your optimizing C compiler.
A boolean is an integer.
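In Python that's literally true; `bool` is a subclass of `int`:

    print(isinstance(True, int))  # True
    print(True + True)            # 2: booleans participate in integer arithmetic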
I’ve found Python optimization to be nearly intractable. I’ve spent a significant amount of time over the past two decades optimizing C, Java, Swift, Ruby, SQL and I’m sure more. The techniques are largely the same. In Python, however, everything seems expensive: field lookup on an object, dynamic dispatch, string/array concatenation. After optimization, the code is no longer “pythonic” (which has come to mean slow, in my vernacular).
Are there any good resources on optimizing python performance while keeping idiomatic?
At the risk of sounding snarky and/or unhelpful, in my experience, the answer is that you don't try to optimize Python code beyond fixing your algorithm to have better big-O properties, followed by calling out to external code that isn't written in Python (e.g., NumPy, etc).
But, I'm a hater. I spent several years working with Python and hated almost every minute of it for various reasons. Very few languages repulse me the way Python does: I hate the syntax, the semantics, the difficulty of distribution, and the performance (memory and CPU, and is GIL disabled by default yet?!)...
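For what the "call out to native code" route looks like, here's a sketch of the Monte Carlo benchmark from earlier in the thread, vectorized with NumPy (one array pass replaces the per-iteration bytecode):

    import numpy as np

    def monte_carlo_pi_np(n):
        rng = np.random.default_rng()
        x = rng.random(n)
        y = rng.random(n)
        return 4.0 * np.count_nonzero(x * x + y * y <= 1.0) / n

    print(monte_carlo_pi_np(10_000_000))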
That's impossible[1].
I think it is impossible because when I identify a slow function using cProfile, then use dis.dis() on it to view the instructions executed, most of the overhead - by that I mean time spent doing something other than the calculation the code describes - is spent determining what each "thing" is. It's all trails of calls trying to determine "this thing can't be __add__()'d to that thing, but maybe that thing can be __radd__()'d to this thing instead". Long way to say: most of the time-wasting instructions I see can be attacked by providing C types via some approach like that (mypyc, maybe Cython, etc. etc.) - but at this point you're well beyond "idiomatic".
[1] I'm really curious to know the answer to your question so i'll post a confident (but hopefully wrong) answer so that someone feels compelled to correct me :-)
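To make that concrete, the kind of function where dis.dis() shows the effect (a sketch; opcode names differ across CPython versions):

    import dis

    def add_all(items):
        total = 0
        for x in items:
            # Each += is generic dispatch at runtime: inspect the operand
            # types, try __add__/__radd__, box the result, and so on.
            total += x
        return total

    dis.dis(add_all)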
Branches cannot be reasoned about, since they are all inside the CPython state machine.
Memory performance is about being efficient with allocations, cache locality, and pointer chasing.
Python has huge amounts of allocations and pointer chasing. It's slow by design.
It is essentially impossible, "essentially" here using not the modern sense of "mostly", but "essential" as in baked into the essence of the language. It has been the advice in the Python community pretty much since the beginning that the solution is to go to another language in that case. There are a number of solutions to the problem, ranging from trying PyPy, implementing an API in another language, Cython/Pyrex, up to traditional embedding of a C/C++ program into a Python module.
However this is one of those cases where there are a lot of solutions precisely because none of them are quite perfect and all have some sort of serious downside. Depending on what you are doing, you may find one whose downside you don't care about. But there's no simple, bullet-proof cookbook answer for "what do I do when Python is too slow even after basic optimization".
Python is fundamentally slow. Too many people still hear that as an attack on the language, rather than an engineering fact that needs to be kept in mind. Speed isn't everything, and as such, Python is suitable for a wide variety of tasks even so. But it is, still, fundamentally slow, and those who read that as an attack rather than an engineering assessment are more likely to find themselves in quite a pickle (pun somewhat intended) one day when they have a mass of Python code that isn't fast enough and no easy solutions for the problem than those who understand the engineering considerations in choosing Python.
Then just looking at those, I now know of Shedskin and ComPyler.
I do feel like one nice thing about so many people working on solutions to every problem in python, is that it means that when you do encounter a serious downside, you have a lot more flexibility to move forwards along a different path.
Python is a great language for rapidly bashing out algorithmic code, glue scripts, etc. Unfortunately, due to how dynamic it is, it's a language that fundamentally doesn't translate well to operations CPUs can perform efficiently. Hardly any Python programs ever need to be as dynamic as what the language allows.
I've had very good experiences applying cython to python programs that need to do some kind of algorithmic number crunching, where numpy alone doesn't get the job done.
With Cython you start with your Python code and incrementally add static typing that reduces the layers of Python interpreter abstractions and wrappings necessary. Cython has a very useful and amusing output mode where it spits out an HTML report of annotated Cython source code, with lines highlighted in yellow in proportion to the amount of Python overhead. You click on any line of Python in that report and it expands to show how many CPython API operations are required to implement it; then you add more static type hints and recompile until the yellow goes away and the compute-heavy kernel of your script is a C program that compiles to operations that real-world CPUs can execute efficiently.
Downside of Cython is the extra build toolchain and deployment concerns it drags in - if you previously had a pure-Python module, now you've got native modules, so you need to bake platform-specific wheels for each deployment target.
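A sketch of that workflow using Cython's pure-Python mode (the type annotations come from the `cython` shim module, so the file still runs under plain CPython too; compiling it with `cython -a` produces the annotated HTML report described above):

    import cython
    from random import random

    def monte_carlo_pi(n: cython.int) -> cython.double:
        inside: cython.int = 0
        i: cython.int
        x: cython.double
        y: cython.double
        for i in range(n):
            x = random()
            y = random()
            if x * x + y * y <= 1.0:
                inside += 1  # with the types above this becomes plain C arithmetic
        return 4.0 * inside / n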
For Python that's being used as a glue scripting language, not number crunching, worth considering rewriting the script in something like Go. Go has a pretty good standard library to handle many tasks without needing to install many 3rd party packages, and the build, test, deploy story is very nice.
If what you're trying to do involves tasks that can be done in parallel, the multithreading (if I/O bound) or multiprocessing (if compute bound) libraries can be very useful.
If what you're doing isn't conducive to either, you probably need to rewrite at least the critical parts in something else.
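A minimal multiprocessing sketch along those lines, reusing the thread's Monte Carlo example (the worker count is illustrative):

    from multiprocessing import Pool
    from random import random

    def count_inside(n):
        # CPU-bound worker: runs in a separate process, sidestepping the GIL.
        return sum(random() ** 2 + random() ** 2 <= 1.0 for _ in range(n))

    if __name__ == "__main__":
        total, workers = 10_000_000, 8
        with Pool(workers) as pool:
            hits = sum(pool.map(count_inside, [total // workers] * workers))
        print(f"Estimated pi: {4.0 * hits / total}")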
But even though Python is very slow, it is still very popular. So the language itself must be very good, in my view; otherwise fewer people would use it.
Move your performance critical kernels outside Python. Numpy, Pytorch, external databases, Golang microservice, etc.
It's in fact an extremely successful paradigm. Python doesn't need to be fast to be used in every domain (though async/nogil is still worth advancing to avoid idle CPUs).
If your Python software is spending 99% of its time doing X, you can rewrite X in Rust and now the Python software using that is faster, but if your Python software is spending 7% of its time doing X, and 5% doing Y, and 3% doing Z then even if you rewrite X, and Y and Z you've shaved off no more than 15% and so it's probably time to stop writing Python at all.
This is part of why Google moved to Go as I understand it.
Exactly, that's one case where the OSes completely differ. Just because the OSes share the same kernel, which prioritizes userspace compatibility, doesn't mean they are not a different OS. A kernel doesn't make an OS.
Or you can just put it on PyPI which is everywhere nowadays.
And that wasn't true in Ruby(?!)
The first thing to do when optimizing Python is to use a JIT (PyPy). Like, yes Python is slow if you don't allow a JIT.
Alternatively, go the other JIT-less direction but use C (and AOT language).
Otherwise, optimizing code in Python is the same as in any other language eg
Fuck, this made me feel old. It blows my mind that the learning path Rust then Python even exists.
When I started school in 2000, it was C, then C++, and then Java. Oh, and then compilers. It’s impossible for me not to think about allocation when doing anything, it’s just rooted in my mind.
The prominence of Python for those students makes sense to me, thirty years ago I'm sure a BASIC would have been fine too. It doesn't need to be elegant, or have good performance, it does matter that it's free, and that there's a lot of existing crap written in Python you can crib from.
(This plays hell on things like Valgrind or ASAN.)
zahlman•3mo ago
> Let’s take out the print statement and see if it’s just the addition:
Just FWIW: the assignment is not required to prevent optimizing out the useless addition. It isn't doing any static analysis, so it doesn't know that `range` is the builtin, and thus doesn't know that `i` is an integer, and thus doesn't know that `+` will be side-effect-free.
> Nope, it seems there is a pre-allocated list of objects for integers in the range of -5 -> 1025. This would account for 1025 iterations of our loop but not for the rest.
1024 iterations, because the check is for numbers strictly less than `_PY_NSMALLPOSINTS` and the value computed is `i + 1` (so, `1` on the first iteration).
Interesting. I knew of them only ranging up to 256 (https://stackoverflow.com/questions/306313).
It turns out (https://github.com/python/cpython/commit/7ce25edb8f41e527ed4...) that the change is barely a month old in the repository; so it's not in 3.14 (https://github.com/python/cpython/blob/3.14/Include/internal...) and won't show up until 3.15.
> Our script appears to actually be reusing most of the PyLongObject objects!
The interesting part is that it can somehow do this even though the values are increasing throughout the loop (i.e., to values not seen on previous iterations), and it also doesn't need to allocate for the value of `i` retrieved from the `range`.
> But realistically the majority of integers in a program are going to be less than 2^30 so why not introduce a fast path which skips this complicated code entirely?
This is the sort of thing where PRs to CPython are always welcome, to my understanding. It probably isn't a priority, or something that other devs have thought of, because that allocation presumably isn't a big deal compared to the time taken for the actual conversion, which in turn is normally happening because of some kind of I/O request. (Also, real programs probably do simple arithmetic on small numbers much more often than they string-format them.)
petters•3mo ago
I think that is mostly of historical interest. For example, it still does not support Python 3 and has not been updated in a very long time