If you've never done it, I recommend using the `dir` function in a REPL: find interesting things inside your objects, do `dir` on those, and keep the recursion going. It is a very eye-opening experience as to just how deep the objects in Python go.
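A minimal sketch of that exploration, if you want to automate it (the `explore` helper is hypothetical; the depth limit keeps the recursion from running away):

    def explore(obj, depth=2, indent=""):
        # Walk dir() output, recursing into each attribute's own dir().
        for name in dir(obj):
            if name.startswith("__"):
                continue  # skip dunders to keep the output readable
            print(indent + name)
            if depth > 1:
                explore(getattr(obj, name), depth - 1, indent + "  ")

    explore([], depth=2)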
That said, “is an object” is a bit of an overloaded term in this context. Most things in Python can be semantically addressed as objects (they have fields and methods, “dir” interrogates them, and so on), but not everything is necessarily backed by the same kind of heap-allocated memory object. Some builtins and bytecode-level optimizations, similar to (but not quite the same as) the ones discussed in TFA, result in things being less object-like (that is, PyObject*-ish structs with refcounting and/or pointer chasing on field access) in some cases.
And that’s an important difference, not a nitpick! It means that performance-interested folks can’t always use the same intuition for allocation/access behavior that they do when determining what fields are available on a syntactic object.
(And before the inevitable “if you want performance don’t write Python/use native libraries” folks pipe up: people still regularly have to care about making plain Python code fast. You might wish it weren’t so, but it is.)
Do you have an example? I thought literally everything in Python traced to a PyObject*.
> not everything is necessarily backed by the same kind of heap-allocated memory object.
`None`, `True` and `False` are; integers are; functions, bound methods, properties, classes and modules are... what sort of "everything" do you have in mind? Primitive objects don't get stored in some special way that avoids the per-object memory overhead, and Python has such objects for things that can't be used as values at all in many other languages (e.g. passed to or returned from a function).
Some use fields like the `tp_*` slots in special ways; the underlying implementation of their methods is in C, and the method attributes aren't looked up in the normal way. But there is still a PyObject* there.
... On further investigation (considering the values reported by `id`) I suppose that (at least in modern versions) the pre-created objects may be in static storage... ? Perhaps that has something to do with the implementation of https://peps.python.org/pep-0683/ ?
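For what it's worth, the effect is visible from Python itself (a quick check, not a guarantee; the sentinel value is a CPython implementation detail):

    import sys
    # PEP 683 (CPython 3.12+) pins the refcount of immortal objects like None
    # to a sentinel instead of tracking it, so this prints a huge constant
    # there, versus an ordinary, changing count on older versions.
    print(sys.getrefcount(None))
    print(id(None))  # stable for the life of the process either way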
> Some builtins and bytecode-level optimizations
String concatenation may mutate a string object that's presented to Python as immutable, so long as there aren't other references to it. But it's still a heap-allocated object. Similarly, Python pre-allocates objects for small integer values, but they're still stored on the heap. Very short strings may, along with those integers, go in a special memory pool, but that pool is still on the heap and the allocation still contains all the standard PyObject fields.
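Those caches are observable via identity checks, as long as you dodge compile-time constant interning -- hence the `int(str)` round-trips in this sketch (CPython-specific behavior, not a language guarantee):

    a, b = int("256"), int("256")
    print(a is b)  # True: both come from the preallocated small-int array
    c, d = int("10000"), int("10000")
    print(c is d)  # False on typical CPython builds: two separate heap objects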
> people still regularly have to care about making plain Python code fast.
Can you give examples of the people involved, or explain why their use cases are not satisfied by native libraries?
I’ll answer in two parts.
First, in my original claim:
> not everything is necessarily backed by the same kind of heap-allocated memory object.
the emphasized part is important. Not all PyObjects are equivalent. Most things in Python are indeed PyObjects of some stripe, but they act very, very differently in some common and performance-sensitive situations. The optimizations discussed in the article are one example: allocations can be re-used, and some more complex data structures are interned/cached (like the short-tuple/short-list optimization, which is similar to, but not the same as, how floats/numbers are reused via allocator pools: https://rushter.com/blog/python-lists-and-tuples/). As you point out, immortal objects and fully preallocated objects are also often special cases vis-a-vis the performance of creating/discarding objects.
Second, you’re wrong. PyObjects do not come into play in a wide variety of cases:
- In some (non-default even today, I believe) interpreter configurations, tagged pointers are used for some numbers, removing pointer chasing entirely: https://rushter.com/blog/python-lists-and-tuples/
- Bound method objects can be cached or bypassed, meaning that, while there’s a PyObject for the object being method-called on and the method, there’s sometimes one less PyObject for the binding: https://github.com/python/cpython/issues/70298, with additional/somewhat related info in PEP-580 and PEP-590. PyPy improves on this further: https://doc.pypy.org/en/latest/interpreter-optimizations.htm...
- None, False, and other “existential” types are, as you say, PyObjects. But they’re often not accessed as PyObjects and have much lower overhead when used in some cases. Even beyond the performance savings of “is” comparing by address (technically by id(), but that maps to address in CPython), special bytecode instructions like POP_JUMP_IF_NOT_NONE are used to not even introspect the "None" PyObject at all during comparison on some paths. Compare the “dis” output for an “x is None” check versus an “x is object()” check to see the tip of the iceberg here (a sketch follows this list).
- New/rolling out features like JIT compilation and tail-call optimization further invalidate the historical wisdom to consider everything an object: calling a function may not create PyObjects for argument stacks; accessing known-unchanged object fields may cache accessor results, and so on. But that’s all very new/not released yet; this isn’t meant as a “gotcha” to disprove your claim alone.
- While-True/While-1 loops don’t load or interact with the always-truthy constant value’s PyObject at all while they run, in many cases: https://github.com/python/cpython/blob/main/Lib/test/test_pe...
- Constant folding (which happens at a higher level than interning and allocation caching) deduplicates PyObjects, allowing identity/equality caches to be used more frequently and invalidating some of the historical advice to consider non-scalar data structure comparison arbitrarily expensive.
- Things like LOAD_ATTR (field/method accessors) retain their own caches, invalidating the wisdom that “obj.thing” always pays a pointer-chasing cost or a binding-creating cost. In many (I’d go as far as guessing it’s “most”, though I don’t have data) looping cases, attribute access on builtins/slotted classes is always returning attributes from the cache without addressing any PyObject fields deeper than a single pointer. That's very different from the overhead of deep diving through the MRO to make a lookup. Is it still a PyObject and still on the heap? Sure! But are you paying the same performance cost as accessing uncached/seven-pointers-away heap data? That answer is much less cut and dried.
- Many, many more examples I did not think of off the cuff.
The usual caveats apply: the above applies largely to CPython, and optimization behavior can and will change between interpreter versions.
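To illustrate the None-check bullet above, a sketch (exact instruction names vary by CPython version, so treat the comment as indicative rather than exact):

    import dis

    def f(x):
        if x is None:
            return 1
        return 0

    dis.dis(f)
    # On CPython 3.11+ the branch compiles to a specialized jump such as
    # POP_JUMP_IF_NOT_NONE (or POP_JUMP_FORWARD_IF_NOT_NONE), rather than a
    # generic LOAD_CONST None + IS_OP + POP_JUMP_IF_FALSE sequence.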
> Can you give examples of the people involved, or explain why their use cases are not satisfied by native libraries?
I mean, the most significant fact in support of the claim is that the above optimizations exist. They were made in response to identified need. Even if you take a highly uncharitable view of CPython maintainers’ priorities, the volume of optimization work speaks to real value delivered by making pure python fast.
Beyond that, anecdotally, everywhere I’ve worked on Python--big SaaS platforms, million+ RPS distributed job/queueing systems, scientific research projects--over the last 15 years (hello fellow veteran!) has routinely had engineers who needed to optimize pure-python ops as part of ordinary, mundane, day-job tasks.
Even once you remove optimizations they performed that would have been necessary in any language (stop copying large data objects, aggregate I/O, and so on), I’ve worked with … hell, probably hundreds of engineers at this point that occasionally needed to do work like “okay, I’m creating tons of tuples in a loop and it’s slow, how can I make sure to recycle/resize things such that the interning cache handles more of my code?”. Silly tricks like "make some of the tuples longer than they need to be so that length-based interning has cache hits for individual items" are sometimes useful and necessary! Yes, that's seriously funky code and should not be the first resort for anyone (and merits a significant comment/tests re: validating behavior on future interpreters), but sometimes those things produce point fixes which yield multiple percentage points of improvement on memory/GC pressure without introducing new libraries or more pervasive refactors. That's not nothing!
Sure, many of those cases would have been faster if numpy or another native-code library were in play. But lots of that code didn’t have or need numpy/extensions for anything else; including it would (depending on how long ago we’re talking) have required arduous installation/compatibility acrobatics; sometimes third-party modules are difficult to get approved; and, well, optimization needs were routinely met by applying a little knowledge of CPython’s optimizations anyway. So I’d say optimizing pure Python is both a valid approach and a common need.
And it also allows async functions, since state is held off the C stack, so frames can be easily switched when returning to the event loop.
The other thing made easy is C extension authoring. Compile CPython without free lists and with an address sanitizer, and getting reference counting wrong shows up immediately.
A while back I wrote this https://mohamed.computer/posts/python-internals-cpython-byte..., perhaps it's interesting for people who use `dir` and wonder what some of the weird things that show up are.
Although, there are also modern, beautiful, user friendly languages where allocation is mostly obvious. Like Fortran.
Julia is compiled, and for simple code like that example it will have performance on par with C, Rust, etc.
xxxx@xxxx:~
$ python3 -VV
Python 3.11.2 (main, Apr 28 2025, 14:11:48) [GCC 12.2.0]
xxxx@xxxx:~
$ pypy3 -VV
Python 3.9.16 (7.3.11+dfsg-2+deb12u3, Dec 30 2024, 22:36:23)
[PyPy 7.3.11 with GCC 12.2.0]
xxxx@xxxx:~
$ cat original_benchmark.py
#-------------------------------------------
import random
import time
def monte_carlo_pi(n):
    inside = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 <= 1.0:
            inside += 1
    return 4.0 * inside / n
# Benchmark
start = time.time()
result = monte_carlo_pi(100_000_000)
elapsed = time.time() - start
print(f"Time: {elapsed:.3f} seconds")
print(f"Estimated pi: {result}")
#-------------------------------------------
xxxx@xxxx:~
$ python3 original_benchmark.py
Time: 16.487 seconds
Estimated pi: 3.14177012
xxxx@xxxx:~
$ pypy3 original_benchmark.py
Time: 3.357 seconds
Estimated pi: 3.14166756
xxxx@xxxx:~
$ python3 -c "print(round(16.487/3.357, 1))"
4.9
I changed the code to take advantage of some basic performance tips that are commonly given for CPython (leaning on the standard library's itertools and math; preferring comprehensions/generator expressions to loose for loops), and was able to get the CPython numbers to improve by ~1.3x. But then the PyPy numbers took a hit:

xxxx@xxxx:~
$ cat mod_benchmark.py
#-------------------------------------------
from itertools import repeat
from math import hypot
from random import random
import time
def monte_carlo_pi(n):
    inside = sum(hypot(random(), random()) <= 1.0 for i in repeat(None, n))
    return 4.0 * inside / n
# Benchmark
start = time.time()
result = monte_carlo_pi(100_000_000)
elapsed = time.time() - start
print(f"Time: {elapsed:.3f} seconds")
print(f"Estimated pi: {result}")
#-------------------------------------------
xxxx@xxxx:~
$ python3 mod_benchmark.py
Time: 12.998 seconds
Estimated pi: 3.14149268
xxxx@xxxx:~
$ pypy3 mod_benchmark.py
Time: 12.684 seconds
Estimated pi: 3.14160844
xxxx@xxxx:~
$ python3 -c "print(round(16.487/12.684, 1))"
1.3

import random
import time
from numba import jit, int32, float64

@jit(float64(int32), nopython=True)
def monte_carlo_pi(n):
    inside = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 <= 1.0:
            inside += 1
    return 4.0 * inside / n

# Warm up (compile)
monte_carlo_pi(100)

# Benchmark
start = time.time()
result = monte_carlo_pi(100_000_000)
elapsed = time.time() - start
print(f"Time: {elapsed:.3f} seconds")
print(f"Estimated pi: {result}")
Base version (using the unmodified Python code from the slide):

$ python -m monte
Time: 13.758 seconds
Estimated pi: 3.14159524
Numba version:

$ python -m monte-numba
Time: 1.212 seconds
Estimated pi: 3.14143924

I ended up with the notation:
Initialization:

    head = ()

Push:

    head = data, head

Safe Pop:

    if head:
        data, head = head

Safe Top:

    head[0] if head else None
And for many stack-based algorithms, I've found this to be quite optimal, in part because the length-2 tuples get recycled (also due to a lack of function calls, member accesses, etc.). But I'm rather embarrassed to put it into a codebase due to others' expectations that Python should be beautiful, and this seems weird.

def cons(head, tail=()):
    return (head, tail)

def snoc(pair):
    '''
    Decompose (kind of) a pair, transforming () to (None, ()).
    So you can write slightly clearer code:
        pair = cons(head, tail)
        head, tail = snoc(pair)
    '''
    if pair: return pair
    else: return (None, ())

def car(pair):
    return pair[0] if pair else None

def cdr(pair):
    return pair[1] if pair else ()

def cons_iter(stack):
    '''
    Iterate a stack, e.g.
        for item in cons_iter(stack): ...
    '''
    while stack:
        head, stack = stack
        yield head
May you write much LISP in Python.

> Integers are likely the most used data type of any program, that means a lot of heap allocations.
I would guess strings come first, then floats, then booleans, and then integers. Are there any data available on that?
So integers definitely get used more than strings, and I'd argue way more than floats, in most programs.
But, otherwise, I'd agree that strings probably win, globally.
A char is just a machine integer with implementation-specified signedness (crazy), bools are just machine integers that aren't supposed to have values other than 0 or 1, and the floating-point types are just integers reinterpreted as binary fractions in a strange way.
Addresses are just machine integers of course, but pointers have provenance which means that it matters why you have the pointer, whereas for the machine integers their value is entirely determined by the bits making them up.
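In Python terms, the `struct` module makes the "floats are just reinterpreted integers" point concrete:

    import struct
    # Reinterpret the 8 bytes of a double as an unsigned 64-bit integer.
    bits = struct.unpack("<Q", struct.pack("<d", 1.5))[0]
    print(hex(bits))  # 0x3ff8000000000000: sign, exponent, and mantissa of 1.5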
That's defect report #260 against the C language. One option for WG14 was to say "Oops, yeah, that should work in our language" and then modern C compilers have to be modified and many C programs are significantly slower. This gets the world you (and you're far from alone among C programmers) thought you lived in, though probably your C programs are slower than you expected now 'cos your "pointers" are now just address integers and you've never thought about the full consequences.
But they didn't, instead they wrote "They may also treat pointers based on different origins as distinct even though they are bitwise identical" because by then that is in fact how C compilers work. That's them saying pointers have provenance, though they do not describe (and neither does their ISO document) how that works.
There is currently a TR (I think, maybe wrong letters) which explains PNVI-ae-udi, Provenance Not Via Integers, Addresses Exposed, User Disambiguates which is the current preferred model for how this could possibly work. Compilers don't implement that properly either, but they could in principle so that's why it is seen as a reasonable goal for the C language. That TR is not part of the ISO standard but in principle one day it could be. Until then, provenance is just a vague shrug in C. What you said is wrong, and er... yeah that is awkward for everybody.
Rust does specify how this works. But the bad news (I guess?) for you is that it too says that provenance is a thing, so you cannot just go around claiming the address you dredged up from who knows where is a valid pointer, it ain't. Or rather, in Rust you can write out explicitly that you do want to do this, but the specification is clear that you get a pointer but not necessarily a valid pointer even if you expected otherwise.
[1] As an aside, the last time I tried to talk to a committee representative about undefined behaviour optimisation pitfalls, I was told that the standard does not prescribe optimisations. Which was quite puzzling, because it obviously prescribes compiler behaviour with the express goal of allowing certain optimisations. If I took that statement at face value, it would follow that undefined behaviour is not there for optimisation's sake, but rather as a fun feature to make programming more interesting...
It also doesn’t have UB for cargo cult reasons.
A pointer derived from only an integer was supposed to alias any allocation, but good luck discussing this with your optimizing C compiler.
A boolean is an integer.
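In Python that's literally true; `bool` is a subclass of `int`:

    print(isinstance(True, int))  # True
    print(True + True)            # 2: booleans participate in integer arithmetic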
I’ve found Python optimization to be nearly intractable. I’ve spent a significant amount of time over the past two decades optimizing C, Java, Swift, Ruby, SQL and I’m sure more. The techniques are largely the same. In Python, however, everything seems expensive: field lookup on an object, dynamic dispatch, string/array concatenation. After optimization, the code is no longer “pythonic” (which has come to mean slow, in my vernacular).
Are there any good resources on optimizing python performance while keeping idiomatic?
At the risk of sounding snarky and/or unhelpful, in my experience, the answer is that you don't try to optimize Python code beyond fixing your algorithm to have better big-O properties, followed by calling out to external code that isn't written in Python (e.g., NumPy, etc).
But, I'm a hater. I spent several years working with Python and hated almost every minute of it for various reasons. Very few languages repulse me the way Python does: I hate the syntax, the semantics, the difficulty of distribution, and the performance (memory and CPU, and is GIL disabled by default yet?!)...
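For what the "call out to native code" route looks like, here's a sketch of the Monte Carlo benchmark from earlier in the thread, vectorized with NumPy (one array pass replaces the per-iteration bytecode):

    import numpy as np

    def monte_carlo_pi_np(n):
        rng = np.random.default_rng()
        x = rng.random(n)
        y = rng.random(n)
        return 4.0 * np.count_nonzero(x * x + y * y <= 1.0) / n

    print(monte_carlo_pi_np(10_000_000))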
That's impossible[1].
I think it is impossible because when I identify a slow function using cProfile, then use dis.dis() on it to view the instructions executed, most of the overhead - by that I mean time spent doing something other than the calculation the code describes - is spent determining what each "thing" is. It's all trails of calls trying to determine "this thing can't be __add__()'d to that thing, but maybe that thing can be __radd__()'d to this thing instead". Long way to say: most of the time-wasting instructions I see can be attacked by providing C types via some approach like that (mypyc, maybe Cython, etc. etc.) - but at this point you're well beyond "idiomatic".
[1] I'm really curious to know the answer to your question so i'll post a confident (but hopefully wrong) answer so that someone feels compelled to correct me :-)
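To make that concrete, the kind of function where dis.dis() shows the effect (a sketch; opcode names differ across CPython versions):

    import dis

    def add_all(items):
        total = 0
        for x in items:
            # Each += is generic dispatch at runtime: inspect the operand
            # types, try __add__/__radd__, box the result, and so on.
            total += x
        return total

    dis.dis(add_all)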
Branches cannot be reasoned about, since they are all inside the CPython state machine.
Memory performance is about being efficient with allocations, cache locality, and pointer chasing.
Python has huge amounts of allocations and pointer chasing. It's slow by design.
It is essentially impossible, "essentially" here using not the modern sense of "mostly", but "essential" as in baked into the essence of the language. It has been the advice in the Python community pretty much since the beginning that the solution is to go to another language in that case. There are a number of solutions to the problem, ranging from trying PyPy, implementing an API in another language, Cython/Pyrex, up to traditional embedding of a C/C++ program into a Python module.
However this is one of those cases where there are a lot of solutions precisely because none of them are quite perfect and all have some sort of serious downside. Depending on what you are doing, you may find one whose downside you don't care about. But there's no simple, bullet-proof cookbook answer for "what do I do when Python is too slow even after basic optimization".
Python is fundamentally slow. Too many people still hear that as an attack on the language, rather than an engineering fact that needs to be kept in mind. Speed isn't everything, and as such, Python is suitable for a wide variety of tasks even so. But it is, still, fundamentally slow, and those who read that as an attack rather than an engineering assessment are more likely to find themselves in quite a pickle (pun somewhat intended) one day when they have a mass of Python code that isn't fast enough and no easy solutions for the problem than those who understand the engineering considerations in choosing Python.
Then just looking at those, I now know of Shedskin and ComPyler.
I do feel like one nice thing about so many people working on solutions to every problem in python, is that it means that when you do encounter a serious downside, you have a lot more flexibility to move forwards along a different path.
Python is a great language for rapidly bashing out algorithmic code, glue scripts, etc. Unfortunately, due to how dynamic it is, it's a language that fundamentally doesn't translate well to operations CPUs can perform efficiently. Hardly any Python programs ever need to be as dynamic as what the language allows.
I've had very good experiences applying cython to python programs that need to do some kind of algorithmic number crunching, where numpy alone doesn't get the job done.
With Cython you start with your Python code and incrementally add static typing that reduces the layers of Python interpreter abstractions and wrappings necessary. Cython has a very useful and amusing output mode where it spits out an HTML report of annotated Cython source code, with lines highlighted in yellow in proportion to the amount of Python overhead. You click on any line of Python in that report and it expands to show how many CPython API operations are required to implement it; then you add more static type hints and recompile until the yellow goes away and the compute-heavy kernel of your script is a C program that compiles to operations that real-world CPUs can execute efficiently.
Downside of Cython is the extra build toolchain and deployment concerns it drags in - if you previously had a pure-Python module, now you've got native modules, so you need to bake platform-specific wheels for each deployment target.
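A sketch of that workflow using Cython's pure-Python mode (the type annotations come from the `cython` shim module, so the file still runs under plain CPython too; compiling it with `cython -a` produces the annotated HTML report described above):

    import cython
    from random import random

    def monte_carlo_pi(n: cython.int) -> cython.double:
        inside: cython.int = 0
        i: cython.int
        x: cython.double
        y: cython.double
        for i in range(n):
            x = random()
            y = random()
            if x * x + y * y <= 1.0:
                inside += 1  # with the types above this becomes plain C arithmetic
        return 4.0 * inside / n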
For Python that's being used as a glue scripting language, not number crunching, worth considering rewriting the script in something like Go. Go has a pretty good standard library to handle many tasks without needing to install many 3rd party packages, and the build, test, deploy story is very nice.
If what you're trying to do involves tasks that can be done in parallel, the multithreading (if I/O bound) or multiprocessing (if compute bound) libraries can be very useful.
If what you're doing isn't conducive to either, you probably need to rewrite at least the critical parts in something else.
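A minimal multiprocessing sketch along those lines, reusing the thread's Monte Carlo example (the worker count is illustrative):

    from multiprocessing import Pool
    from random import random

    def count_inside(n):
        # CPU-bound worker: runs in a separate process, sidestepping the GIL.
        return sum(random() ** 2 + random() ** 2 <= 1.0 for _ in range(n))

    if __name__ == "__main__":
        total, workers = 10_000_000, 8
        with Pool(workers) as pool:
            hits = sum(pool.map(count_inside, [total // workers] * workers))
        print(f"Estimated pi: {4.0 * hits / total}")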
But even though Python is very slow, it is still very popular. So the language itself must be very good, in my view; otherwise fewer people would use it.
Move your performance critical kernels outside Python. Numpy, Pytorch, external databases, Golang microservice, etc.
It's in fact an extremely successful paradigm. Python doesn't need to be fast to be used in every domain (though async/nogil is still worth advancing to avoid idle CPUs).
If your Python software is spending 99% of its time doing X, you can rewrite X in Rust and now the Python software using that is faster, but if your Python software is spending 7% of its time doing X, and 5% doing Y, and 3% doing Z then even if you rewrite X, and Y and Z you've shaved off no more than 15% and so it's probably time to stop writing Python at all.
This is part of why Google moved to Go as I understand it.
Exactly, that's one case where the OSes completely differ. Just because the OSes share the same kernel, which prioritizes userspace compatibility, doesn't mean they are not a different OS. A kernel doesn't make an OS.
Or you can just put it on PyPI which is everywhere nowadays.
And that wasn't true in Ruby(?!)
The first thing to do when optimizing Python is to use a JIT (PyPy). Like, yes Python is slow if you don't allow a JIT.
Alternatively, go the other JIT-less direction but use C (and AOT language).
Otherwise, optimizing code in Python is the same as in any other language eg
Fuck, this made me feel old. It blows my mind that the learning path Rust then Python even exists.
When I started school in 2000, it was C, then C++, and then Java. Oh, and then compilers. It’s impossible for me not to think about allocation when doing anything, it’s just rooted in my mind.
The prominence of Python for those students makes sense to me, thirty years ago I'm sure a BASIC would have been fine too. It doesn't need to be elegant, or have good performance, it does matter that it's free, and that there's a lot of existing crap written in Python you can crib from.
(This plays hell on things like Valgrind or ASAN.)
zahlman•3mo ago
> Let’s take out the print statement and see if it’s just the addition:
Just FWIW: the assignment is not required to prevent optimizing out the useless addition. It isn't doing any static analysis, so it doesn't know that `range` is the builtin, and thus doesn't know that `i` is an integer, and thus doesn't know that `+` will be side-effect-free.
> Nope, it seems there is a pre-allocated list of objects for integers in the range of -5 -> 1025. This would account for 1025 iterations of our loop but not for the rest.
1024 iterations, because the check is for numbers strictly less than `_PY_NSMALLPOSINTS` and the value computed is `i + 1` (so, `1` on the first iteration).
Interesting. I knew of them only ranging up to 256 (https://stackoverflow.com/questions/306313).
It turns out (https://github.com/python/cpython/commit/7ce25edb8f41e527ed4...) that the change is barely a month old in the repository; so it's not in 3.14 (https://github.com/python/cpython/blob/3.14/Include/internal...) and won't show up until 3.15.
> Our script appears to actually be reusing most of the PyLongObject objects!
The interesting part is that it can somehow do this even though the values are increasing throughout the loop (i.e., to values not seen on previous iterations), and it also doesn't need to allocate for the value of `i` retrieved from the `range`.
> But realistically the majority of integers in a program are going to be less than 2^30 so why not introduce a fast path which skips this complicated code entirely?
This is the sort of thing where PRs to CPython are always welcome, to my understanding. It probably isn't a priority, or something that other devs have thought of, because that allocation presumably isn't a big deal compared to the time taken for the actual conversion, which in turn is normally happening because of some kind of I/O request. (Also, real programs probably do simple arithmetic on small numbers much more often than they string-format them.)
petters•3mo ago
I think that is mostly of historical interest. For example, it still does not support Python 3 and has not been updated in a very long time