Can cross-fertilization between the PyPy and CPython JIT efforts help the already-fast PyPy get even faster? Like, did the CPython JIT team try something PyPy developers didn't attempt before?
PyPy is awesome, btw.
The biggest differences between the two JITs are:

1. PyPy is meta-tracing, CPython is tracing.
2. PyPy has "standard" code generation backends, CPython has copy&patch.
3. CPython so far uses "trace projection" while PyPy uses "trace recording".
(1+2) make the CPython JIT much faster to compile and warm up than PyPy, although I suspect that most of the gain comes from (1). However, this comes at the expense of generality: in PyPy you can automatically trace across all builtins, whereas in CPython you are limited to the bytecode.
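To make (1) concrete, this is roughly what writing an interpreter for meta-tracing looks like. A minimal sketch in the style of the public RPython tutorials: `JitDriver` and `jit_merge_point` are real RPython API, but the toy bytecode is made up, and the hints only do anything once the interpreter is translated with the RPython toolchain:

```python
# Minimal meta-tracing sketch in the style of the RPython tutorials.
from rpython.rlib.jit import JitDriver

# "greens" identify the position in the *user* program (constant per loop);
# "reds" are the mutable interpreter state. That's roughly all the JIT needs.
driver = JitDriver(greens=['pc', 'bytecode'], reds=['acc'])

def interpret(bytecode):
    pc = 0
    acc = 0
    while pc < len(bytecode):
        # Hint: a user-level loop may close here. The meta-tracer records the
        # interpreter itself executing iterations and compiles that trace, so
        # you write only the interpreter and get the JIT for free.
        driver.jit_merge_point(pc=pc, bytecode=bytecode, acc=acc)
        op = bytecode[pc]
        if op == 'I':        # increment the accumulator
            acc += 1
            pc += 1
        elif op == 'J':      # jump back to the start: a hot user-level loop
            pc = 0
        else:
            break
    return acc
```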
Trace projection looked very interesting to me because it automatically solves a problem which I found everywhere in real-world code: if you do trace recording, you don't know whether you will actually be able to close the loop, so you must decide to give up after a certain threshold ("trace too long"). The problem is that there doesn't seem to be a threshold which is generally good, so you always end up either tracing too much (big warmup costs, plus you are literally doing unnecessary work) or not enough (the generated code is less optimal, sometimes by 5-10x).
With trace projection you decide which loop to optimize "in retrospect", so you don't have that specific problem. However, you have OTHER problems (in particular, you don't know the actual values used in the trace), which make it harder to optimize, so the CPython JIT plans to switch to trace recording.
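To illustrate why the threshold hurts, here's a toy model of a trace recorder (entirely hypothetical code, not CPython's or PyPy's):

```python
# Entirely hypothetical trace recorder, just to illustrate "trace too long".
TRACE_LIMIT = 200  # there is no universally good value for this

def record_trace(bytecode, loop_header):
    """Run ops from loop_header, logging them, until the loop closes."""
    trace = []
    pc = loop_header
    while True:
        op = bytecode[pc]
        trace.append(op)                 # record the op for later compilation
        pc = pc + 1 if op != 'JUMP_BACK' else loop_header
        if pc == loop_header:
            return trace                 # loop closed: we can compile this
        if len(trace) > TRACE_LIMIT:
            return None                  # give up: all the recording work is
                                         # thrown away and the loop stays on
                                         # the slow interpreted path

# A short loop body closes quickly; one that runs through lots of inlined
# code may blow the limit even though it would have closed eventually.
record_trace(['INC', 'INC', 'JUMP_BACK'], 0)  # -> ['INC', 'INC', 'JUMP_BACK']
```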
The problem with cpyext is that it's super slow, for good reasons: https://pypy.org/posts/2018/09/inside-cpyext-why-emulating-c...
There are efforts to create a new C API which is friendlier to alternative implementations (including CPython itself, when they want to change how they do things internally): https://hpyproject.org/ https://github.com/py-ni
Diagram: https://docs.julialang.org/en/v1/devdocs/img/compiler_diagra...
Documentation: https://docs.julialang.org/en/v1/devdocs/eval/
From what I understand, Julia doesn't do any tracing at all; it just compiles each function based on the types it receives. Obviously Python doesn't have multiple dispatch, but that might actually make compilation easier. Swap out the LLVM step for Python's IR and they could probably expect a pretty substantial performance improvement. That said, I don't know anything about compilers; I just use both Python and Julia.
I'm not sure exactly how it differs from most JavaScript JITs, but I believe it just compiles each method once for each set of function argument types - for example, it doesn't try to dynamically determine the types of local variables.
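That strategy can be modeled with a toy specialization cache (hypothetical code, not Julia's or any JS engine's actual machinery): compile once per tuple of argument types, then reuse.

```python
# Toy per-argument-type specialization cache (illustrative only; a real
# engine compiles to machine code here, we just memoize a closure).
_specializations = {}

def specialize(fn, arg_types):
    # Stand-in for "compile fn assuming these argument types".
    def compiled(*args):
        return fn(*args)
    return compiled

def call_specialized(fn, *args):
    key = (fn, tuple(type(a) for a in args))
    if key not in _specializations:
        _specializations[key] = specialize(fn, key[1])  # compile once per type tuple
    return _specializations[key](*args)

def add(a, b):
    return a + b

call_specialized(add, 1, 2)      # compiles add for (int, int), then reuses it
call_specialized(add, 1.0, 2.0)  # separate specialization for (float, float)
```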
One big advantage of tracing JITs is that they are generally easier to write and to maintain. For the specific case of PyPy, it's actually a "meta-tracing JIT": you trace the interpreter, not the underlying program, which TL;DR means that you can write the interpreter (which is "easy") and you get a JIT compiler for free. The basic assumption of a tracing JIT is that you have one or more "hot loops" in which one (or a few) fast paths are taken most of the time.
If the assumption holds, tracing has big advantages, because you eliminate most of the dynamism and you automatically inline across multiple layers of function calls, which in turn makes it possible to eliminate the allocation of most temporary objects. The problem is that the assumption doesn't always hold, and that's where you start to get problems.
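For example, this is the kind of loop where the assumption holds (plain Python; only the comments describing what a tracer would do are mine):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def norm(self):
        return (self.x * self.x + self.y * self.y) ** 0.5

def total(points):
    s = 0.0
    for p in points:
        # If p is (almost) always a Point, the trace guards on its type,
        # inlines p.norm() and the float math into one straight-line path,
        # and avoids allocating intermediate boxed floats.
        s += p.norm()
    return s

# If `points` mixes many types, each guard failure takes a side exit and may
# trigger tracing of yet another path: the assumption breaks, and so does perf.
```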
But method JITs are not THE solution either. Meta has a whole team developing Cinder, which is a method JIT for Python, but they had to introduce what they call "Static Python", an opt-in set of constraints that removes some of Python's dynamism to make the JIT's job easier.
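I won't vouch for Static Python's exact syntax, but the flavor of the constraints is roughly this (an illustrative sketch, not Cinder's actual API): you give up the dynamism the JIT can't see through.

```python
# Illustrative only -- NOT Cinder's actual Static Python syntax. The point is
# the kind of dynamism you trade away so a method JIT can trust its assumptions.
from typing import final

@final                      # promise: no subclasses, so calls can be devirtualized
class Vec:
    __slots__ = ('x', 'y')  # promise: fixed layout, so attribute access becomes
                            # a fixed-offset load instead of a dict lookup

    def __init__(self, x: float, y: float) -> None:
        self.x = x          # annotations treated as enforced types,
        self.y = y          # not mere hints

    def dot(self, other: "Vec") -> float:
        return self.x * other.x + self.y * other.y
```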
Finally, as soon as you call any C extension, any JIT is out of luck and must deoptimize to present a "state of the world" which is compatible with what the C extension expects to find.
pizlonator•4mo ago
JS JITs (the production ones, like JSC's) have no such thing as trace blockers that prevent the surrounding code from being optimized. You might have an operation (like a call to some wacky native function) that is itself not optimized, but that won't have any impact on the JIT's ability to optimize the code surrounding that operation.
Tracing is just too much of a benchmark hack overall, IMO. Tracing would only be a good idea in a world where it's too expensive to run a real optimizing JIT. But the JS, Java, and .NET experiences show that a real optimizing JIT - with all of its compile-time costs - is exactly what you want, because it results in predictable speed-ups.
pizlonator•4mo ago
When we talk about JS or Java JITs working well, we are making statements based on intense industry competition where if a JIT had literally any shortcoming then a competitor would highlight it in competitive benchmarking and blog posts. So, the competition forced aggressive improvements and created a situation where the top JITs deliver reliable perf across lots of workloads.
OTOH PyPy is awesome but just hasn’t had to face that kind of competitive challenge. So we probably can’t know how far off from JS JITs it is.
One thing I can say is that when I compared it to JSC by writing the same benchmark in both Python and JS, JSC beat it by 4x or so.
cogman10•4mo ago
For example, static initialization on classes: the JDK has a billion different classes, and on startup a not-insignificant fraction of those end up getting loaded for all but the simplest applications.
Essentially, the Java and JS JITs both initially run everything interpreted, and when a hot method is detected they progressively start spending time sending those methods and their statistics to more aggressive JIT compilers.
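A toy model of that tiering (hypothetical code; real VMs compile to machine code and collect much richer profiles than a call counter):

```python
# Toy tiered-execution model (hypothetical; only illustrates the shape).
HOT_THRESHOLD = 1000

class TieredFunction:
    def __init__(self, fn):
        self.fn = fn          # tier 0: the "interpreted" version
        self.calls = 0
        self.compiled = None  # tier 1: stand-in for optimized code

    def __call__(self, *args):
        if self.compiled is not None:
            return self.compiled(*args)   # fast path once promoted
        self.calls += 1
        if self.calls >= HOT_THRESHOLD:
            # Promote: in a real VM this hands the method plus its profiling
            # statistics to an optimizing compiler running in the background.
            self.compiled = self.fn
        return self.fn(*args)

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

hot_fib = TieredFunction(fib)  # stays "interpreted" until it proves hot
```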
A non-insignificant amount of time is being spent trying to make Java start faster, and a key portion of that is resolving the class-loading problem.
cogman10•4mo ago
That's similar to how JS does things.
Java does have a "client" optimization mode for shorter-lived processes (GUIs, for example), and AFAIK it's basically unused at this point. The more aggressive "server" optimizations are faster than ever and get triggered pretty aggressively now. The nature of the JVM is also changing: with fast scaling and containerization, a slow start and a long warmup aren't good. That's why part of JDK development has been dedicated to resolving that.
pjmlp•4mo ago
All commercial JVMs have had JIT caches for quite some time, and this is finally also available as free beer on OpenJDK, so code can execute right away as if it were an AOT-compiled language.
In some of those implementations, the JIT cache gets updated after each execution, taking profiling data into account, so we have the possibility of reaching an optimal state across the lifetime of the executable.
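A toy model of such a persistent profile (hypothetical; real JIT caches store compiled code and far richer data than this): remember what was hot last run, so the next run can optimize those methods right at startup.

```python
# Hypothetical persistent "JIT profile"; file name and format are made up.
import json, os

PROFILE = "jit_profile.json"

def load_profile():
    # At startup: anything hot in a previous run is treated as hot now,
    # instead of waiting for counters to climb all over again.
    if os.path.exists(PROFILE):
        with open(PROFILE) as f:
            return json.load(f)       # {function_name: call_count}
    return {}

def save_profile(profile):
    # Updated after every execution, so the cache keeps converging toward
    # the program's real hot set across its lifetime.
    with open(PROFILE, "w") as f:
        json.dump(profile, f)
```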
The .NET and ART cousins also have similar mechanisms in place.
Which I guess is what your last sentence refers to, but I wasn't sure.
cogman10•4mo ago
Yup, the CDS and now AOT stuff in OpenJDK is what I was referring to. Project Leyden.
pizlonator•4mo ago
It's true that there are some necessary pessimizations, but nothing as severe as failing to optimize the code at all.
ksec•4mo ago
But LuaJIT is also a tracing JIT, which seems to work well enough.
pizlonator•4mo ago
I've heard that LuaJIT has more stable perf than Mozilla's tracing JIT had, but I've also heard plenty of stories about how flaky LuaJIT's performance is. But we can't know how good it really is due to the lack of competitors.