I’m sharing an early benchmark: a GPIO test where PyXL achieves a 480ns round-trip toggle, compared to 14-25 microseconds on a MicroPython Pyboard, even though PyXL runs at a lower clock (100MHz vs. 168MHz).
The design is stack-based, fully pipelined, and preserves Python's dynamic typing without static type restrictions. I independently developed the full stack — toolchain (compiler, linker, codegen) and hardware — to validate the core idea. Full technical details will be presented at PyCon 2025.
Demo and explanation here: https://runpyxl.com/gpio
Happy to answer any questions.
* Could you share the assembly language of the processor?
* What is the benefit of designing the processor and making a Python bytecode compiler for it, vs making a bytecode compiler for an existing processor such as ARM/x86/RISCV?
HDL: Verilog
Assembly: The processor executes a custom instruction set called PySM (Not very original name, I know :) ). It's inspired by CPython Bytecode — stack-based, dynamically typed — but streamlined to allow efficient hardware pipelining. Right now, I’m not sharing the full ISA publicly yet, but happy to describe the general structure: it includes instructions for stack manipulation, binary operations, comparisons, branching, function calling, and memory access.
Why not ARM/X86/etc... Existing CPUs are optimized for static, register-based compiled languages like C/C++. Python’s dynamic nature — stack-based execution, runtime type handling, dynamic dispatch — maps very poorly onto conventional CPUs, resulting in a lot of wasted work (interpreter overhead, dynamic typing penalties, reference counting, poor cache locality, etc.).
I’m guessing due to the lack of JIT, it’s executed on the host?
If you refer to the ARM part as the host (did you?), it's just orchestrating the whole thing; it doesn't run the actual Python program.
Maybe not AWS Lambda specifically, but definitely server-side acceleration — especially for machine learning feature generation, backend control logic, and anywhere pure Python becomes a bottleneck.
It could definitely get there — but it would require building a full-scale deployment model and much broader library and dynamic feature support.
That said, the underlying potential is absolutely there.
What's missing so you could create a demo for VCs or the relevant companies, proving the potential of this as a competitive server-class core?
PyXL today is aimed more at embedded and real-time systems.
For server-class use, I'd need to mature heap management, add basic concurrency, a simple network stack, and gather real-world benchmarks (like requests/sec).
That said, I wouldn’t try to fully replicate CPython for servers — that's a very competitive space with a huge surface area.
I'd rather focus on specific use cases where deterministic, low-latency Python execution could offer a real advantage — like real-time data preprocessing or lightweight event-driven backends.
When I originally started this project, I was actually thinking about machine learning feature generation workloads — pure Python code (branches, loops, dynamic types) without heavy SIMD needs. PyXL is very well suited for that kind of structured, control-flow-heavy workload.
If I wanted to pitch PyXL to VCs, I wouldn’t aim for general-purpose servers right away. I'd first find a specific, focused use case where PyXL's strengths matter, and iterate on that to prove value before expanding more broadly.
I'd love to read about the design process. I think the idea of taking bytecode aimed at the runtime of dynamic languages like Python or Ruby or even Lisp or Java and making custom processors for that is awesome and (recently) under-explored.
I'd be very interested to know why you chose to start this, why it was a good idea, and how you went about the implementation (in broad strokes if necessary).
There are definitely some limitations beyond just memory or OS interaction. Right now, PyXL supports a subset of real Python. Many features from CPython are not implemented yet — this early version is mainly to show that it's possible to run Python efficiently in hardware. I'd prefer to move forward based on clear use cases, rather than trying to reimplement everything blindly.
Also, some features (like heavy runtime reflection, dynamic loading, etc.) would probably never be supported, at least not in the traditional way, because the focus is on embedded and real-time applications.
As for the design process — I’d love to share more! I'm a bit overwhelmed at the moment preparing for PyCon, but I plan to post a more detailed blog post about the design and philosophy on my website after the conference.
Is it tied to a particular version of python?
Right now, PyXL is tied fairly closely to a specific CPython version's bytecode format (I'm targeting CPython 3.11 at the moment).
That said, the toolchain handles translation from Python source → CPython bytecode → PyXL Assembly → hardware binary, so in principle adapting to a new Python version would mainly involve adjusting the frontend — not reworking the hardware itself.
Longer term, the goal is to stabilize a consistent subset of Python behavior, so version drift becomes less painful.
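For anyone curious what that version-sensitive frontend stage amounts to, here is a minimal sketch using only the standard library; the example function is made up, and the PySM translation that would follow is not shown since the ISA isn't public.

    import dis
    import sys

    # Frontend stage only: Python source -> CPython bytecode.
    # This is the version-sensitive part (e.g. 3.10's BINARY_ADD became
    # 3.11's BINARY_OP); the PySM translation that follows it is not shown.
    source = (
        "def blink(pin, n):\n"
        "    for _ in range(n):\n"
        "        pin.toggle()\n"
    )

    module = compile(source, "<demo>", "exec")
    blink_code = next(c for c in module.co_consts if hasattr(c, "co_code"))

    print("bytecode for", sys.version.split()[0])
    for instr in dis.get_instructions(blink_code):
        print(f"{instr.offset:4} {instr.opname:20} {instr.argrepr}")

Running this under two different CPython versions shows exactly where a new release would require frontend changes.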
Having this, the next tempting step is to make the `print` function work, then the filesystem wrapper, etc.
btw - what I'm missing is clear information about limitations. It's definitely not true that I can take any Python snippet and run it using PyXL (for example threads, I suppose?)
Peripheral drivers (like UART, SPI, etc.) are definitely on the roadmap - They'd obviously be implemented in HW. You're absolutely right — once you have basic IO, you can make things like print() and filesystem access feel natural.
Regarding limitations: you're right again. PyXL currently focuses on running a subset of real Python — just enough to show it's real Python and to prove the core concept, while keeping the system small and efficient for hardware execution. I'm intentionally holding off on implementing higher-level features until there's a real use case, because embedded needs can vary a lot, and I want to keep the system tight and purpose-driven.
Also, some features (like threads, heavy runtime reflection, etc.) will likely never be supported — at least not in the traditional way — because PyXL is fundamentally aimed at embedded and real-time applications, where simplicity and determinism matter most.
This is still an early-stage project — it's not completed yet, and fabricating a custom chip would involve huge costs.
I'm a solo developer who worked on this in my spare time, so FPGA was the most practical way to prove the core concepts and validate the architecture.
Longer term, I definitely see ASIC fabrication as the way to unlock PyXL’s full potential — but only once the use case is clear and the design is a little more mature.
I find the idea of a processor designed for a specific very high level language quite interesting. What made you choose python and do you think it's the "correct" language for such a project? It sure seems convenient as a language but I wouldn't have thought it is best suited for that task due to the very dynamic nature of it. Perhaps something like Nim which is similar but a little less dynamic would be a better choice?
This is a great approach for many applications, but it doesn’t fit all use cases.
PyXL is a hardware solution — a custom processor designed specifically to run Python programs directly.
It's currently focused on embedded and real-time environments where JIT compilation isn't a viable option due to memory constraints, strict timing requirements, and the need for deterministic behavior.
> No VM, No C, No JIT. Just PyXL.
Is the main goal to achieve C-like performance with the ease of writing Python? Do you have a performance comparison against C? Is the main challenge the memory management?
> PyXL runs on a Zynq-7000 FPGA (Arty-Z7-20 dev board). The PyXL core runs at 100MHz. The ARM CPU on the board handles setup and memory, but the Python code itself is executed entirely in hardware. The toolchain is written in Python and runs on a standard development machine using unmodified CPython.
> PyXL skips all of that. The Python bytecode is executed directly in hardware, and GPIO access is physically wired to the processor — no interpreter, no function call, just native hardware execution.
Did you write some sort of emulation to enable testing it without the physical Arty board?
Performance comparison against C: I don't have a formal benchmark directly against C yet. The early GPIO benchmark (480ns toggle) is competitive with hand-written C on ARM microcontrollers — even when running at a lower clock speed. But a full systematic comparison (across different workloads) would definitely be interesting for the future.
Main challenge: Yes — memory management is one of the biggest challenges. Dynamic memory allocation and garbage collection are tricky to manage efficiently without breaking real-time guarantees. I have a roadmap for it, but would like to stick to a real use case before moving forward.
Software emulation: I am using Icarus (could use Verilator) for RTL simulation if that's what you meant. But hardware behavior (like GPIO timing) still needs to be tested on the real FPGA to capture true performance characteristics.
It could probably be adapted into one.
But I ultimately decided to build it as its own clean design because I wanted the flexibility to rethink the entire execution model for Python — not just adapt an existing register-based architecture.
FPGA is for prototyping, although this could probably be used as a soft core. But looking forward, ASIC is definitely the way to go.
fun :-)
but did I get it right?
It's a custom stack-based hardware processor tailored for executing Python programs directly. Instead of traditional microcode, it uses a Python-specific instruction set (PySM) that hardware executes.
The toolchain compiles Python → CPython Bytecode → PySM Assembly → hardware binary.
- CPython Bytecode is far from stable; it changes every version, sometimes changing the behaviour of existing bytecodes. As a result, you are pinned to a specific version of Python unless you make multiple translators.
- CPython Bytecode is poorly documented, with some descriptions being misleading/incorrect.
- CPython Bytecode requires restoring the stack on exception, since it keeps a loop iterator on the stack instead of in a local variable.
I recommend instead doing CPython AST → PySM Assembly. CPython AST is significantly more stable.
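For what it's worth, a toy version of that AST-first frontend fits in a few lines with the standard `ast` module. The `PySMFrontend` class and the instruction names below are invented for illustration; they are not PyXL's actual toolchain.

    import ast

    # Toy AST-based frontend: walk the tree and emit stack-machine ops.
    # The opcode names here are made up; the real PySM ISA is not public.
    class PySMFrontend(ast.NodeVisitor):
        def __init__(self):
            self.instructions = []

        def emit(self, op, arg=None):
            self.instructions.append((op, arg))

        def visit_BinOp(self, node):
            self.visit(node.left)
            self.visit(node.right)
            self.emit("BINARY_OP", type(node.op).__name__)  # e.g. Add, Mult

        def visit_Constant(self, node):
            self.emit("PUSH_CONST", node.value)

        def visit_Name(self, node):
            self.emit("LOAD_NAME", node.id)

    tree = ast.parse("x + 2 * y", mode="eval")
    frontend = PySMFrontend()
    frontend.visit(tree.body)
    print(frontend.instructions)

The AST shape for an expression like this has barely changed across recent CPython releases, which is the stability argument in a nutshell.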
That then makes me wonder if someone could implement Excel in hardware! (Or something like it)
I have seen people do that for closed-source software that is distributed to end-users, because it makes reverse engineering and modding (a bit) more complicated.
Those, at runtime (and, nowadays, optionally also at compile time), convert that to native code. Python doesn’t; it runs a bytecode interpreter.
Reason Python doesn’t do that is a mix of lack of engineering resources, desire to keep the implementation fairly simple, and the requirement of backwards compatibility of C code calling into Python to manipulate Python objects.
- The vast majority of code generation would have to be dynamic dispatches, which would not be too different from CPython's bytecode.
- Types are dynamic; the methods on a type can change at runtime due to monkey patching. As a result, the compiler must be able to "recompile" a type at runtime (and thus, you cannot ship optimized target files).
- There are multiple ways every single operation in Python might be called; for instance `a.b` either does a __dict__ lookup or a descriptor lookup, and you don't know which method is used unless you know the type (and if that type is monkeypatched, the method that gets called might change; see the sketch below).
A JIT compiler might be able to optimize some of these cases (observing what is the actual type used), but a JIT compiler can use the source file/be included in the CPython interpreter.
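A tiny illustration of that monkey-patching problem in plain CPython (nothing PyXL-specific): the call site never changes, but the code behind it does, so an ahead-of-time compiler can't pin it down.

    # The method behind s.read() can be swapped at runtime.
    class Sensor:
        def read(self):
            return 42

    s = Sensor()
    print(s.read())          # 42, resolved through normal attribute lookup

    def fake_read(self):     # monkey patch: replace the method on the class
        return -1

    Sensor.read = fake_read
    print(s.read())          # -1, same call site, different code, no recompile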
I'd add that even beyond types, late binding is fundamental to Python’s dynamism: Variables, functions, and classes are often only bound at runtime, and can be reassigned or modified dynamically.
So even if every object had a type annotation, you would still need to deal with names and behaviors changing during execution — which makes traditional static compilation very hard.
That’s why PyXL focuses more on efficient dynamic execution rather than trying to statically "lock down" Python like C++.
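Concretely, here's the name-rebinding case in plain Python: even if every value were fully typed, the function a call site refers to can change under it at run time.

    # Late binding: the name `scale` is looked up at call time, not compile time.
    def scale(x):
        return x * 2

    def process(values):
        return [scale(v) for v in values]

    print(process([1, 2, 3]))    # [2, 4, 6]

    def scale(x):                # rebind the module-level name
        return x * 10

    print(process([1, 2, 3]))    # [10, 20, 30] -- same bytecode, new target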
If you define it as trying to compile Python in such a way that you would get the ability to do optimizations and get performance boosts and such, you end up at PyPy. However that comes with its own set of tradeoffs to get that performance. It can be a good set of tradeoffs for a lot of projects but it isn't "free" speedup.
For example, if you compile x + y in C, you'll get a few clean instructions that add x and y according to their statically known data types. But if you compile the same thing with some sort of Python compiler, it would essentially have to include the entire Python interpreter; because it can't know what x and y are at compile time, there necessarily has to be some runtime logic that unwraps the values, determines which "add" to call, and so forth.
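You can see that hidden dispatch directly with the standard `dis` module: the single add opcode (BINARY_OP on 3.11, BINARY_ADD on earlier versions) is where all of the runtime type logic has to happen.

    import dis

    add = lambda x, y: x + y

    # One opcode, many behaviors: the interpreter decides at run time whether
    # this is int addition, string concatenation, list concat, a custom __add__ ...
    dis.dis(add)

    print(add(1, 2))        # 3
    print(add("a", "b"))    # 'ab'
    print(add([1], [2]))    # [1, 2]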
If you don't want to include the interpreter, then you'll have to add some sort of static type checker to Python, which is going to reduce the utility of the language and essentially bifurcate it into annotated code you can compile, and unannotated code that must remain interpreted at runtime that'll kill your overall performance anyway.
That's why projects like Mojo exist and go in a completely different direction. They are saying "we aren't going to even try to compile Python. Instead we will look like Python, and try to be compatible, but really we can't solve these ecosystem issues so we will create our own fast language that is completely different yet familiar enough to try to attract Python devs."
Optimized libraries (e.g. numpy, Pandas, Polars, lxml, ...) are the idiomatic way to speed up "the parts that don't need to be in pure Python." Python subsets and specializations (e.g. PyPy, Cython, Numba) fill in some more gaps. They often use much tighter, stricter memory packing to get their speedups.
For the most part, with the help of those lower-level accelerations, Python's fast enough. Those who don't find those optimizations enough tend to migrate to other languages/abstractions like Rust and Julia because you can't do full Python without the (high and constant) cost of memory access.
So, no using C libraries. That takes out a huge chunk of pip packages...
That said, in future designs, PyXL could work in tandem with a traditional CPU core (like ARM or RISC-V), where C libraries execute on the CPU side and interact with PyXL for control flow and Python-level logic.
There’s also a longer-term possibility of compiling C directly to PyXL’s instruction set by building an LLVM backend — allowing even tighter integration without a second CPU.
Right now the focus is on making native Python execution viable and efficient for real-time and embedded systems, but I definitely see broader hybrid models ahead.
This is amazing, great work!
I think building a CPU that can only do this is a really novel idea and am really interested in seeing when you eventually disclose more implementation details. My only complaint is that it isn't Lua :P
- Lisp/lispm
- Ada/iAPX
- C/ARM
- Java/Jazelle
Most don't really take off or go in different directions as the language goes out of fashion.
https://mn416.github.io/reduceron-project/
These range from a few instructions to accelerate certain operations, to marking memory for the garbage collector, to much deeper efforts.
Historically their performance is underwhelming. Sometimes competitive on the first iteration, sometimes just mid. But generally they can't iterate quickly (insufficient resources, insufficient product demand) so they are quickly eclipsed by pure software implementations atop COTS hardware.
This particular Valley of Disappointment is so routine as to make "let's implement this in hardware!" an evergreen tarpit idea. There are a few stunning exceptions like GPU offload—but they are unicorns.
Very cool project still
As for the future, I’m keeping an open mind. It would be exciting if it grew into something bigger, but my main focus for now is making sure the foundation is as solid and clean as possible.
In theory, you could build a CPU that directly interprets Python bytecode — but Python bytecode is quite high-level and irregular compared to typical CPU instructions. It would add a lot of complexity and make pipelining much harder, which would hurt performance, especially for real-time or embedded use.
By compiling the Python bytecode ahead of time into a simpler, stack-based ISA (what I call PySM), the CPU can stay clean, highly pipelined, and efficient. It also opens the door in the future to potentially supporting other languages that could target the same ISA!
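To make that ahead-of-time step concrete, here is a toy translation pass from CPython bytecode to an invented stack ISA. The opcode names on the right are made up for illustration only; the real PySM instruction set hasn't been published.

    import dis

    # Hypothetical AOT pass: CPython bytecode -> a simpler stack ISA.
    # The target opcode names are invented; unmapped CPython opcodes fall
    # back to a placeholder so the sketch stays runnable on any 3.11 code.
    CPY_TO_PYSM = {
        "LOAD_FAST": "PUSH_LOCAL",
        "LOAD_CONST": "PUSH_CONST",
        "BINARY_OP": "BINOP",
        "RETURN_VALUE": "RET",
    }

    def translate(func):
        out = []
        for instr in dis.get_instructions(func):
            op = CPY_TO_PYSM.get(instr.opname, f"EMULATE_{instr.opname}")
            out.append((op, instr.argrepr))
        return out

    def add(a, b):
        return a + b

    for op, arg in translate(add):
        print(op, arg)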
Every time I see a project that has a great implementation on an FPGA, I lament the fact that Tabula didn’t make it, a truly innovative and fast FPGA.
>Standard_NP10s instance, 1x AMD Alveo U250 FPGA (64GB)
Would be curious to see how this benchmarks on a faster FPGA, since I imagine clock frequency is the latency dictator - while memory and tile count can determine how many instances can run in parallel.
I'm interested to see whether the final feature set will be larger than what you'd get by creating a type-safe language with a pythonic syntax and compiling that to native, rather than building custom hardware.
The background garbage collection thing is easier said than done, but I'm talking to someone who has already done something impressively difficult, so...