Typo: two "the"
For robotics/inverse-pose applications, don't people usually use a rotation matrix (3x3), or a 4x4 homogeneous transform covering the three rotational and three translational DOF, for coordinate representation? Otherwise you get weird gimbal lock issues with Euler angles (I think).
This post and interactive explanations have been on my backlog to read and internalize: https://thenumb.at/Exponential-Rotations/
(Also: Thanks for pointing out the typo, I just deployed a fix.)
> solicit “why don’t you just …” emails from experienced practitioners who can point me to the library/tutorial I’ve been missing =D (see the alternatives-considered down the page for what I struck out on)
You should look for "post-training static quantization", also called PTSQ. There are countless ways to quantize; this one quantizes both the weights and the activations after training.
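Roughly what the eager-mode PTSQ flow looks like in PyTorch — a minimal sketch, not anyone's actual code; the model, layer sizes, and calibration data are placeholders:

    import torch
    import torch.nn as nn
    import torch.ao.quantization as tq

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = tq.QuantStub()      # marks where fp32 -> int8 happens
            self.fc1 = nn.Linear(8, 16)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(16, 2)
            self.dequant = tq.DeQuantStub()  # int8 -> fp32 at the output

        def forward(self, x):
            x = self.quant(x)
            x = self.relu(self.fc1(x))
            x = self.fc2(x)
            return self.dequant(x)

    torch.backends.quantized.engine = "qnnpack"   # ARM-oriented backend; use "fbgemm"/"x86" on desktop
    model = TinyNet().eval()                      # assume it's already trained
    model.qconfig = tq.get_default_qconfig("qnnpack")
    prepared = tq.prepare(model)                  # inserts observers

    # Calibration: run representative (not random) data through the model.
    with torch.no_grad():
        for _ in range(100):
            prepared(torch.randn(32, 8))          # placeholder calibration batches

    quantized = tq.convert(prepared)              # weights + activations now int8
    print(quantized)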
You're doing this on hard mode for no reason. This is typical, and something I often need to break people out of. Optimizing for performance by doing custom things in JAX when you're a beginner is a terrible path to take.
Performance is not your problem. You're training a trivial network that would have run on a CPU 20 years ago.
There's no clear direction here, just trying complicated stuff in no logical order, with no learning or dependencies between steps. You need to treat these problems as scientific experiments: what do I do to learn more about my domain, what do I change depending on the answer I get, etc. Not "now it's time to try something else random, like JAX."
Worse: you need to learn the key lesson in this space, which is that credit assignment for problems is extremely hard. If something isn't working, why isn't it? Because of a bug? A hopeless problem? A crappy optimizer? That's why you should start in a framework that works and escape it later if you want.
Here's a simple plan to do this:
First, forget about quantization. Use PyTorch. Implement your trivial network in 5 lines. Train it with Adam. Make sure it works. Make sure your problem is solvable with the data you have, the network you've chosen, your activation functions, the loss, and the optimizer (use Adam; forget about doing stuff by hand for now).
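Something like this, as a minimal sketch — the shapes, data, and hyperparameters are placeholders for the real problem:

    import torch
    import torch.nn as nn

    # Placeholder data: swap in the real inputs/targets for the problem.
    x = torch.randn(1024, 8)
    y = torch.randn(1024, 2)

    # The "trivial network": a tiny MLP.
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for step in range(2000):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        if step % 200 == 0:
            print(step, loss.item())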
> Unless I had an expert guide who was absolutely sure it’d be straightforward (email me!), I’d avoid high-level frameworks like TensorFlow and PyTorch and instead implement the quantization-aware training myself.
This is exactly backwards. Unless you have an expert, never implement anything yourself. If you don't have one, rely on what already exists, because that way you can logically narrow down the options for what works and what's wrong. If you do it yourself, you're always lost.
Once you have that working, start backing off. Slowly change the working network into what you need, step by step. At every step, write down why you think your change is good and what you would do if it isn't. Then look at the results.
Forget about microflow-rs or whatever. Train with PyTorch, export to ONNX, and generate C code from the ONNX model for inference.
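The export step is roughly this — a sketch, with the model, input shape, and file name as placeholders:

    import torch
    import torch.nn as nn

    # Placeholder model: in practice, the trained network from the previous step.
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2)).eval()

    dummy_input = torch.randn(1, 8)              # placeholder input shape
    torch.onnx.export(model, dummy_input, "model.onnx",
                      input_names=["input"], output_names=["output"])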
Read the PyTorch guide on PTSQ and use it.
1. You can, in fact, get rid of every FP instruction on M0. The trick is to pre‑bake the scale and zero_point into a single fixed‑point multiplier per layer (the dyadic form you mentioned). The formula is
    y = ((W x + b) * M) >> s

where M fits in an int32 and s is the power-of-two shift. You compute M and s once on the host (a rough sketch of that is below, after this list), write them as const tables, and your inner loop is literally a MAC followed by a multiply-and-shift. No soft-float library, no division.
2. CMSIS‑NN already gives you the fast int8 kernels. The docs are painful but you can steal just four files: arm_fully_connected_q7.c, arm_nnsupportfunctions.c, and their headers. On M0 this compiled to ~3 kB for me. Feed those kernels fixed‑point activations and you only pay for the ops you call.
3. Workflow that kept me sane
Prototype in PyTorch. Tiny net, ReLU, MSE, Adam, done.
torch.quantization.quantize_qat for quantization‑aware training. Export to ONNX, then run a one‑page Python script that dumps .h files with weight, bias, M, s.
Hand‑roll the inference loop in C. It is about 40 lines per layer, easy to unit‑test on the host with the same vectors you trained on.
By starting with a known‑good fp32 model you always have a checksum: the int8 path must match fp32 within tolerance or you know exactly where to look.
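Not from the comment above, but a minimal host-side sketch (Python) of the pieces described in items 1 and 3: decomposing the per-layer requantization scale into the dyadic pair (M, s), dumping the const tables into a .h file, and checking the integer path against a float reference. All scales, shapes, and names (quantize_multiplier, layer0.h, W0, ...) are made up for illustration, and zero points are assumed to be 0 (symmetric quantization):

    import numpy as np

    def quantize_multiplier(real_multiplier, bits=31):
        # Find (M, s) with M an int32 so that x * real_multiplier ~= (x * M) >> s.
        # Typical NN requantization scales are < 1, which this sketch assumes.
        assert 0.0 < real_multiplier < 1.0
        s = 0
        while real_multiplier < 0.5:          # normalize into [0.5, 1)
            real_multiplier *= 2.0
            s += 1
        M = int(round(real_multiplier * (1 << bits)))
        return M, s + bits

    # Per-layer requantization scale = scale_in * scale_w / scale_out (placeholders).
    scale_in, scale_w, scale_out = 0.02, 0.004, 0.05
    M, s = quantize_multiplier(scale_in * scale_w / scale_out)

    # Placeholder int8 weights / int32 bias for one tiny layer.
    rng = np.random.default_rng(0)
    W_q = rng.integers(-127, 128, size=(2, 8), dtype=np.int8)
    b_q = rng.integers(-1000, 1000, size=2, dtype=np.int32)

    # Dump the const tables a hand-rolled C loop would include.
    with open("layer0.h", "w") as f:
        f.write(f"const int8_t  W0[] = {{{', '.join(map(str, W_q.flatten()))}}};\n")
        f.write(f"const int32_t B0[] = {{{', '.join(map(str, b_q))}}};\n")
        f.write(f"const int32_t L0_MULT  = {M};\n")
        f.write(f"const int32_t L0_SHIFT = {s};\n")

    # Host-side reference of the integer inner loop: y = ((W x + b) * M) >> s.
    x_q = rng.integers(-127, 128, size=8, dtype=np.int8)
    acc = W_q.astype(np.int64) @ x_q.astype(np.int64) + b_q
    y_q = (acc * M) >> s                      # real code would also clamp to int8 range

    # Float reference: the int8 path should match this within ~1 LSB.
    y_f = (W_q * scale_w) @ (x_q * scale_in) + b_q * scale_in * scale_w
    print(y_q, np.round(y_f / scale_out).astype(int))

On the M0 side the same (M, s) pair turns into one int32 multiply plus an arithmetic right shift per output, which is what keeps the floating-point library out of the binary.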
"A Neural Network in 11 lines of Python (Part 1)": https://iamtrask.github.io/2015/07/12/basic-python-network/
https://arxiv.org/abs/2310.11453
https://github.com/cpldcpu/BitNetMCU/blob/main/docs/document...