Tiny Core Linux: a 23 MB Linux distro with graphical desktop

http://www.tinycorelinux.net/
136•LorenDB•2h ago•70 comments

HTML as an Accessible Format for Papers

https://info.arxiv.org/about/accessible_HTML.html
56•el3ctron•2h ago•24 comments

GrapheneOS is the only Android OS providing full security patches

https://grapheneos.social/@GrapheneOS/115647408229616018
53•akyuu•3h ago•8 comments

Touching the Elephant – TPUs

https://considerthebulldog.com/tte-tpu/
56•giuliomagnifico•4h ago•20 comments

Linux Install Fest Belgrade

https://dmz.rs/lif2025_en
87•ubavic•6h ago•11 comments

Self-hosting my photos with Immich

https://michael.stapelberg.ch/posts/2025-11-29-self-hosting-photos-with-immich/
524•birdculture•6d ago•278 comments

A compact camera built using an optical mouse

https://petapixel.com/2025/11/13/this-guy-built-a-compact-camera-using-an-optical-mouse/
178•PaulHoule•3d ago•32 comments

Mapping Amazing: Bee Maps

https://maphappenings.com/2025/11/06/bee-maps/
16•altilunium•6d ago•5 comments

The unexpected effectiveness of one-shot decompilation with Claude

https://blog.chrislewis.au/the-unexpected-effectiveness-of-one-shot-decompilation-with-claude/
63•knackers•1w ago•32 comments

How I discovered a hidden microphone on a Chinese NanoKVM

https://telefoncek.si/2025/02/2025-02-10-hidden-microphone-on-nanokvm/
133•ementally•3h ago•37 comments

Cloudflare outage on December 5, 2025

https://blog.cloudflare.com/5-december-2025-outage/
723•meetpateltech•1d ago•527 comments

Z-Image: Powerful and highly efficient image generation model with 6B parameters

https://github.com/Tongyi-MAI/Z-Image
6•doener•6d ago•0 comments

The Absent Silence (2010)

https://www.ursulakleguin.com/blog/3-the-absent-silence
46•dcminter•4d ago•9 comments

Kids who ran away to 1960s San Francisco

https://www.fieldnotes.nautilus.quest/p/the-kids-who-ran-away-to-1960s-san
54•zackoverflow•3d ago•3 comments

PalmOS on FisherPrice Pixter Toy

https://dmitry.gr/?r=05.Projects&proj=27.%20rePalm#pixter
147•dmitrygr•13h ago•17 comments

Schizophrenia sufferer mistakes smart fridge ad for psychotic episode

https://old.reddit.com/r/LegalAdviceUK/comments/1pc7999/my_schizophrenic_sister_hospitalised_hers...
321•hliyan•9h ago•269 comments

Netflix to Acquire Warner Bros

https://about.netflix.com/en/news/netflix-to-acquire-warner-bros
1627•meetpateltech•1d ago•1237 comments

Gemini 3 Pro: the frontier of vision AI

https://blog.google/technology/developers/gemini-3-pro-vision/
506•xnx•1d ago•264 comments

Wolfram Compute Services

https://writings.stephenwolfram.com/2025/12/instant-supercompute-launching-wolfram-compute-services/
195•nsoonhui•9h ago•99 comments

Have I been Flocked? – Check if your license plate is being watched

https://haveibeenflocked.com/
247•pkaeding•13h ago•159 comments

Making tiny 0.1cc two stroke engine from scratch

https://youtu.be/nKVq9u52A-c?si=KVY6AK7tsudqnbJN
104•pillars•5d ago•26 comments

Leaving Intel

https://www.brendangregg.com/blog//2025-12-05/leaving-intel.html
299•speckx•19h ago•167 comments

Divine D native Linux open-source mobile system – Rev. 1.1 Hardware Architecture

https://docs.dawndrums.tn/blog/dd-rev1.1-arch/
34•wicket•4d ago•7 comments

Infracost (YC W21) is hiring Sr Node Eng to make $600B/yr cloud spend proactive

https://www.ycombinator.com/companies/infracost/jobs/Sr9rmHs-senior-product-engineer-node-js
1•akh•10h ago

Netflix’s AV1 Journey: From Android to TVs and Beyond

https://netflixtechblog.com/av1-now-powering-30-of-netflix-streaming-02f592242d80
515•CharlesW•1d ago•265 comments

Frinkiac – 3M "The Simpsons" Screencaps

https://frinkiac.com/
136•GlumWoodpecker•3d ago•44 comments

Patterns for Defensive Programming in Rust

https://corrode.dev/blog/defensive-programming/
300•PaulHoule•1d ago•76 comments

Why Speed Matters

https://lemire.me/blog/2025/12/05/why-speed-matters/
64•gsky•4h ago•28 comments

Idempotency keys for exactly-once processing

https://www.morling.dev/blog/on-idempotency-keys/
162•defly•5d ago•67 comments

The missing standard library for multithreading in JavaScript

https://github.com/W4G1/multithreading
115•W4G1•19h ago•31 comments

Towards the Cutest Neural Network

https://kevinlynagh.com/towards-the-cutest-neural-network/
121•surprisetalk•7mo ago

Comments

mattdesl•7mo ago
I wonder how well BitNet (ternary weights) would work for this. It seems like a promising way forward for constrained hardware.

https://arxiv.org/abs/2310.11453

https://github.com/cpldcpu/BitNetMCU/blob/main/docs/document...

gitroom•7mo ago
I gotta say, I'm always interested in new ways to make stuff lighter, especially for small devices. Do you think these clever tricks actually hold up for real-world use, or do they just look cool on paper?
Onavo•7mo ago
> since our input data comes from multiple sensors and the the output pose has six components (three spatial positions and three spatial rotations)

Typo: two "the"

For robotics/inverse pose applications, don't people usually use a 3x3 matrix (three rotations, three spatial) for coordinate representation? Otherwise you get weird gimbal lock issues (I think).

lynaghk•7mo ago
For my application I need just the translations and Euler angles. The range of poses is mechanically constrained so I don't have to worry about gimbal lock. But yeah, my limited understanding matches yours that other parameterizations are more useful in general contexts.

This post and interactive explanations have been on my backlog to read and internalize: https://thenumb.at/Exponential-Rotations/

(Also: Thanks for pointing out the typo, I just deployed a fix.)

01HNNWZ0MV43FF•7mo ago
Hey there op. I don't know what your sensors are measuring (distance to a point maybe? Or angle from a Valve lighthouse for inside-out tracking?)

But here's my "why didn't you just"

Since you have a forward simulation function (pose to measurements), why didn't you use an iterative solver to reverse it? Coordinate descent is easy to code, and if you have a constrained range of poses you can probably just use multiple starting points to avoid getting stuck in a local minimum. Then use the last solution as a starting point for the next one to save iterations.

Sure, it's not closed-form like an NN and it can still have pathological cases, but the code is a little more transparent.

lynaghk•7mo ago
That's a reasonable idea, but unfortunately wouldn't work in my case since the simulation relies on a lot of scientific libraries in Python and I need the inversion to happen on the microcontroller.

When you say "coordinate descent" do you mean gradient descent? I.e., updating a potential pose using the gradient of a loss term (e.g., (predicted sensor reading - actual sensor reading)**2)?

I bet that would work, but a tricky part would be calculating gradients. I'm not sure if the Python libraries I'm using support that. My understanding is that automatic differentiation through libraries might be easier in a language like Julia where dual numbers flow through everything via the multiple dispatch mechanism.

01HNNWZ0MV43FF•7mo ago
Ah makes sense.

No, coordinate descent is a stupider gradient-optional method: https://en.wikipedia.org/wiki/Coordinate_descent

It's slow and sub-optimal, but the code is very easy to follow and you don't have to wonder whether your gradient is correct.
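
(For illustration, a minimal gradient-free coordinate-descent loop of the kind described above. The loss function here is a hypothetical stand-in: it would run the forward simulation and compare predicted to actual sensor readings.)

    import numpy as np

    def coordinate_descent(loss, x0, step=0.1, iters=200):
        # Gradient-free: nudge one pose component at a time and keep
        # whichever direction lowers the loss.
        x = np.asarray(x0, dtype=float)
        best = loss(x)
        for _ in range(iters):
            for i in range(x.size):
                for delta in (step, -step):
                    cand = x.copy()
                    cand[i] += delta
                    l = loss(cand)
                    if l < best:
                        x, best = cand, l
            step *= 0.95  # shrink the step as the solution settles
        return x

    # loss(pose) would return sum((predicted - actual)**2) over the
    # sensor readings; restarting from a few poses spread over the
    # constrained range helps dodge local minima.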

thomassmith65•7mo ago
What benefit does jax.nn provide over rolling one's own? There are countless examples on the web of small neural networks, written from scratch.
thih9•7mo ago
Could you point to an example that you like more? One of the author’s goals is to:

> solicit “why don’t you just …” emails from experienced practitioners who can point me to the library/tutorial I’ve been missing =D (see the alternatives-considered down the page for what I struck out on)

light_hue_1•7mo ago
All of this is absurdly complicated. Exactly what I would expect from a new student who doesn't know what they're doing and has no one to teach them how to do engineering in a systematic manner. I don't mean this as an insult. I teach this stuff and have seen it hundreds of times.

You should look for "post-training static quantization", also called PTSQ. There are countless ways to quantize. This will quantize both the weights and the activations after training.

You're doing this on hard mode for no reason. This is typical and something I often need to break people out of. Optimizing for performance by doing custom things in Jax when you're a beginner is a terrible path to take.

Performance is not your problem. You're training a trivial network that would have run on a CPU 20 years ago.

There's no clear direction here, just trying complicated stuff in no logical order with no learning or dependencies between steps. You need to treat these problems as scientific experiments. What do I do to learn more about my domain, what do I change depending on the answer I get, etc. Not, now it's time to try something else random like jax.

Worse. You need to learn the key lesson in this space. Credit assignment for problems is extremely hard. If something isn't working why isn't it? Because of a bug? A hopeless problem? Using a crappy optimizer? Etc. That's why you should start in a framework that works and escape it later if you want.

Here's a simple plan to do this:

First forget about quantization. Use pytorch. Implement your trivial network in 5 lines. Train it with Adam. Make sure it works. Make sure your problem is solvable with the data that you have, the network you've chosen, your activation functions, the loss, and the optimizer (use Adam; forget about doing stuff by hand for now).

> Unless I had an expert guide who was absolutely sure it’d be straightforward (email me!), I’d avoid high-level frameworks like TensorFlow and PyTorch and instead implement the quantization-aware training myself.

This is exactly backwards. Unless you have an expert, never implement anything yourself. If you don't have one, rely on what already exists, because then you can logically narrow down the options for what works and what's wrong. If you do it yourself, you're always lost.

Once you have that working start backing off. Slowly change the working network into what you need. Step by step. At every step write down why you think your change is good and what you would do if it isn't. Then look at the results.

Forget about microflow-rs or whatever. Train with pytorch, export to onnx, and generate C code from the onnx model for inference.

Read the pytorch guide on PTSQ and use it.
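
(A rough sketch of that plan — layer sizes are invented, and this uses PyTorch's eager-mode PTSQ path, one of several quantization routes; it is not the commenter's code.)

    import torch
    import torch.nn as nn
    import torch.ao.quantization as tq

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = tq.QuantStub()      # float -> int8 at the input
            self.body = nn.Sequential(
                nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 6))
            self.dequant = tq.DeQuantStub()  # int8 -> float at the output

        def forward(self, x):
            return self.dequant(self.body(self.quant(x)))

    model = TinyNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    # ... ordinary fp32 training loop with Adam goes here ...

    model.eval()
    model.qconfig = tq.get_default_qconfig("qnnpack")
    prepared = tq.prepare(model)      # insert activation observers
    # run a few batches of representative inputs through `prepared`
    # so the observers can pick scales and zero points, then:
    quantized = tq.convert(prepared)  # int8 weights and activations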

revskill•7mo ago
Well said. Thanks.
omneity•7mo ago
Despite the tone this is excellent advice! I had similar impressions reading the article and was wondering if I missed something.
seletskiy•7mo ago
I kind of see your point, but only in the context of working on a time-sensitive task which others rely upon. If it is a hobby/educational project, what is wrong with doing things yourself, and resorting to decomposing an existing solution if you can't figure out why yours is not working?

There's nothing better for understanding something than trying to build that "something" from scratch yourself.

bubblyworld•7mo ago
I think the point is that OP is learning things about a wide variety of topics that aren't really relevant to their stated goal, i.e. solving the sensor/state inference problem.

Which, as you say, can be valuable! There's nothing wrong with that. But the more complexity you add the less likely you are to actually solve the problem (all else being equal, some problems are just inherently complex).

jasonjmcghee•7mo ago
Targeting ONNX and using something like https://github.com/kraiskil/onnx2c as parent mentioned is good advice.
JanSchu•7mo ago
Nice write‑up. A couple of notes from doing roughly the same dance on Cortex‑M0 and M3 boards for sensor fusion.

1. You can, in fact, get rid of every FP instruction on M0. The trick is to pre‑bake the scale and zero_point into a single fixed‑point multiplier per layer (the dyadic form you mentioned). The formula is

    y = ((W·x + b) * M) >> s

where M fits in an int32 and s is the power‑of‑two shift. You compute M and s once on the host, write them as const tables, and your inner loop is literally a multiply‑accumulate followed by a multiply and shift. No soft-float library, no division.

2. CMSIS‑NN already gives you the fast int8 kernels. The docs are painful but you can steal just four files: arm_fully_connected_q7.c, arm_nnsupportfunctions.c, and their headers. On M0 this compiled to ~3 kB for me. Feed those kernels fixed‑point activations and you only pay for the ops you call.

3. Workflow that kept me sane

Prototype in PyTorch. Tiny net, ReLU, MSE, Adam, done.

torch.quantization.quantize_qat for quantization‑aware training. Export to ONNX, then run a one‑page Python script that dumps .h files with weight, bias, M, s.

Hand‑roll the inference loop in C. It is about 40 lines per layer, easy to unit‑test on the host with the same vectors you trained on.

By starting with a known‑good fp32 model you always have a checksum: the int8 path must match fp32 within tolerance or you know exactly where to look.
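
(A sketch of how M and s might be derived on the host — my own illustration of the dyadic form in point 1, not JanSchu's actual tooling; rounding refinements like gemmlowp's saturating doubling high-mul are omitted.)

    def compute_M_s(scale: float, bits: int = 31):
        # Express the per-layer rescale factor as scale ~= M / 2**s,
        # with M fitting in an int32.
        s = bits
        M = round(scale * (1 << s))
        while M >= (1 << 31):  # keep M in int32 range for large scales
            M >>= 1
            s -= 1
        return M, s

    def rescale(acc: int, M: int, s: int) -> int:
        # acc is the int32 accumulator holding W.x + b; in C the
        # product acc * M needs a 64-bit intermediate before the shift.
        return (acc * M) >> s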

lynaghk•7mo ago
Awesome, thanks! This is exactly the kind of experienced take I was hoping my blog post would summon =D

Re: computing M and s, does torch.quantization.quantize_qat do this or do you do it yourself from the (presumably f32) activation scaling that torch finds?

I don't have much experience with this kind of numerical computing, so I have no intuition about how much the "quantization" of selecting M and s might impact the overall performance of the network. I.e., whether

- M and s should be trained as part of QAT (e.g., the "Learned Step Size Quantization" paper)

- it's fine to just deterministically compute M and s from the f32 activation scaling.

Also: Thanks for the tips re: CMSIS-NN, glad to know it's possible to use in a non-framework way. Any chance your example is open source somewhere?

alexpotato•7mo ago
If people like "give me the simplest possible working coding example" of neural networks, I highly recommend this one:

"A Neural Network in 11 lines of Python (Part 1)": https://iamtrask.github.io/2015/07/12/basic-python-network/

QuadmasterXLII•7mo ago
Experienced practitioner here: the second half of the post describes doing everything exactly the way I have done it (the only differences are that I picked C++ and Eigen instead of Rust and nalgebra for inference, and I used torch's ndarray and backprop tools instead of jax's, with the analogous "just print out C++ code from Python" approach to weight serialization). You picked up on the key insight, which is that the code needed to just directly implement the inference equations is much smaller than the configuration file of any possible framework that was flexible enough to meet your requirements (Rust, no inference-time allocation, no inference-time floating point, trained from scratch, ultra-small parameter count, …)
hansvm•7mo ago
The last time I did anything like this, the easiest workflow I found was to use your favorite high-level runtime for training and just implement a serializer converting the model into source code for your target embedded system. Hand-code the inference loop. This is exactly the strategy TFA landed on.

One advantage of having it implemented in code is that you can observe and think about the instructions being generated. TFA didn't talk at all about something pretty important for small/fast neural networks -- the normal "cleanup" code (padding, alignment, length alignment, data-dependent horizontal sums, etc) can dwarf the actual mul->add execution times. You might want to, e.g., ensure your dimensions are all multiples of 8. You definitely want to store weights as column-major instead of row-major if the network is written as vec @ mat instead of mat @ vec (and vice versa for the latter).

When you're baking weights and biases into code like that, use an affine representation -- explicitly pad the input with the number one, along with however many extra zeroes you need for whatever other length-padding requirements make sense for your problem (usually none for embedded, but this is a similar workflow to low-resource networks on traditional computers, where you probably want vectorization).
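
(In code, the bias folding looks like this — a small NumPy illustration with arbitrary shapes, not from TFA.)

    import numpy as np

    # Fold the bias into the weight matrix: y = W @ x + b becomes
    # y = [W | b] @ [x; 1], so the inner loop is a single matmul.
    W = np.random.randn(6, 8).astype(np.float32)
    b = np.random.randn(6).astype(np.float32)

    Wa = np.hstack([W, b[:, None]])      # 6 x 9 augmented matrix
    x = np.random.randn(8).astype(np.float32)
    xa = np.concatenate([x, [1.0]]).astype(np.float32)  # pad with a 1

    assert np.allclose(W @ x + b, Wa @ xa, atol=1e-5)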

Floats are a tiny bit hard to avoid for dot products. For similar precision, you require nearly twice the bit count in a fixed-point representation just to make the multiplies work, plus some extra bits proportional to the log2 of the dimension. E.g., if you trained on f16 inputs then you'll have roughly comparable precision with i32 fixed-point weights, and that's assuming you go through the effort to scale and shift everything into an appropriate numerical regime. Twice the instruction count (or thereabouts) on twice the register width makes fixed-point 2-4x slower for similar precision than a hardware float, supposing those wide instructions exist for your microcontroller, and soft floats are closer to 10x slower for multiply-accumulate. If you're emulating wide integer instructions, just use soft floats. If you don't care about a 4x slowdown, just use soft floats.
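
(Illustrative only: a quick numerical check of that precision claim, using Q16.16 fixed point with an int64 accumulator as a stand-in for the "twice the bit count" regime.)

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)
    w = rng.standard_normal(64)
    exact = float(x @ w)

    # f16 dot product
    f16 = float(np.dot(x.astype(np.float16), w.astype(np.float16)))

    # Q16.16 fixed point: each product carries 32 fractional bits, so
    # the accumulator needs int64 width.
    S = 16
    xi = np.round(x * (1 << S)).astype(np.int64)
    wi = np.round(w * (1 << S)).astype(np.int64)
    fixed = float((xi * wi).sum()) / (1 << (2 * S))

    print(exact, f16, fixed)  # both approximations land near `exact`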

Training can be a little finicky for small networks. At a minimum, you probably want to create train/test/validate sets and have many training runs. There are other techniques if you want to go down a rabbit hole.

Other ML architectures can be much more performant here. Gradient-boosted trees are already SOTA on many of these problems, and oblivious trees map extremely well to normal microcontroller instruction sets. By skipping the multiplies, your fixed-point precision is on par with floats of similar bit-width, making quantization a breeze.
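
(Oblivious trees reduce to plain index arithmetic, which is why they suit microcontrollers — a hypothetical sketch, not a library API: every node at a given depth tests the same feature and threshold, so the comparison bits pack directly into a leaf-table index.)

    def eval_oblivious_tree(x, features, thresholds, leaves):
        # One (feature, threshold) pair per level; depth d comparisons
        # build a d-bit index into a table of 2**d leaf values.
        idx = 0
        for f, t in zip(features, thresholds):
            idx = (idx << 1) | (x[f] > t)
        return leaves[idx]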

imtringued•7mo ago
This is such a confused blog post that I swear it had to be written by a teenager just banging their head against the wall.

Wanting to natively train a quantized neural network is stupid unless you are training directly on your microcontroller. I was constantly waiting for the author to explain their special circumstances, and it turns out they don't have any. They just have a standard TinyML [0] use case that's been done to death with fixed-point quantization-aware training, which, unlike what the author of the blog post said, doesn't rely on terabytes of data.

QAT is done on a conventionally trained model with much less data than the full training process. Doing QAT early has no benefits. The big downside of QAT isn't that you need a lot of data, it's that you need the same data distribution as the original training data and nobody has access to that, because only the weights are published.

[0] https://medium.com/@thommaskevin/tinyml-quantization-aware-t...

jasonjmcghee•7mo ago
Out of curiosity, did you consider bayesian state estimators?

For example, an unscented kalman filter: https://www.mathworks.com/help/control/ug/nonlinear-state-es...

nico•7mo ago
Great article. For a moment, I thought this would be about a gen AI that would turn any input into a “kawaii” version of it.

Anyway, excellent insights and detail

gwern•7mo ago
My suggestion would be that, since you want a tiny integer-only NN tailored for a specific computer, are only occasionally training one for a specific task, and you have a simulator to generate unlimited data, you simply do random search or an evolutionary method like CMA-ES.

They are easy to understand and code up by hand in a few lines (which is one reason you won't find any libraries for them - they are the 'leftpad' or 'isEven' of NNs: the effort it would take to install, understand, and use a library often exceeds what it would take to just write it yourself), will handle any NN topology or numeric type you can invent, and will train very fast in this scenario.
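
(For illustration, a minimal sketch of the random-search flavor — a (1+1)-style loop. The loss function is assumed: it would flatten a candidate parameter vector into the tiny net and score it on simulator-generated data.)

    import numpy as np

    def evolve(loss, dim, iters=100_000, sigma=0.1, seed=0):
        # (1+1) evolution strategy: mutate the current best parameter
        # vector and keep the mutation whenever it lowers the loss.
        rng = np.random.default_rng(seed)
        best = np.zeros(dim)
        best_loss = loss(best)
        for _ in range(iters):
            cand = best + sigma * rng.standard_normal(dim)
            l = loss(cand)
            if l < best_loss:
                best, best_loss = cand, l
        return best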