frontpage.

France's homegrown open source online office suite

https://github.com/suitenumerique
1•nar001•58s ago•0 comments

SpaceX Delays Mars Plans to Focus on Moon

https://www.wsj.com/science/space-astronomy/spacex-delays-mars-plans-to-focus-on-moon-66d5c542
1•BostonFern•1m ago•0 comments

Jeremy Wade's Mighty Rivers

https://www.youtube.com/playlist?list=PLyOro6vMGsP_xkW6FXxsaeHUkD5e-9AUa
1•saikatsg•1m ago•0 comments

Show HN: MCP App to play backgammon with your LLM

https://github.com/sam-mfb/backgammon-mcp
1•sam256•3m ago•0 comments

AI Command and Staff–Operational Evidence and Insights from Wargaming

https://www.militarystrategymagazine.com/article/ai-command-and-staff-operational-evidence-and-in...
1•tomwphillips•3m ago•0 comments

Show HN: CCBot – Control Claude Code from Telegram via tmux

https://github.com/six-ddc/ccbot
1•sixddc•4m ago•1 comments

Ask HN: Is the CoCo 3 the best 8 bit computer ever made?

1•amichail•7m ago•0 comments

Show HN: Convert your articles into videos in one click

https://vidinie.com/
1•kositheastro•9m ago•0 comments

Red Queen's Race

https://en.wikipedia.org/wiki/Red_Queen%27s_race
2•rzk•10m ago•0 comments

The Anthropic Hive Mind

https://steve-yegge.medium.com/the-anthropic-hive-mind-d01f768f3d7b
2•gozzoo•12m ago•0 comments

A Horrible Conclusion

https://addisoncrump.info/research/a-horrible-conclusion/
1•todsacerdoti•12m ago•0 comments

I spent $10k to automate my research at OpenAI with Codex

https://twitter.com/KarelDoostrlnck/status/2019477361557926281
2•tosh•13m ago•0 comments

From Zero to Hero: A Spring Boot Deep Dive

https://jcob-sikorski.github.io/me/
1•jjcob_sikorski•14m ago•0 comments

Show HN: Solving NP-Complete Structures via Information Noise Subtraction (P=NP)

https://zenodo.org/records/18395618
1•alemonti06•19m ago•1 comments

Cook New Emojis

https://emoji.supply/kitchen/
1•vasanthv•22m ago•0 comments

Show HN: LoKey Typer – A calm typing practice app with ambient soundscapes

https://mcp-tool-shop-org.github.io/LoKey-Typer/
1•mikeyfrilot•24m ago•0 comments

Long-Sought Proof Tames Some of Math's Unruliest Equations

https://www.quantamagazine.org/long-sought-proof-tames-some-of-maths-unruliest-equations-20260206/
1•asplake•25m ago•0 comments

Hacking the last Z80 computer – FOSDEM 2026 [video]

https://fosdem.org/2026/schedule/event/FEHLHY-hacking_the_last_z80_computer_ever_made/
2•michalpleban•26m ago•0 comments

Browser-use for Node.js v0.2.0: TS AI browser automation parity with PY v0.5.11

https://github.com/webllm/browser-use
1•unadlib•27m ago•0 comments

Michael Pollan Says Humanity Is About to Undergo a Revolutionary Change

https://www.nytimes.com/2026/02/07/magazine/michael-pollan-interview.html
2•mitchbob•27m ago•1 comments

Software Engineering Is Back

https://blog.alaindichiappari.dev/p/software-engineering-is-back
2•alainrk•28m ago•1 comments

Storyship: Turn Screen Recordings into Professional Demos

https://storyship.app/
1•JohnsonZou6523•28m ago•0 comments

Reputation Scores for GitHub Accounts

https://shkspr.mobi/blog/2026/02/reputation-scores-for-github-accounts/
2•edent•32m ago•0 comments

A BSOD for All Seasons – Send Bad News via a Kernel Panic

https://bsod-fas.pages.dev/
1•keepamovin•35m ago•0 comments

Show HN: I got tired of copy-pasting between Claude windows, so I built Orcha

https://orcha.nl
1•buildingwdavid•35m ago•0 comments

Omarchy First Impressions

https://brianlovin.com/writing/omarchy-first-impressions-CEEstJk
2•tosh•41m ago•1 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
7•onurkanbkrc•41m ago•0 comments

Show HN: Versor – The "Unbending" Paradigm for Geometric Deep Learning

https://github.com/Concode0/Versor
1•concode0•42m ago•1 comments

Show HN: HypothesisHub – An open API where AI agents collaborate on medical res

https://medresearch-ai.org/hypotheses-hub/
1•panossk•45m ago•0 comments

Big Tech vs. OpenClaw

https://www.jakequist.com/thoughts/big-tech-vs-openclaw/
1•headalgorithm•48m ago•0 comments

Bit is all we need: binary normalized neural networks

https://arxiv.org/abs/2509.07025
101•PaulHoule•4mo ago

Comments

modeless•4mo ago
> each parameter exists in two forms simultaneously during training: a full-precision 32-bit floating-point value (p) used for gradient updates, and its binarized counterpart (pb) used for forward computations

So this is only for inference. Also activations aren't quantized, I think?

nighthawk454•4mo ago
Yeah, but it's 'quantization aware' during training too, which presumably is what allows the quantization at inference to work
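
For illustration, a minimal sketch of that scheme, assuming PyTorch: a full-precision master weight (p) is kept for the optimizer, while the forward pass sees its binarized copy (pb), with the gradient passed through sign() via a straight-through estimator. The layer shape, init, and clipping rule here are illustrative assumptions, not the paper's exact recipe.

    import torch

    class BinarizeSTE(torch.autograd.Function):
        """Sign-binarize on the forward pass; pass gradients straight through."""
        @staticmethod
        def forward(ctx, p):
            ctx.save_for_backward(p)
            return torch.sign(p)                 # pb: binarized copy used for computation

        @staticmethod
        def backward(ctx, grad_out):
            (p,) = ctx.saved_tensors
            # straight-through estimator: ignore sign()'s zero derivative,
            # clipping where |p| > 1 so the master weights stay bounded
            return grad_out * (p.abs() <= 1).float()

    class BinaryLinear(torch.nn.Module):
        def __init__(self, in_features, out_features):
            super().__init__()
            # full-precision master weights (p) kept for gradient updates
            self.weight = torch.nn.Parameter(0.01 * torch.randn(out_features, in_features))

        def forward(self, x):
            wb = BinarizeSTE.apply(self.weight)  # binarized weights (pb) for the forward pass
            return x @ wb.t()
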
benob•4mo ago
I wonder if one could store only the binary representation at training and sample a floating point representation (both weights and gradient) during backprop.
adastra22•4mo ago
Back propagation on random data that is then thrown away would be pretty useless.
jampekka•4mo ago
> Also activations aren't quantized, I think?

The very last conclusion: "Future work will focus on the implementation of binary normalization layers using single-bit arrays operations, as well as on quantizing layer activations to 8 or 16-bit precision. These improvements are expected to further enhance the efficiency and performance of the binary neural network models."
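
As a rough illustration of that activation-quantization direction, a symmetric per-tensor fake-quant; the absmax scale choice and rounding here are assumptions, not the paper's method:

    import torch

    def fake_quant_activations(a, bits=8):
        """Symmetric per-tensor fake quantization of activations (illustrative only)."""
        qmax = 2 ** (bits - 1) - 1
        scale = a.abs().amax().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(a / scale), -qmax, qmax) * scale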

kouteiheika•4mo ago
You don't necessarily have to store the parameters in fp32 for gradient updates; I experimented with it and got it working (full fine-tuning of all parameters) with parameters as low as 3-bit (a little more than 3-bit, because the block-wise scales were higher precision), which is essentially as low as you can go before "normal" training starts breaking down.
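
A minimal sketch of that kind of block-wise scheme, assuming PyTorch; the block size, rounding, and int8 storage are illustrative assumptions, not the commenter's exact setup:

    import torch

    def blockwise_quantize(t, block_size=64, bits=3):
        """Quantize a flat tensor in blocks, keeping one higher-precision scale per block."""
        levels = 2 ** (bits - 1) - 1               # symmetric range, e.g. -3..3 for 3-bit
        flat = t.flatten()
        flat = torch.nn.functional.pad(flat, (0, (-flat.numel()) % block_size))
        blocks = flat.view(-1, block_size)
        scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / levels
        q = torch.clamp(torch.round(blocks / scales), -levels, levels)
        return q.to(torch.int8), scales            # low-bit codes + per-block scales

    def blockwise_dequantize(q, scales, shape):
        flat = (q.float() * scales).flatten()
        return flat[: torch.Size(shape).numel()].view(shape)
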
noosphr•4mo ago
Yes, that's been the downside of these forever.

If you use quantized differentiation you can get away with using integers for gradient updates. Explaining how takes a paper and in the end it doesn't even work very well.

At university, way back at the end of the last AI winter, I ended up using genetic algorithms to train the models. It was very interesting because the weights were trained along with the hyperparameters. It was nowhere near practical, because gradient descent is so much better at getting real-world results in reasonable time frames - surprisingly, because it's more memory efficient.

AmazingTurtle•4mo ago
I'm gonna refer to this one here: https://news.ycombinator.com/item?id=45361007
s20n•4mo ago
Attention Is All You Need - The Beatles ft. Charlie Puth
JKCalhoun•4mo ago
b/w "All You Need is Love".
amelius•4mo ago
Attention is all Google needs. Apparently.

I'm sick of BigTech fighting for my attention.

mnky9800n•4mo ago
Yes, calling your paper this now makes me think it has no interesting results. It is kind of the opposite of what's intended.
gloomyday•4mo ago
This naming trend has been going for 8 years. Incredible.
IshKebab•4mo ago
It's on my naughty list, together with "... considered harmful", "The unreasonable effectiveness of ...", "... for fun and profit", "Falsehoods programmers believe about ...", "The rise and fall of ...".
fxtentacle•4mo ago
These techniques are not new. And the reason why they’re usually not used is on page 9 in the paper. They require about 10x as many training iterations.
typpilol•4mo ago
Yea I saw that training perplexity and thought hmmm...
shomp•4mo ago
Turns out using floats is a feature and not a bug?
Dylan16807•4mo ago
No, I don't think so, in that I don't think anyone has ever called that a bug.
shomp•4mo ago
In the paper summary they did not call it a bug explicitly, but they do say there are 32x improvements in using single bits instead.
reactordev•4mo ago
To memory, sure. At the cost of 32x slower speeds.
Dylan16807•4mo ago
That's an obvious exaggeration. The competition is using smaller weights already, some of which are floating point and some of which aren't.

And they use full size floats for training.

imtringued•4mo ago
That means their paper is actually worse than SOTA, which is concerned with training in fp4 natively without full precision [0] for QAT.

[0] "full precision" in ML usually means 16 bit floats like bfloat16

Dylan16807•4mo ago
I wouldn't say "worse". It's focusing on inference cost and leaving training at a default for now.
personalityson•4mo ago
Unless each iteration is 90% faster
amelius•4mo ago
This.

In fact, it can be slower because hardware is probably not optimized for the 1-bit case, so there may be a lot of low-hanging fruit for hardware designers and we may see improvements in the next iteration of hardware.

nlitened•4mo ago
Isn't digital (binary) hardware literally optimized for 1-bit case by definition?
reactordev•4mo ago
People are confusing word size…

The CPU can handle up to word-size bits at once. I believe they mean that a lot of assembly was written for integer math and not bit math (word size 4+). However, it is unlikely we'll see improvements in this area because, by definition, using 64-bit floats uses the max word size. So… that's the max throughput. Sending 1 bit vs 64 bits would be considerably slower, so this entire approach is funny.
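
For what it's worth, binary-network kernels usually don't send one bit at a time: the standard trick (not necessarily what this paper implements) is to pack 64 ±1 weights into one machine word and turn 64 multiply-adds into one XOR plus a popcount. A plain-Python sketch:

    def pack_bits(signs):
        """Pack a sequence of +1/-1 values into an int, one bit each (+1 -> 1, -1 -> 0)."""
        word = 0
        for i, s in enumerate(signs):
            if s > 0:
                word |= 1 << i
        return word

    def binary_dot(a_word, w_word, n):
        """Dot product of two +/-1 vectors stored as packed bits.
        Matches contribute +1 and mismatches -1, so dot = n - 2 * popcount(a XOR w)."""
        mismatches = bin((a_word ^ w_word) & ((1 << n) - 1)).count("1")
        return n - 2 * mismatches

    # 64 weights fit in one 64-bit word: one XOR and one popcount cover 64 multiply-adds
    a = pack_bits([+1, -1, +1, +1])
    w = pack_bits([+1, +1, -1, +1])
    print(binary_dot(a, w, 4))   # -> 0 (two matches, two mismatches)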

observationist•4mo ago
No, because there are algorithmic shortcuts that allow approximations and skipped steps compared to a strict binary step-by-step calculation, using in-memory bit reads and implicit rules, among other structural advantages in how GPU and CPU instruction sets are implemented in hardware.
nickpsecurity•4mo ago
FPGAs could be highly competitive for models with unusual but small bit lengths, especially single bits, since their optimizers will handle that easily.
fxtentacle•4mo ago
In this paper, each iteration has to be slower. Because they need to calculate both their new method (which may be faster) and also the traditional method (because they need a float gradient). And old+new will always be slower than just old.
PaulHoule•4mo ago
When I was working for startups trying to develop foundation models circa 2015 we were concerned with training more than inference.

Today, with models that are actually useful, training costs matter much less than inference costs. A 10x increase in training costs is not necessarily prohibitive if you get a 10x decrease in inference costs.

nickpsecurity•4mo ago
I still don't have a GPT3-class model that was trained without copyright infringement. I'd have so many uses for it from research to production. What's stopping me is the $30 million training cost for 180B models. Even a 30B like Mosaic cost over a million dollars.

So, I strongly disagree unless we're talking about the five or six companies that already spend tens of millions on training and keep repeating that. Outside of them, the medium to large models are done infrequently or one off by a small number of other companies. Then, most of us are stuck with their pretraining efforts because we can't afford it ourselves.

On my end, I'd rather see a model that drops pretraining costs to almost nothing but costs 10-32x more to do inference. My uses would produce mere MB of output vs hundreds of GB to TB that pretraining requires. A competitive use that costs 32x current prices would probably be profitable for me. Optimizations, which are plentiful for inference, might bring it down further.

arthurcolle•4mo ago
Why are you making something cheap more expensive than it needs to be?
nickpsecurity•4mo ago
It's not cheap. It costs millions to $100 million depending on the model. I was responding to this tradeoff:

"A 10x increase in training costs is not necessarily prohibitive if you get a 10x decrease in inference costs."

Given millions and up, I'd like that to be 10x cheaper while inference was 10x more expensive. Then, it could do research or coding for me at $15/hr instead of $1.50/hr. I'd just use it carefully with batching.

imtringued•4mo ago
Calculating the gradient requires a forward pass (inference) and a backward pass (back propagation).

They cost roughly the same, with the backwards pass being maybe 50% more expensive. So let's say three times the cost of a forward pass.

You can't make training faster by making inference slower.

nickpsecurity•4mo ago
I was responding to their claim by starting with an assumption that it may be correct. I don't know the cost data myself. Now, I'll assume what you say is true.

That leaves computation and memory use of two passes plus interlayer communication.

I think backpropagation doesn't occur in the brain, since it appears to use local learning, but global optimization probably happens during sleep/dreaming. I have a lot of papers on removing backpropagation, Hebbian learning, and "local learning rules."

From there, many are publishing how to do training at 8-bit and below. A recent one did a mix of low-bit training with sub-1-bit storage for weights. The NoLayer architecture might address interlayer communication better.

People keep trying to build analog accelerators. There are mismatches between their features and the hardware. Recent work has come up with analog NNs that work well with analog hardware.

A combination of those would likely get cost down dramatically on both inference and training. Also, energy use would be lower.

PaulHoule•4mo ago
I think you're right but there has to be a limit. If I'm training a model I'm going to do a significant amount of inference to evaluate it and support the training.
pixelpoet•4mo ago
The critical "1" is missing from the title...
Dylan16807•4mo ago
"Bit" being singular gets the intent across just fine.
pixelpoet•4mo ago
Disagree
hirako2000•4mo ago
I also* disagree; otherwise we would say "a kilo of meat is enough"?
pixelpoet•4mo ago
Yes, that's the point I was making, and the other person said it's fine without saying how many bits, not me.
hirako2000•4mo ago
My bad, I meant that I disagree with parent, I edited it. I agree with you.
user823749•4mo ago
Yes, just like in "16 bit integer". No confusion at all.
jongjong•4mo ago
This reminds me of my university days. For one of the assignments, we had to write our own ANN from scratch for handwriting recognition, and we implemented a step activation function because that was easier than sigmoid; basically each layer would output one or zero, though I guess the weights themselves were scalars. It's just the node outputs that were 1 or 0... But this was convenient because the output of the final layer could be interpreted as a binary which could be converted straight into an ASCII character for comparison and backpropagation.
meindnoch•4mo ago
>could be interpreted as a binary which could be converted straight into an ASCII character for comparison and backpropagation.

There's nothing to backpropagate with a step function. The derivative is zero everywhere.

steppi•4mo ago
It sounds like jongjong was probably using surrogate gradients. You keep the step activation in the forward pass but replace it with a smooth approximation in the backward pass.
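
A minimal sketch of that idea, assuming PyTorch: hard step forward, derivative of a steep sigmoid backward. The sigmoid surrogate and its steepness are assumptions; spiking-net work uses various surrogates.

    import torch

    class StepWithSurrogate(torch.autograd.Function):
        """Hard step in the forward pass; gradient of a steep sigmoid in the backward pass."""
        BETA = 10.0  # assumed steepness of the sigmoid surrogate

        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return (x > 0).float()

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            s = torch.sigmoid(StepWithSurrogate.BETA * x)
            return grad_out * StepWithSurrogate.BETA * s * (1 - s)  # d/dx sigmoid(BETA * x)

    # usage: y = StepWithSurrogate.apply(pre_activations)
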
bjourne•4mo ago
Yeah, but then there is no performance benefit over plain old SGD.
steppi•4mo ago
Yeah, I think surrogate gradients are usually used to train spiking neural nets where the binary nature is considered an end in itself, for reasons of biological plausibility or something. Not for any performance benefits. It's not an area I really know that much about though.
nickpsecurity•4mo ago
There are performance benefits when they're implemented in hardware. The brain is a mixed-signal system whose massively parallel, tiny, analog components keep it ultra-fast at ultra-low energy.

Analog NNs, including spiking ones, share some of those properties. Several chips, like TrueNorth, are designed to take advantage of that on the biological side. Others, like Mythic AI's, are accelerating normal types of ML systems.

jongjong•4mo ago
I can't remember the name of the algorithm we used. It wasn't doing gradient descent, but it was based on a similar principle: adjust the weights up or down by some fixed amount, in proportion to their contribution to the error. It was much simpler than calculating gradients, but it still gave pretty good results for single-character recognition.
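
One rule that roughly matches that description is a Manhattan-style update: move each weight by a fixed step in the direction of its contribution to the error. This is only a guess at what such a rule might look like, not a reconstruction of the commenter's assignment:

    import torch

    def manhattan_update(w, x, target, step=0.01):
        """One fixed-step update for a single layer with step activations.
        w: (out, in) weights; x: (batch, in) inputs; target: (batch, out) desired bits."""
        y = (x @ w.t() > 0).float()        # binary outputs, e.g. bits of an ASCII code
        err = target - y                   # per-output error in {-1, 0, +1}
        # only the sign of each weight's contribution is used, so the step size is fixed
        w += step * torch.sign(err.t() @ x)
        return w
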
forntentacl•4mo ago
This paper ignores 50+ years of research on quantized networks and quantized training algorithms, and reaches wrong conclusions out of sheer ignorance.

TLDR abstract of a draft paper I wrote years ago, for those interested in the real limits of quantized networks:

We investigate the storage capacity of single‐layer threshold neurons under three synaptic precision regimes—binary (1‐bit), ternary (≈1.585‐bit), and quaternary (2‐bit)—from both information‐theoretic and algorithmic standpoints. While the Gardner bound stipulates maximal loads of α=0.83, 1.5 and 2.0 patterns per weight for the three regimes, practical algorithms only reach α_alg≈0.72, 1.0 and 2.0, respectively. By converting these densities into storage‐efficiency metrics—bits of synaptic memory per stored pattern—we demonstrate that only quaternary weights achieve the theoretical optimum in realistic settings, requiring exactly 1 bit of memory per pattern. Binary and ternary schemes incur 39 % and 58 % overheads, respectively.
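
A quick sanity check of those overhead figures as stated (bits of synaptic memory per stored pattern = bits per weight divided by the achievable patterns per weight):

    # bits per weight / achievable patterns per weight = bits of memory per stored pattern
    for name, bits, alpha_alg in [("binary", 1.0, 0.72),
                                  ("ternary", 1.585, 1.0),
                                  ("quaternary", 2.0, 2.0)]:
        bits_per_pattern = bits / alpha_alg
        overhead = bits_per_pattern - 1.0          # relative to the 1 bit/pattern optimum
        print(f"{name:10s}: {bits_per_pattern:.2f} bits/pattern ({overhead:.0%} overhead)")
    # binary    : 1.39 bits/pattern (39% overhead)
    # ternary   : 1.58 bits/pattern (58% overhead)
    # quaternary: 1.00 bits/pattern (0% overhead)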

naasking•4mo ago
Is this actually equivalent to classical forms of quantization, though? The paper has an extensive discussion of quantization on pages 2 and 3. This paper is not just a rehash of earlier work, but pushes single-bit precision into more parts of the system.
thijson•4mo ago
How does this compare to:

https://arxiv.org/pdf/1811.11431