frontpage.

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
391•klaussilveira•5h ago•85 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
749•xnx•10h ago•459 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
118•dmpetrov•5h ago•48 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
131•isitcontent•5h ago•14 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
234•vecti•7h ago•113 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
28•quibono•4d ago•1 comment

A century of hair samples proves leaded gas ban worked

https://arstechnica.com/science/2026/02/a-century-of-hair-samples-proves-leaded-gas-ban-worked/
57•jnord•3d ago•3 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
302•aktau•11h ago•152 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
304•ostacke•11h ago•82 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
160•eljojo•8h ago•121 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
377•todsacerdoti•13h ago•214 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
44•phreda4•4h ago•7 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
305•lstoll•11h ago•230 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
100•vmatsiiako•10h ago•34 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
167•i5heu•8h ago•127 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
138•limoce•3d ago•76 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
223•surprisetalk•3d ago•29 comments

FORTH? Really!?

https://rescrv.net/w/2026/02/06/associative
36•rescrv•12h ago•17 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
956•cdrnsf•14h ago•413 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
8•gfortaine•2h ago•0 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
7•kmm•4d ago•0 comments

Evaluating and mitigating the growing risk of LLM-discovered 0-days

https://red.anthropic.com/2026/zero-days/
33•lebovic•1d ago•11 comments

I'm going to cure my girlfriend's brain tumor

https://andrewjrod.substack.com/p/im-going-to-cure-my-girlfriends-brain
30•ray__•1h ago•6 comments

Claude Composer

https://www.josh.ing/blog/claude-composer
97•coloneltcb•2d ago•68 comments

The Oklahoma Architect Who Turned Kitsch into Art

https://www.bloomberg.com/news/features/2026-01-31/oklahoma-architect-bruce-goff-s-wild-home-desi...
17•MarlonPro•3d ago•2 comments

Show HN: Smooth CLI – Token-efficient browser for AI agents

https://docs.smooth.sh/cli/overview
76•antves•1d ago•56 comments

Show HN: Slack CLI for Agents

https://github.com/stablyai/agent-slack
37•nwparker•1d ago•8 comments

How virtual textures work

https://www.shlom.dev/articles/how-virtual-textures-really-work/
23•betamark•12h ago•22 comments

Evolution of car door handles over the decades

https://newatlas.com/automotive/evolution-car-door-handle/
38•andsoitis•3d ago•61 comments

The Beauty of Slag

https://mag.uchicago.edu/science-medicine/beauty-slag
27•sohkamyung•3d ago•3 comments

The lottery ticket hypothesis: why neural networks work

https://nearlyright.com/how-ai-researchers-accidentally-discovered-that-everything-they-thought-about-learning-was-wrong/
135•076ae80a-3c97-4•5mo ago

Comments

derbOac•5mo ago
In some sense, isn't this overfitting, but "hidden" by the typical feature sets that are observed?

Time and time again, some kind of process will identify some simple but absurd adversarial "trick stimulus" that throws off the deep network solution. These seem like blatant cases of overfitting that go unrecognized or unchallenged in everyday use because the sampling space of stimuli doesn't usually include the adversarial trick stimuli.

I guess I've not really thought of the bias-variance tradeoff as necessarily being about the number of parameters, but rather about the flexibility of the model relative to the learnable information in the sample space. There are some formulations (e.g., Shtarkov-Rissanen normalized maximum likelihood) that treat overfitting in terms of the ability to reproduce data that is wildly outside a typical training set. This is related to, but not the same as, the number of parameters per se.

api•5mo ago
This sounds like it's proposing that what's happening during large model training is a little bit akin to genetic algorithms: many small networks emerge and there is a selection process, some get fixed, and the rest fade and are then repurposed/drifted into other roles, repeat.
xg15•5mo ago
Wouldn't this imply that most of the inference time storage and compute might be unnecessary?

If the hypothesis is true, it makes sense to scale up models as much as possible during training - but once the model is sufficiently trained for the task, wouldn't 99% of the weights be literal "dead weight" - because they represent the "failed lottery tickets", i.e. the subnetworks that did not have the right starting values to learn anything useful? So why do we keep them around and waste enormous amounts of storage and compute on them?

tough•5mo ago
someone on twitter was exploring this and linked to some related papers where you can, for example, trim experts from a MoE model if you're 100% sure they're never active for your specific task

what the bigger, wider net buys you is generalization

markeroon•5mo ago
Look into pruning
paulsutter•5mo ago
That’s exactly how it works, read up on pruning. You can ignore most of the weights and still get great results. One issue is that sparse matrices are vastly less efficient to multiply.

But yes you’ve got it
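
A minimal sketch of what magnitude pruning looks like in PyTorch; the layer sizes and the 90% sparsity target here are arbitrary, and torch.nn.utils.prune provides a maintained version of the same idea:

  import torch

  def magnitude_prune(model: torch.nn.Module, sparsity: float = 0.9) -> None:
      """Zero out the smallest-magnitude weights of every Linear layer."""
      for module in model.modules():
          if isinstance(module, torch.nn.Linear):
              w = module.weight.data
              k = max(int(w.numel() * sparsity), 1)          # how many weights to drop
              threshold = w.abs().flatten().kthvalue(k).values
              mask = (w.abs() > threshold).float()           # keep only the largest weights
              module.weight.data = w * mask                  # the zeros are still stored densely

  # Example: prune 90% of a small MLP's weights.
  mlp = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))
  magnitude_prune(mlp, sparsity=0.9)

The last comment in the sketch is the catch mentioned above: the zeroed weights still sit in dense storage unless you move to a sparse format the hardware can actually exploit.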

FuckButtons•5mo ago
For any particular single pattern learned 99% of the weights are dead weight. But it’s not the same 99% for each lesson learned.
janalsncm•5mo ago
Quick example, Kimi K2 is a recent large mixture of experts model. Each “expert” is really just a path within it. At each token, 32B out of 1T are active. This means only 3.2% are active for any one token.
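
A toy sketch of top-k expert routing (made-up sizes, not Kimi K2's actual design) showing why only a small fraction of the stored weights runs for any given token:

  import torch

  n_experts, k, d = 8, 2, 16
  experts = torch.nn.ModuleList([torch.nn.Linear(d, d) for _ in range(n_experts)])
  router = torch.nn.Linear(d, n_experts)

  def moe_forward(x):                              # x: (num_tokens, d)
      top = router(x).topk(k, dim=-1)              # pick k experts per token
      out = torch.zeros_like(x)
      for t in range(x.shape[0]):
          for j in top.indices[t].tolist():        # only these k experts execute for token t
              out[t] += experts[j](x[t])
      return out / k
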
Sophira•5mo ago
That sounds surprisingly like "Humans only use 10% of their brain at any given time."
highfrequency•5mo ago
Enjoyed the article. To play devil’s advocate, an entirely different explanation for why huge models work: the primary insight was framing the problem as next-word prediction. This immediately creates an internet-scale dataset with trillions of labeled examples, which also has rich enough structure to make huge expressiveness useful. LLMs don’t disprove bias-variance tradeoff; we just found a lot more data and the GPUs to learn from it.

It’s not like people didn’t try bigger models in the past, but either the data was too small or the structure too simple to show improvements with more model complexity. (Or they simply trained the biggest model they could fit on the GPUs of the time.)
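
A sketch of the labeling trick, with naive whitespace splitting standing in for a real tokenizer: every position in the corpus becomes its own (context, next-word) example, no human annotation required.

  text = "the cat sat on the mat"                  # stand-in for "the entire Internet"
  tokens = text.split()                            # real systems use subword tokenizers

  pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
  for context, label in pairs:
      print(context, "->", label)                  # ['the'] -> cat, ['the', 'cat'] -> sat, ...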

pixl97•5mo ago
I think a lot of it is the massive amount of compute we've got in the last decade. While inference may have been possible on the hardware the training would have taken lifetimes.
graemep•5mo ago
I have a textbook somewhere in the house from about 2000 that says that there is no point having more than three layers in a neural network.

Compute was just too expensive to have neural networks big enough for this not to be true.

leumassuehtam•5mo ago
People believed that more parameters would lead to overfitting instead of generalization. The various regularization methods we use today to avoid overfitting hadn't been discovered yet. Your statement is most likely about this.
graemep•5mo ago
Possibly, I would have to dig up the book to check. IIRC it did not mention overfitting but it was a long time ago.
Silphendio•5mo ago
I think the problems with big networks were vanishing gradients, which is why we now use the ReLU activation function, and training stability, which was solved with residual connections.

Overfitting is the problem of having too little training data for your network size.

AndrewOMartin•5mo ago
Once you have three layers (i.e. one "hidden" layer) you can approximate arbitrary functions, so a three-layer network has the same "power" as an arbitrarily large network.

I'm sure that's what the textbook meant, rather than any point about the expense of computing power.

Eisenstein•5mo ago
Why does 'next-word prediction' explain why huge models work? You're saying we needed scale and that we use next-word prediction, but how does one relate to the other? Diffusion models also exist and work well for images, and they do seem to work for LLMs too.
krackers•5mo ago
I think it's the same underlying principle of learning the "joint distribution of things humans have said". Whether done autoregressively via LLMs or via diffusion models, you still end up learning this distribution. The insight seems to be the crazy leap that this is A) a valid thing to talk about and B) that learning this distribution gives you something meaningful.

The leap is in transforming an ill-defined objective of "modeling intelligence" into a concrete proxy objective. Note that the task isn't even "the distribution of valid/true things", since validity/truth is hard to define. It's something akin to "the distribution of things a human might say", implemented in the "dumbest" possible way: modeling the distribution of humanity's collective textual output.

highfrequency•5mo ago
To crack NLP we needed a large dataset of labeled language examples. Prior to next-word prediction, the dominant benchmarks and datasets were things like translation of English to German sentences. These datasets were on the order of millions of labeled examples. Next-word prediction turned the entire Internet into labeled data.
Salgat•5mo ago
RNNs worked that way too; the difference is that Transformers are parallelized, which is what made next-word prediction work so well: you could have an input thousands of tokens long without your training taking thousands of times longer.
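
Roughly what that parallelism means in code (shapes only, no real model; the mask is shown just to indicate how causality is preserved): one forward pass yields a next-token loss at every position, instead of stepping through the sequence like an RNN.

  import torch

  seq_len, vocab = 5, 1000
  causal_mask = torch.tril(torch.ones(seq_len, seq_len))       # position i attends only to positions <= i
  logits = torch.randn(seq_len, vocab)                         # stand-in for the model's output
  targets = torch.randint(0, vocab, (seq_len,))                # the "next token" at every position
  loss = torch.nn.functional.cross_entropy(logits, targets)    # all positions trained in one pass
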
unyttigfjelltol•5mo ago
Language then would be the key factor enabling complex learning in meat space too? I feel like I’ve heard this debate before….
agarsev•5mo ago
as a researcher in NLP slash computational linguistics, this is what I tend to think :) (maybe a less strong version, though, there are other kinds of thinking and learning).

so I'm always surprised when some linguists decry LLMs, and cling to old linguistics paradigms instead of reclaiming the important role of language as (a) vehicle of intelligence.

mindwok•5mo ago
I'm no expert but have been thinking about this a lot lately. I wouldn't be surprised - language itself seems to be an expression of the ability to create an abstraction, distill the world into compressed representations, and manipulate symbols. It seems fundamental to human intelligence.
physix•5mo ago
As a layman, it helps me to understand the importance of language as a vehicle of intelligence by realizing that without language, your thoughts are just emotions.

And therefore I always thought that the more you master a language the better you are able to reason.

And considering how much we let LLMs formulate text for us, how dumb will we get?

highfrequency•5mo ago
> without language, your thoughts are just emotions.

Is that true though? Seems like you can easily have some cognitive process that visualizes things like cause and effect, simple algorithms or at least sequences of events.

ajuc•5mo ago
> without language, your thoughts are just emotions

That's not true. You can think "I want to walk around this building" without words, in abstract thoughts or in images.

Words are a layer above the thoughts, not the thoughts themselves. You can confirm this if you ever had the experience of trying to say something but forgetting the right word. Your mind knew what it wanted to say, but it didn't know the word.

Chess players operate on sequences of moves a dozen turns ahead in their minds using no words, seeing the moves on the virtual chessboards they imagine.

Musicians hear the note they want to play in their minds.

Our brains have full multimedia support.

mindwok•5mo ago
It's probably not as simple as just being emotions, but actually there's a really interesting example here: Helen Keller. In her autobiography she describes what it was like before she learned language, and how she remembers it being almost unconscious and just a mix of feelings and impulses. It's fascinating.
Sophira•5mo ago
In other words, we're rediscovering the lessons from George Orwell's Nineteen Eighty-Four. Language is central to understanding; remove subversive language and you remove the ability to even think about it.
xg15•5mo ago
I think it doesn't have to follow. You could also generalize the idea and see learning as successfully being able to "use the past to predict the future" for small time increments. Next-word prediction would be one instance of this, but for humans and animals, you could imagine the same process with information from all senses. The "self-supervised" trainset is then just, well, life.
kazinator•5mo ago
I think that the takeaway message for meat space (if there is one) is that continuous life-long learning is where it is at: keep engaging your brain and playing the lottery in order to foster the winning tickets. Be exposed to a variety of stimuli and find relationships.
deepsun•5mo ago
Same thing with computer vision: as Andrew Ng pointed out, the main thing that enabled the rapid progress was not new models but large _labeled_ datasets, particularly ImageNet.
Nevermark•5mo ago
Yes: larger usable datasets, paired with an acceleration of mainstream parallel computing power (GPUs) and increasing algorithmic flexibility (CUDA).

Without all three, progress would have been much slower.

highfrequency•5mo ago
Do you have a link handy for where he says this explicitly?
noboostforyou•5mo ago
Here's an older interview where he talks about the need for accurate dataset labeling:

"In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn."

https://spectrum.ieee.org/andrew-ng-data-centric-ai

kazinator•5mo ago
But regarding this Lottery Ticket hypothesis, what it means is that a small percentage of the parameters can be identified such that, when those parameters are taken by themselves, reset to their original initialization values, and the resulting network is trained on the same data as the parent, it performs similarly to the parent. So in fact, it seems that far fewer parameters are needed to encode predictions across the Internet-scale dataset. The large model is just creating a space in which that small "rock star" subset of parameters can be automatically discovered. It's as if the training establishes a competition among small subsets of the network, from which a winner emerges.

Perhaps there is a kind of unstable situation whereby once the winner starts to anneal toward predicting the training data, it is doing more and more of the predictive work. The more relevant the subset shows itself to the result, the more of the learning it captures, because it is more responsive to subsequent learning.
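
A compressed sketch of that procedure; train and make_model are hypothetical stand-ins, and a faithful replication of the original paper would also prune iteratively and keep the mask applied throughout retraining.

  import copy
  import torch

  def lottery_ticket(make_model, train, sparsity=0.9):
      model = make_model()
      init_state = copy.deepcopy(model.state_dict())     # remember the original initialization

      train(model)                                        # 1. train the full, over-sized network

      masks = {}                                          # 2. keep only the largest weights
      for name, p in model.named_parameters():
          k = max(int(p.numel() * sparsity), 1)
          threshold = p.detach().abs().flatten().kthvalue(k).values
          masks[name] = (p.detach().abs() > threshold).float()

      model.load_state_dict(init_state)                   # 3. rewind survivors to their initial values
      with torch.no_grad():
          for name, p in model.named_parameters():
              p.mul_(masks[name])                         # the losing "tickets" are zeroed out

      train(model)                                        # 4. retrain just the sparse subnetwork
      return model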

belter•5mo ago
This article is like a quick street rap. Lots of rhythm, not much thesis. Big on tone, light on analysis...Or no actual thesis other than a feelgood factor. I want these 5 min back.
JasonSage•5mo ago
On the other hand, as somebody not well-read in AI I found it to be a rather intuitive explanation for why pruning helps avoid the overfitting scenario I learned when I first touched neural networks in the ‘10s.

Sure, this could’ve been a paragraph, but it wasn’t. I don’t think it’s particularly offensive for that.

fgfarben•5mo ago
Do you think a GPT that already trained on something "feels" the same way when reading it a second time?
mcphage•5mo ago
> I want these 5 min back.

Tell me, what is it you plan to do

with your five wild and precious minutes?

abhinuvpitale•5mo ago
Interesting article. Is it concluding that different small networks are formed for the different types of problems we are trying to solve with the larger network?

How is this different from overfitting though? (PS: Overfitting isn't that bad if you think about it, as long as the test dataset, or whatever the model sees at inference time, is solving problems covered by the supposedly large enough training dataset)

deepfriedchokes•5mo ago
Rather than reframing intelligence itself, wouldn’t Occam’s Razor suggest instead that this isn’t intelligence at all?
pixl97•5mo ago
I don't really think that is what Occam’s Razor is about. The Razor says the simplest answer is most likely the best, but we already know that intelligence is very complex so the simplest answer to intelligence is still going to be a massively complex solution.

In some ways this answer does fit Occam's Razor by saying the simplicity is simply scale, not complex algorithms.

gavmor•5mo ago
> Intelligence isn't about memorising information—it's about finding elegant patterns that explain complex phenomena. Scale provides the computational space needed for this search, not storage for complicated solutions.

I think the word "finding" is overloaded here. Are we "discovering," "deriving," "deducing," or simply "looking up" these patterns?

If "finding" can be implemented via a multi-page tour—ie deterministic choose-your-own-adventure—of a three-ring-binder (which is, essentially, how inference operates) then we're back at Searle's Chinese Room, and no intelligence is operative at runtime.

On the other hand, if the satisfaction of "finding" necessitates the creative synthesis of novel records pertaining to—if not outright modeling—external phenomena, ie "finding" a proof, then arguably it's not happening at training time, either.

How many novel proofs have LLMs found?

akomtu•5mo ago
Even simpler: intelligence is the art of simplifying. LLMs can fool us if they reduce a book into one wise-looking statement, but remove the deceptive medium - our language - and tell them to reduce a vast dataset of points into one formula, and LLMs will show how much intelligence they truly have.
Eisenstein•5mo ago
Unless you can provide a definition of intelligence which is internally consistent and does not exclude things that are obviously intelligent or include things that are obviously not intelligent, the only thing Occam's razor suggests is that the basis for solving novel problems is the ability to pattern match combined with a lot of background knowledge.
aeternum•5mo ago
IMO Occam's Razor suggests that this is exactly what intelligence is.

The ability to compress information, specifically run it through a simple rule that allows you to predict some future state.

The rules are simple but finding them is hard. The ability to find those rules, compress information, and thus predict the future efficiently is the very essence of intelligence.

gotoeleven•5mo ago
This article gives a really bad/wrong explanation of the lottery ticket hypothesis. Here's the original paper

https://arxiv.org/abs/1803.03635

brulard•5mo ago
Thanks for the 42-page document. Can you explain in a few words why you evaluated it as a "really bad/wrong explanation"?
frrlpp•5mo ago
What are LLMs for?
jeremyscanvic•5mo ago
Exactly what I was looking for while reading the post. Thanks!
ghssds•5mo ago
Can someone explain how AI research can have a 300-year history?
anthonj•5mo ago
"For over 300 years, one principle governed every learning system: the bias-variance tradeoff."

The bias-variance tradeoff is a very old concept in statistics (not sure exactly how old; it might very well be 300 years).

Anyway, note that the first algorithms related to neural networks are older than the digital computer by at least a decade.

woadwarrior01•5mo ago
300 years is a stretch. But Legendre described linear regression ~220 years ago (1805). And from a very high-level perspective, modern neural networks are mostly just stacks of linear regression layers with non-linearities sandwiched between them. I'm obviously oversimplifying it a lot, but that's the gist of it.
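
Taking that picture literally (sizes arbitrary), each Linear layer is y = Wx + b, i.e. a multivariate linear regression, with a nonlinearity sandwiched in between:

  import torch

  model = torch.nn.Sequential(
      torch.nn.Linear(10, 64),    # linear regression #1
      torch.nn.ReLU(),            # the sandwiched nonlinearity
      torch.nn.Linear(64, 64),    # linear regression #2
      torch.nn.ReLU(),
      torch.nn.Linear(64, 1),     # final linear readout
  )
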
littlestymaar•5mo ago
Maybe it wasn't there originally, but now there's a footnote:

> 1. The 300-year timeframe refers to the foundational mathematical principles underlying modern bias-variance analysis, not the contemporary terminology. Bayes' theorem (1763) established the mathematical framework for updating beliefs with evidence, whilst Laplace's early work on statistical inference (1780s-1810s) formalised the principle that models must balance fit with simplicity to avoid spurious conclusions. These early statistical insights—that overly complex explanations tend to capture noise rather than signal—form the mathematical bedrock of what we now call the bias-variance tradeoff. The specific modern formulation emerged over several decades in the latter 20th century, but the core principle has governed statistical reasoning for centuries.↩

nitwit005•5mo ago
> For over 300 years, one principle governed every learning system

This seems strangely worded. I assume that date is when some statistics paper was published, but there's no way to know with no definition or citations.

littlestymaar•5mo ago
There is in fact a footnote about the date:

> 1. The 300-year timeframe refers to the foundational mathematical principles underlying modern bias-variance analysis, not the contemporary terminology. Bayes' theorem (1763) established the mathematical framework for updating beliefs with evidence, whilst Laplace's early work on statistical inference (1780s-1810s) formalised the principle that models must balance fit with simplicity to avoid spurious conclusions. These early statistical insights—that overly complex explanations tend to capture noise rather than signal—form the mathematical bedrock of what we now call the bias-variance tradeoff. The specific modern formulation emerged over several decades in the latter 20th century, but the core principle has governed statistical reasoning for centuries.

nitwit005•5mo ago
They added it after my post.
doctoboggan•5mo ago
This article definitely feels like chatgptese.

Also, I don't necessarily feel like the size of LLMs even comes close to overfitting the data. From a very unscientific standpoint, it seems like the size of the weights on disk would have to meet or exceed the size of the training data (modulo lossless compression techniques) for overfitting to occur. Since the training data is multiple orders of magnitude larger than the resulting weights, isn't that proof that the weights are some sort of generalization of the input data rather than a memorization?
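
A back-of-the-envelope version of that size comparison, with assumed numbers (not taken from the article) for a recent large model and training corpus:

  params = 70e9                        # assumed model size: 70B parameters
  weight_bytes = params * 2            # 2 bytes each in fp16/bf16 -> ~140 GB of weights

  tokens = 15e12                       # assumed training corpus: 15T tokens
  corpus_bytes = tokens * 4            # ~4 bytes of raw text per token -> ~60 TB of data

  print(corpus_bytes / weight_bytes)   # the corpus is roughly 400x larger than the weights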

porridgeraisin•5mo ago
1) yes it's definitely chatgpt

2) The weights are definitely a generalization. The compression-based argument is sound.

3) There is definitely no overfitting. The article however used the word over-parameterization, which is a different thing. And LLMs are certainly over-parameterized. They have more parameters than strictly required to represent the dataset in a degrees-of-freedom statistical sense. This is not a bad thing though.

Just like having an over-parameterized database schema:

  quiz(id, title, num_qns)
  question(id, text, answer, quiz_id FK)
can be good for performance sometimes,

The lottery ticket hypothesis as chatgpt explained in TFA means that over-parameterization can also be good for neural networks sometimes. Note that this hypothesis is strictly tied to the fact that we use SGD (or adam or ...) as the optimisation algorithm. SGD is known to be biased towards generalized compressions [the lottery ticket hypothesis hypothesises why this is so]. That is to say, it's not an inherent property of the neural network architecture or transformers or such.

quantgenius•5mo ago
The idea that simply having a lot of parameters leads to overfitting was shown to not be the case over 30 years ago by Vapnik et al. He proved that a large number of parameters is fine so long as you regularize enough. This is why Support Vector Machines work and I believe has a lot to do with why deep NNs work.

The issue with Vapnik's work is that it's pretty dense, and actually figuring out the Vapnik-Chervonenkis (VC) dimension etc. is pretty complicated, and one can develop pretty good intuition once you understand the material without having to actually calculate, so most people don't take the time to do the calculation. And frankly, a lot of the time, you don't need to.

There may be something I'm missing completely, but to me the fact that models continue to generalize with a huge number of parameters is not all that surprising given how much we regularize when we fit NNs. A lot of the surprise comes from the fact that people in mathematical statistics and people who do neural networks (computer scientists) don't talk to each other as much as they should.

Strongly recommend the book Statistical Learning Theory by Vapnik for more on this.
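
For concreteness, the simplest form of the regularization being referred to is an L2 penalty (weight decay); a PyTorch-style sketch with arbitrary penalty strength:

  import torch

  model = torch.nn.Linear(1000, 1)     # many parameters relative to the data
  opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-3)   # built-in L2 penalty

  # The same thing written out by hand as an explicit term in the loss:
  def loss_fn(pred, target, model, lam=1e-3):
      mse = torch.nn.functional.mse_loss(pred, target)
      l2 = sum((p ** 2).sum() for p in model.parameters())
      return mse + lam * l2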

math_dandy•5mo ago
I don't buy the narrative that the article is promoting.

I think the machine learning community was largely over overfitophobia by 2019 and people were routinely using overparametrized models capable of interpolating their training data while still generalizing well.

The Belkin et al. paper wasn't heresy. The authors were making a technical point - that certain theories of generalization are incompatible with this interpolation phenomenon.

The lottery ticket hypothesis paper's demonstration of the ubiquity of "winning tickets" - sparse parameter configurations that generalize - is striking, but these "winning tickets" aren't the solutions found by stochastic gradient descent (SGD) algorithms in practice. In the interpolating regime, the minima found by SGD are simple in a different sense perhaps more closely related to generalization. In the case of logistic regression, they are maximum margin classifiers; see https://arxiv.org/pdf/1710.10345.

The article points out some cool papers, but the narrative of plucky researchers bucking orthodoxy in 2019 doesn't track for me.

ActorNightly•5mo ago
Yeah this article gets a whole bunch of history wrong.

Back in the 2000s, the reason nobody was pursuing neural nets was simply compute power, and the fact that you couldn't iterate fast enough to make smaller neural networks work.

People had been doing genetic algorithms and PSO for quite some time. Everyone knew that multi-dimensionality was the solution to overfitting - the more directions you can use to climb out of valleys, the better the system performed.

jfrankle•5mo ago
whyyy
moi2388•5mo ago
Isn’t "small" or "large" relative to the amount of data, and aren’t the current large models a result of there being so incredibly much data available?