
Byte latent transformer: Patches scale better than tokens

https://arxiv.org/abs/2412.09871
65•dlojudice•4h ago

Comments

dlojudice•4h ago
This BLT approach is why "AI research is stalling" takes are wrong. Dynamic byte-level patches instead of tokens seem genuinely innovative, not just scaling up the same architecture. Better efficiency AND better handling of edge cases? Actual progress. The field is still finding clever ways to rethink fundamentals.
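For a concrete sense of what "dynamic byte-level patches" means: the paper groups raw bytes into variable-length patches using the next-byte entropy of a small byte-level language model, so more patches (and more compute) land where the byte stream is hard to predict. The sketch below is only a toy illustration of that idea; a unigram byte-frequency model stands in for the paper's entropy model, and the threshold / max_len knobs are made up.

    # Toy sketch of entropy-based dynamic patching (illustration only; the
    # paper uses a small byte-level LM to score next-byte entropy, here a
    # unigram byte-frequency model stands in for it).
    import numpy as np

    def byte_surprisal(data: bytes) -> np.ndarray:
        # Stand-in "entropy model": surprisal of each byte under a Laplace-
        # smoothed unigram distribution estimated from the data itself.
        arr = np.frombuffer(data, dtype=np.uint8)
        counts = np.bincount(arr, minlength=256)
        probs = (counts + 1) / (counts.sum() + 256)
        return -np.log2(probs[arr])

    def dynamic_patches(data: bytes, threshold: float = 6.0, max_len: int = 16):
        # Start a new patch wherever a byte looks "surprising" (high
        # surprisal) or the current patch has grown too long.
        surps = byte_surprisal(data)
        patches, start = [], 0
        for i in range(1, len(data)):
            if surps[i] > threshold or i - start >= max_len:
                patches.append(data[start:i])
                start = i
        patches.append(data[start:])
        return patches

    for p in dynamic_patches(b"Byte latent transformer: patches scale better than tokens."):
        print(p)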
zamalek•3h ago
I think the sentiment (at least my sentiment) is that "mainstream ML" has fallen into the transformer local minimum, and given the weight of the players in that space it will take a huge amount of force to move them out of it.

The likes of this, Mercury Coder, and even RWKV are definitely hopeful - but there's a pitch-black shadow of hype and speculation to outshine.

anon291•3h ago
I disagree. Most AI innovation today is around things like agents, integrations, and building out use cases. This is possible because transformers have made human-like AI possible for the first time in history. These use cases will remain the same even if the underlying architecture changes. The number of people working on new architectures today is far larger than the number working on neural networks in 2017, when 'Attention Is All You Need' came out. Nevertheless, actual ML model researchers are only a small portion of the total ML/AI community, and this is fine.
janalsncm•2h ago
> AI innovation today

I think you are talking about something else. In my opinion, integration is very different from fundamental ML research.

anon291•38m ago
There is more fundamental ML research today than at any other point in history, including in non-transformer architectures. That is my point. It doesn't seem that way because 90%+ of 'ML research' has nothing to do with fundamental ML and is instead research around applications, which are indifferent to the underlying model at the end of the day. That was the point of my comment.
Retric•6m ago
That depends on where you draw the threshold for research being fundamental. It’s not hard to argue way less than 1% of AI research is actually fundamental compared to the early days, but that’s because the term is so arbitrary.
Retric•1h ago
The sheer scale of computation and data available is what’s pushing AI to near human levels. The same algorithms in 1980 wouldn’t be nearly as useful.
mdaniel•1h ago
I've secretly wondered if the next (ahem) quantum leap in output quality will arrive with quantum computing, wherein answering 10,000 if statements simultaneously would radically change the inference pipeline.

But I am also open to the possibility that I'm thinking of this in terms of 'faster horses' and not asking the right question.

spindump8930•49m ago
It's not clear how your perception of quantum computing would lead to 'faster horses' in the current view of NN architectures - keep in mind that the common view of 'exploring many paths simultaneously' is at best an oversimplification (https://scottaaronson.blog/?p=2026).

That said, perhaps advances in computing fundamentals would lead to something entirely new (and not at all horselike).

anon291•37m ago
If you can tie a neural network's loss function to the energy state of a quantum system, then presumably letting the system settle to its energy minimum would be equivalent to a training step, but perhaps much faster.
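A purely classical way to picture that analogy: treat the loss as an energy function and let a relaxation process settle the parameters into a minimum. The sketch below uses simulated annealing as a stand-in for the physical settling; nothing in it is quantum, and the toy model, schedule, and constants are arbitrary.

    # Classical stand-in for "settle to the energy minimum": the network's
    # loss plays the role of energy, and Metropolis-style annealing plays the
    # role of the physical relaxation. Nothing here is quantum.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)  # toy XOR-like target

    def energy(w):
        # "Energy" = mean squared error of a tiny one-neuron tanh model.
        h = np.tanh(X @ w[:2])
        return float(np.mean((h * w[2] - y) ** 2))

    w, temp = rng.normal(size=3), 1.0
    for _ in range(5000):
        proposal = w + rng.normal(scale=0.1, size=3)
        dE = energy(proposal) - energy(w)
        if dE < 0 or rng.random() < np.exp(-dE / temp):  # Metropolis acceptance
            w = proposal
        temp *= 0.999  # cool down

    print("final energy (loss):", energy(w))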
anon291•39m ago
It's true, but you can't deny the importance of the architecture. It's pretty clear that using simple perceptrons would not have led us down the same path.
Retric•9m ago
Sure, but I think a reasonable corollary is that new algorithms and architectures will show their strengths when new realms of computation become available.
spindump8930•38m ago
If you consider most of the dominant architectures in deep-learning-style approaches, transformers are remarkably generic. If you reduce transformer-like architectures to "position-independent iterated self-attention with intermediate transformations", they can support ~all modalities and incorporate other representations (e.g. convolutions, CLIP-style embeddings, graphs or sequences encoded with additional position embeddings); a minimal sketch of this framing follows below. On top of that, they're very compute friendly.

Two of the largest weaknesses seem to be auto-regressive sampling (not unique to the base architecture) and expensive self-attention over very long contexts (whether sequence-shaped or generic graph-shaped). Many researchers are focusing their efforts there!

Also see: https://www.isattentionallyouneed.com/
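To make that "generic" point concrete, here is a minimal sketch of a position-independent block of iterated self-attention plus an intermediate transformation. It only ever sees a (batch, n_items, d_model) tensor, which is why the same core can consume byte patches, image regions, or graph nodes once they (and any positional or structural information) have been embedded into vectors; all dimensions and layer choices below are arbitrary.

    # Minimal "position-independent iterated self-attention with intermediate
    # transformations" block: it never looks at what the items are, only at a
    # (batch, n_items, d_model) tensor.
    import torch
    import torch.nn as nn

    class GenericBlock(nn.Module):
        def __init__(self, d_model: int = 256, n_heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Self-attention over a set of items, then a per-item MLP.
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            return x + self.mlp(self.norm2(x))

    # Any modality reduced to (batch, n_items, d_model) flows through unchanged:
    items = torch.randn(2, 100, 256)  # byte patches, image regions, graph nodes...
    out = GenericBlock()(items)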

anon291•20m ago
Transformers are very close to some types of feed-forward networks. The difference is that transformers can be trained in parallel without the need for auto-regression (which is slow for training, but kind of nice for streaming, low-latency inference). It's a mathematical trick. RWKV makes it obvious.
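The trick in concrete form: a linear recurrence can be evaluated step by step (constant-size state, nice for streaming) or unrolled into one parallel weighted sum over all positions (nice for training), and both give the same outputs. The toy recurrence below is a heavily simplified stand-in, not RWKV's actual formulation.

    # A decayed linear recurrence, computed two ways that agree exactly.
    import numpy as np

    T, decay = 8, 0.9
    rng = np.random.default_rng(0)
    k, v = rng.normal(size=T), rng.normal(size=T)

    # Recurrent form: O(1) state per step, ideal for low-latency streaming.
    state, recurrent_out = 0.0, []
    for t in range(T):
        state = decay * state + k[t] * v[t]
        recurrent_out.append(state)

    # Parallel form: every timestep at once, as one big weighted sum
    # out[t] = sum_{i <= t} decay^(t-i) * k[i] * v[i].
    weights = decay ** (np.arange(T)[:, None] - np.arange(T)[None, :])
    parallel_out = np.tril(weights) @ (k * v)

    assert np.allclose(recurrent_out, parallel_out)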
janalsncm•2h ago
I think DeepSeek (v3 and r1) showed us that there’s still a ton of meat on the bone for fundamental research and optimization.
Lerc•56m ago
Absolutely, I have seen so many good ideas that have not yet made it into notable trained models.

A lot of that is because you need to have a lot more faith than "seems like a good idea" before you spend a few million in training that depends upon it.

Some of it is because when the models being released now began training, a lot of those ideas hadn't been published yet.

Time will resolve most of that: cheaper and more performant hardware will allow a lot of those ideas to be tested without the massive commitment required to build leading-edge models.

Workaccount2•47m ago
The big guys are almost certainly incinerating millions a day on training "maybe it could show some promise" techniques. With the way things are right now, they are probably greenlighting everything to find an edge.
joe_the_user•2h ago
I don't think you're understanding what the "stall" arguments are saying.

Certainly tweaks to performance continue, but as I understand it, the stalling argument looks at the tendency of broad, "subjective" LLM performance not to get beyond a certain level. Basically, that the massive projects throwing more data and training at the thing yield more marginal apparent improvements than the jumps we saw with GPT 2-3-3.5-4.

The situation imo is that at some point, once you've ingested and trained on all the world's digitized books, all the coherent parts of the Internet, etc., you hit a limit to what you can get with just "predict next" training. More data after that is more of the same at a higher level.

But again, no doubt, progress at the level of algorithms will continue (DeepSeek was an indication of what's possible). But such progress essentially gets you adequate LLMs faster, rather than any progress towards "general intelligence".

Edit: clarity and structure

gwern•2h ago
It is pretty much the same scaling, though: https://arxiv.org/pdf/2412.09871#page=10 It just lets you avoid some of the pathologies of BPEs.
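An example of the kind of BPE pathology in question (this assumes the tiktoken package; cl100k_base is just one common BPE vocabulary): superficially similar strings get carved into quite different token sequences, whereas a byte-level model sees every string the same way, one byte at a time.

    # Show how a BPE vocabulary segments similar-looking strings differently.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for s in ["12345678", "1 2 3 4 5 6 7 8", "hello world", "HeLLo WoRLd"]:
        toks = enc.encode(s)
        print(f"{s!r:24} -> {len(toks)} tokens: {[enc.decode([t]) for t in toks]}")

    # Byte-level view: the "tokenization" is just the UTF-8 bytes themselves.
    print(list("HeLLo WoRLd".encode("utf-8")))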
spindump8930•44m ago
This paper is very cool, comes from respected authors, and is a very nice idea with good, FLOP-controlled experiments. It shouldn't be seen as a wall-breaking innovation, though. From the paper:

> Existing transformer libraries and codebases are designed to be highly efficient for tokenizer-based transformer architectures. While we present theoretical flop matched experiments and also use certain efficient implementations (such as FlexAttention) to handle layers that deviate from the vanilla transformer architecture, our implementations may yet not be at parity with tokenizer-based models in terms of wall-clock time and may benefit from further optimizations.

And unfortunately, wall-clock deficiencies mean that any quality improvement needs to overcome that additional scaling barrier before any big (i.e., expensive) runs can risk using it.

armcat•3h ago
This was previously reported 5 months ago: https://news.ycombinator.com/item?id=42415122 (84 comments).

As an aside - I am a big fan of Luke Zettlemoyer and his team at the University of Washington. They've been doing cool NLP research for years!

Multiple Security Issues in Screen

https://security.opensuse.org/2025/05/12/screen-security-issues.html
2•_JamesA_•2m ago•0 comments

First white South Africans arrive in US as Trump claims they face discrimination

https://www.reuters.com/world/first-white-south-africans-fly-us-under-trump-refugee-plan-2025-05-12/
1•belter•3m ago•0 comments

Anthropic Cofounder: 'Manager Nerds' Will Be 'Incredibly Powerful'

https://www.businessinsider.com/anthropic-cofounder-jack-clark-ai-manager-nerds-2025-5
1•andrewfromx•4m ago•1 comments

Chinese researchers develop silicon-free transistor, claim efficient and fast

https://www.techradar.com/pro/chinese-researchers-develop-silicon-free-transistor-technology-claimed-to-be-fastest-and-most-efficient-ever-heres-what-we-know
1•gnabgib•4m ago•0 comments

Ninth Bridgewater Treatise

https://en.wikipedia.org/wiki/Ninth_Bridgewater_Treatise
1•benbreen•6m ago•0 comments

Coinbase set to join S&P 500

https://www.cnbc.com/2025/05/12/coinbase-joining-sp-500-replacing-discover-financial.html
1•mfiguiere•7m ago•0 comments

A Year Later: Getting Kicked Out of the Recurse Center

https://notebook.wesleyac.com/rc-reflection/
1•gaws•8m ago•0 comments

Reasoning LLMs Guide

https://docs.google.com/document/d/1AwylUdyciJhvYn-64ltpe79UL7_G-BmNwqs4NNt4oQ0/edit?usp=sharing
1•omarsar•8m ago•0 comments

FCC Seeks Comment on EchoStar Licenses of 2 GHz MSS Spectrum

https://docs.fcc.gov/public/attachments/DA-25-405A1.txt
1•impish9208•12m ago•0 comments

Self-hosting HyperDX for fun and profit

https://weberdominik.com/blog/self-host-hyperdx/
1•brendanashworth•13m ago•0 comments

Show HN: Launched Badges-showcase launches on HN, Reddit and more, not just PH

https://launched-badges.lovable.app/
1•sundaywong•16m ago•0 comments

ChatGPT could never get a PhD in geography

https://garymarcus.substack.com/p/chatgpt-blows-mapmaking-101
2•garymarcus•17m ago•3 comments

Why aren't more Windows programs written in JavaScript?

https://old.reddit.com/r/microsoft/comments/1kkzmmu/why_arent_more_windows_programs_written_in/
2•bundie•17m ago•0 comments

The DoD Is Looking for C-UAS Low-Cost Sensing Solutions

https://www.diu.mil/latest/diu-presents-c-uas-low-cost-sensing-challenge
1•josh_carterPDX•19m ago•0 comments

Show HN: Understand your current page at a glance – chrome extension

https://chromewebstore.google.com/detail/page-overview/linicdbaokahhhglapipfcadglghbadh
1•samiezkay•22m ago•0 comments

Why is Bella Ramsey the target of so much hate?

https://english.elpais.com/culture/2025-05-12/why-is-bella-ramsey-the-target-of-so-much-hate-the-last-of-us-star-sparks-the-fury-of-the-manosphere.html
5•geox•22m ago•1 comments

New obesity drugs are coming

https://www.nature.com/articles/d41586-025-00404-9
2•paulpauper•23m ago•0 comments

Zero-shot forecasting of chaotic systems

https://arxiv.org/abs/2409.15771
1•wil3•24m ago•0 comments

US-China Tariff Pause Spurs Stock Market Surge

https://www.nytimes.com/2025/05/11/business/us-china-trade-stock-market.html
1•paulpauper•24m ago•0 comments

What the hell are rare earth elements?

https://thehustle.co/originals/what-the-hell-are-rare-earth-elements?hubs_content=thehustle.co/&hubs_content-cta=What%20the%20hell%20are%20rare%20earth%20elements?
1•paulpauper•25m ago•1 comments

Hunting extreme microbes that redefine the limits of life

https://www.nature.com/articles/d41586-025-01464-7
1•gnabgib•25m ago•0 comments

AI-focused software engineering consulting for startups and dev teams

https://seconsultant.gumroad.com/l/soft
1•kalel314•26m ago•1 comments

MethaneSAT

https://www.methanesat.org/
1•simonebrunozzi•27m ago•0 comments

Observer Theory

https://writings.stephenwolfram.com/2023/12/observer-theory/
1•Anon84•30m ago•0 comments

Lavaforming

https://saparkitektar.is/PROJECTS
1•jonah•30m ago•0 comments

Little Dig Game

https://little-dig-ga.me/
1•gaws•31m ago•0 comments

Show HN: GS-Base – A multifunctional database tool with Python integration

https://citadel5.com/gs-base.htm
1•jpiech•32m ago•0 comments

Henk Rogers on buying Tetris and foiling

https://www.theguardian.com/games/2025/may/12/henk-rogers-interview-tetris-kgb
5•billybuckwheat•35m ago•0 comments

Hegel 2.0: The imaginary history of ternary computing (2018)

https://www.cabinetmagazine.org/issues/65/weatherby.php
1•Hooke•36m ago•0 comments

Macroscale ceramic origami structures with hyper-elastic coating

https://link.springer.com/article/10.1007/s42114-025-01284-3
1•PaulHoule•39m ago•0 comments