frontpage.

Apache Poison Fountain

https://gist.github.com/jwakely/a511a5cab5eb36d088ecd1659fcee1d5
1•atomic128•1m ago•0 comments

Web.whatsapp.com appears to be having issues syncing and sending messages

http://web.whatsapp.com
1•sabujp•1m ago•1 comments

Google in Your Terminal

https://gogcli.sh/
1•johlo•3m ago•0 comments

Shannon: Claude Code for Pen Testing

https://github.com/KeygraphHQ/shannon
1•hendler•3m ago•0 comments

Anthropic: Latest Claude model finds more than 500 vulnerabilities

https://www.scworld.com/news/anthropic-latest-claude-model-finds-more-than-500-vulnerabilities
1•Bender•7m ago•0 comments

Brooklyn cemetery plans human composting option, stirring interest and debate

https://www.cbsnews.com/newyork/news/brooklyn-green-wood-cemetery-human-composting/
1•geox•7m ago•0 comments

Why the 'Strivers' Are Right

https://greyenlightenment.com/2026/02/03/the-strivers-were-right-all-along/
1•paulpauper•9m ago•0 comments

Brain Dumps as a Literary Form

https://davegriffith.substack.com/p/brain-dumps-as-a-literary-form
1•gmays•9m ago•0 comments

Agentic Coding and the Problem of Oracles

https://epkconsulting.substack.com/p/agentic-coding-and-the-problem-of
1•qingsworkshop•10m ago•0 comments

Malicious packages for dYdX cryptocurrency exchange empty user wallets

https://arstechnica.com/security/2026/02/malicious-packages-for-dydx-cryptocurrency-exchange-empt...
1•Bender•10m ago•0 comments

Show HN: I built a <400ms latency voice agent that runs on a 4 GB VRAM GTX 1650

https://github.com/pheonix-delta/axiom-voice-agent
1•shubham-coder•11m ago•0 comments

Penisgate erupts at Olympics; scandal exposes risks of bulking your bulge

https://arstechnica.com/health/2026/02/penisgate-erupts-at-olympics-scandal-exposes-risks-of-bulk...
4•Bender•11m ago•0 comments

Arcan Explained: A browser for different webs

https://arcan-fe.com/2026/01/26/arcan-explained-a-browser-for-different-webs/
1•fanf2•13m ago•0 comments

What did we learn from the AI Village in 2025?

https://theaidigest.org/village/blog/what-we-learned-2025
1•mrkO99•13m ago•0 comments

An open replacement for the IBM 3174 Establishment Controller

https://github.com/lowobservable/oec
1•bri3d•16m ago•0 comments

The P in PGP isn't for pain: encrypting emails in the browser

https://ckardaris.github.io/blog/2026/02/07/encrypted-email.html
2•ckardaris•18m ago•0 comments

Show HN: Mirror Parliament where users vote on top of politicians and draft laws

https://github.com/fokdelafons/lustra
1•fokdelafons•18m ago•1 comments

Ask HN: Opus 4.6 ignoring instructions, how to use 4.5 in Claude Code instead?

1•Chance-Device•20m ago•0 comments

We Mourn Our Craft

https://nolanlawson.com/2026/02/07/we-mourn-our-craft/
1•ColinWright•22m ago•0 comments

Jim Fan calls pixels the ultimate motor controller

https://robotsandstartups.substack.com/p/humanoids-platform-urdf-kitchen-nvidias
1•robotlaunch•26m ago•0 comments

Exploring a Modern SMPTE 2110 Broadcast Truck with My Dad

https://www.jeffgeerling.com/blog/2026/exploring-a-modern-smpte-2110-broadcast-truck-with-my-dad/
1•HotGarbage•26m ago•0 comments

AI UX Playground: Real-world examples of AI interaction design

https://www.aiuxplayground.com/
1•javiercr•27m ago•0 comments

The Field Guide to Design Futures

https://designfutures.guide/
1•andyjohnson0•27m ago•0 comments

The Other Leverage in Software and AI

https://tomtunguz.com/the-other-leverage-in-software-and-ai/
1•gmays•29m ago•0 comments

AUR malware scanner written in Rust

https://github.com/Sohimaster/traur
3•sohimaster•31m ago•1 comments

Free FFmpeg API [video]

https://www.youtube.com/watch?v=6RAuSVa4MLI
3•harshalone•32m ago•1 comments

Are AI agents ready for the workplace? A new benchmark raises doubts

https://techcrunch.com/2026/01/22/are-ai-agents-ready-for-the-workplace-a-new-benchmark-raises-do...
2•PaulHoule•37m ago•0 comments

Show HN: AI Watermark and Stego Scanner

https://ulrischa.github.io/AIWatermarkDetector/
1•ulrischa•37m ago•0 comments

Clarity vs. complexity: the invisible work of subtraction

https://www.alexscamp.com/p/clarity-vs-complexity-the-invisible
1•dovhyi•38m ago•0 comments

Solid-State Freezer Needs No Refrigerants

https://spectrum.ieee.org/subzero-elastocaloric-cooling
2•Brajeshwar•38m ago•0 comments

Universal Reasoning Model (53.8% pass@1 on ARC-1 and 16.0% on ARC-2)

https://arxiv.org/abs/2512.14693
131•marojejian•1mo ago

Comments

marojejian•1mo ago
Sounds like a further improvement in the spirit of HRM & TRM models.

Decent comment via x: https://x.com/r0ck3t23/status/2002383378566303745

I continue to be fascinated by these architectures that:

- Build in recurrence / inference scaling to transformers more natively.
- Don't use full recurrent gradient traces, and succeed not just despite, but because of that.
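A minimal PyTorch sketch of that second point, in the HRM/TRM spirit (names, shapes, and the choice of block are illustrative, not taken from the paper): iterate a shared block several times without tracking gradients, and backpropagate only through the final step rather than through the full recurrent trace.

    import torch
    import torch.nn as nn

    class RecurrentReasoner(nn.Module):
        """Applies one shared block repeatedly; only the last step gets gradients."""

        def __init__(self, dim: int = 256, steps: int = 8):
            super().__init__()
            self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.steps = steps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = x
            with torch.no_grad():              # no full backprop-through-time
                for _ in range(self.steps - 1):
                    h = self.block(h)
            return self.block(h)               # gradients flow through one step only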

Moosdijk•1mo ago
Interesting. Instead of running the model once (flash) or multiple times (thinking/pro) in its entirety, this approach seems to apply the same principle within one run, looping back internally.

Instead of big models that “brute force” the right answer by knowing a lot of possible outcomes, this model seems to come to results with less knowledge but more wisdom.

Kind of like having a database of most possible frames in a video game and blending between them instead of rendering the scene.

omneity•1mo ago
Isn’t this in a sense an RNN built out of a slice of an LLM? Which if true means it might have the same drawbacks, namely slowness to train but also benefits such as an endless context window (in theory)
ctoa•1mo ago
It's sort of an RNN, but it's also basically a transformer with shared layer weights. Each step is equivalent to one transformer layer, so the computation for n steps is the same as the computation for a transformer with n layers.

The notion of a context window applies to the sequence, and this doesn't really affect that: each iteration sees and attends over the whole sequence.
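A minimal PyTorch sketch of that equivalence (dimensions are illustrative, not the paper's): one layer looped n times does the same work per forward pass as n distinct layers, while attending over the whole sequence at every step.

    import torch
    import torch.nn as nn

    dim = 256
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    deep = nn.ModuleList(
        nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True) for _ in range(6)
    )

    x = torch.randn(2, 10, dim)        # (batch, sequence, dim)

    h_shared = x
    for _ in range(6):                 # same weights every iteration
        h_shared = layer(h_shared)     # each step attends over the whole sequence

    h_deep = x
    for blk in deep:                   # distinct weights per layer
        h_deep = blk(h_deep)

    # Same FLOPs per forward pass; the looped version stores 1/6 of the parameters.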

omneity•1mo ago
Thanks, this was helpful! Reading the seminal paper[0] on Universal Transformers also gave some insights:

> UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs.

Very interesting: it seems to be an "old" architecture that is only now being leveraged to a promising extent. Curious what made it an active area again (with the work from Samsung and Sapient, and now this one); perhaps diminishing returns on regular transformers?

0: https://arxiv.org/abs/1807.03819

nl•1mo ago
> Instead of running the model once (flash) or multiple times (thinking/pro) in its entirety

I'm not sure what you mean here, but there isn't a difference in the number of times a model runs during inference.

Moosdijk•1mo ago
I meant going straight to the likeliest output (flash), or (iteratively) generating multiple outputs and (iteratively) choosing the best one (thinking/pro).
nl•1mo ago
That's not how these models work.

Thinking models produce thinking tokens to reason out the answer.

mysterEFrank•1mo ago
I'm surprised more attention isn't paid to this research direction, and that nobody has tried to generalize it, for example by combining the recurrence concept with next-token prediction. That said, despite the considerable gains, this seems to be mostly hyperparameter tweaking rather than a foundational improvement.
whiplash451•1mo ago
Not just hyperparameter tweaking. Not foundational research either. Rather, engineering improvements that compound with each other (conswiglu layers, the Muon optimizer).
in-silico•1mo ago
> nobody has tried to generalize it for example by combining the recurrence concept with next token prediction

Here you go: https://arxiv.org/abs/2502.05171

mysterEFrank•1mo ago
Thanks! This seems to work incredibly well.
E-Reverance•1mo ago
It should be noted that these are NOT the official scores on the private evaluation set.
viraptor•1mo ago
Here it matters much less than in generic LLMs though. There's no chance of test set leakage since the network is not general purpose / not trained on the internet.
amluto•1mo ago
This design implicitly does something similar to something that I sometimes think conventional transformers should try: allowing later layers to query the KV data from earlier layers. As far as I can tell, with a conventional transformer, if a later (and presumably higher-level-thinking) layer wants to take input from something lower down at earlier token positions, it needs to get that information from the layer outputs and "remember" it by itself instead of just reading it directly.

But suppose an extra attention head were added that queried the KV data from lower layers. At the very least, I imagine this might cleanly solve the STRAWBERRY problem: whatever layer has figured out that the prompt wants to count instances of R could attend to lower layers that actually perceive those Rs.
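A hypothetical sketch of that extra head (PyTorch; the names are mine, and this is the commenter's proposal rather than anything a standard transformer or the paper does): queries come from the later layer's hidden state, while keys and values come from a cached earlier layer's hidden state.

    import torch
    import torch.nn as nn

    class CrossLayerHead(nn.Module):
        def __init__(self, dim: int = 256, n_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

        def forward(self, h_late: torch.Tensor, h_early: torch.Tensor) -> torch.Tensor:
            # Queries from the later layer; keys/values read directly from the
            # earlier layer's activations instead of from the current stream.
            out, _ = self.attn(query=h_late, key=h_early, value=h_early)
            return h_late + out   # residual add back into the later layer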

krackers•1mo ago
> extra attention head were added that queried the KV data from lower layers

Isn't this sort of similar to latent looping? E.g. [1]. But actually as [2] argues, even that wasn't a good experiment because it used the very last hidden state, which is too close to the logits and loses most of the rich embedding structure. Perhaps you don't even need access to the state of anything except the penultimate hidden layer, since based on my vague reading of [3] the residual stream doesn't "lose information" as it passes deeper down the attention layers, so each block maybe manipulates a different subspace of the residual stream.

[1] https://arxiv.org/abs/2412.06769

[2] https://snimu.github.io/2025/03/30/multi-layer-language-head...

[3] https://news.ycombinator.com/item?id=45758093
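For context, a rough sketch of the latent-looping idea in [1] (the model here is a placeholder assumed to return last-layer hidden states; this is not the paper's code): instead of sampling a token, the final position's hidden state is appended back as the next input embedding for a few steps before decoding.

    import torch

    def latent_loop(model, input_embeds: torch.Tensor, n_latent_steps: int) -> torch.Tensor:
        embeds = input_embeds                          # (batch, seq, dim)
        for _ in range(n_latent_steps):
            hidden = model(inputs_embeds=embeds)       # assumed to return (batch, seq, dim)
            last = hidden[:, -1:, :]                   # final position's last hidden state
            embeds = torch.cat([embeds, last], dim=1)  # feed it back as a "continuous token"
        return embeds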

amluto•1mo ago
> Perhaps you don't even need access to the state of anything except the penultimate hidden layer, since based on my vague reading of [3] the residual stream doesn't "lose information" as it passes deeper down the attention layers, so each block maybe manipulates a different subspace of the residual stream.

I imagine that conventional transformers kind of force this. If you train a transformer such that it needs to learn the ability to do tasks like “Repeat the following words: apple banana cat” then the model is sort of forced to internally propagate the input far enough along to be able to perform the task. But maybe if you pre-trained from scratch with an architecture where later layers get direct access to earlier layers and/or the raw input, then the model wouldn’t need to propagate information.

Or maybe it would all fall apart and something would go wrong with the gradients.

krackers•1mo ago
Apparently a new paper from DS shows this is not the case, or rather that the information isn't captured with as much fidelity as you'd expect. Intuitively, the residual stream apparently doesn't have enough dimensions to allow each layer to carve out its own subspace [1].

>And this makes it hard for layers to explore new features that are beneficial for just a few layers because you need to revert or overwrite those features as they will not be useful for later layers.

This is because, with a residual stream architecture, removing features can't be done by simply zeroing out a weight; instead you have to calculate the inverse.

>This leads each layer to contribute "generally useful" features and one immediate pattern is continuously refining features. I think this is the reason why later layers in LLMs tend to behave like that.

Greatly increasing the number of "channels" of the residual stream helps, however (although you have to play some tricks to preserve the useful "identity mapping" behavior) [2, 3].

[1] https://x.com/rosinality/status/2006902561727721670

[2] https://x.com/norxornor/status/2006649194690257285#m

[3] https://x.com/byebyescaling/status/2007147288809087281#
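A speculative sketch of what that could look like (my own illustration, not the linked work): keep a wider residual stream, let each block read and write through projections, and zero-initialize the write projection so every block starts out as an identity mapping.

    import torch
    import torch.nn as nn

    class WideResidualBlock(nn.Module):
        def __init__(self, wide_dim: int = 1024, block_dim: int = 256):
            super().__init__()
            self.read = nn.Linear(wide_dim, block_dim)    # down-project into the block
            self.body = nn.TransformerEncoderLayer(d_model=block_dim, nhead=8, batch_first=True)
            self.write = nn.Linear(block_dim, wide_dim)   # up-project back into the stream
            nn.init.zeros_(self.write.weight)             # start as a pure skip connection
            nn.init.zeros_(self.write.bias)

        def forward(self, wide_h: torch.Tensor) -> torch.Tensor:
            return wide_h + self.write(self.body(self.read(wide_h)))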

yorwba•1mo ago
This architecture does not allow later layers to directly query KV data from earlier layers. Each iteration of the loop uses the same layer parameters, so the KV data in later layers may well end up being the same, but only if the model stops changing it in response to other tokens in the context. Which is also something a traditional multi-layer transformer could do. (But might not end up doing due to lack of corresponding inductive bias.)

None of this helps with the strawberry problem, where the very first layer already gets a tokenized representation, so there is no layer that "actually perceives those Rs."

cainxinth•1mo ago
Is it fair to say that the “Rs in strawberry problem” will not be “cleanly” solved unless we advance beyond tokenization?
idiotsecant•1mo ago
I think tokenization is probably not going anywhere, but higher layers need the ability to inspect 'raw' data on demand. You don't spell out most words as you read them, but you can bring the focus of your entire mind to the spelling of the word strawberry if you so choose. Models need that ability as well.
hippo22•1mo ago
Couldn’t this be solved by replacing the tokenized input with a model that outputs the tokenization and then training the entire thing as one larger model? The goal would be to make tokenization a function of the model input.
nl•1mo ago
I don't see why that follows.

The “Rs in strawberry problem” is literally "count the token R" in the word "strawberry".

One could argue that the learnt tokenization model, where the word is split into 3 tokens (see https://platform.openai.com/tokenizer), is problematic, but one solution to that is to double down on it and learn tokenization as part of the end-to-end training instead of separately.

If you mean the problem is that the current tokenization model is entirely fixed, then I agree.

(I'm not entirely sure how multi-modal models function in this regard - they must have some idea of the bytestream - but I'm not familiar enough with that to comment intelligently.)
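One possible construction of "learn tokenization end to end", sketched below with made-up components (not any cited paper's approach): embed raw bytes and let a strided convolution merge them into coarser latent tokens that the transformer consumes, with everything trained jointly.

    import torch
    import torch.nn as nn

    class LearnedTokenizer(nn.Module):
        def __init__(self, dim: int = 256, downsample: int = 4):
            super().__init__()
            self.byte_embed = nn.Embedding(256, dim)
            # Strided conv merges every `downsample` bytes into one latent "token".
            self.pool = nn.Conv1d(dim, dim, kernel_size=downsample, stride=downsample)

        def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:   # (batch, n_bytes)
            h = self.byte_embed(byte_ids).transpose(1, 2)            # (batch, dim, n_bytes)
            return self.pool(h).transpose(1, 2)                      # (batch, n_tokens, dim)

    # bytes_in = torch.tensor([[ord(c) for c in "strawberry!!"]])    # 12 bytes -> 3 latent tokens
    # latent_tokens = LearnedTokenizer()(bytes_in)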

ImHereToVote•1mo ago
I can't instinctively process how many R's are in STRAWBERRY. I use my vision to get it, though, almost immediately.

I feel simple transformers simply don't get access to those modalities that a human would use. I can't use my "talking" centers to count letters in words either.

You just need to pay attention to notice that you don't use your language skills to count letters in words.

andy12_•1mo ago
I remember doing this kind of test in a vanilla transformer trained on my laptop on a small text dataset. I basically added N^3 attention where each layer could pay attention to previous layers. It didn't improve anything and was much slower.

It's hard to say whether something scales from a couple dozen million parameters to an actual billion-parameter model, but I have the impression that the nature of the residual stream and its high dimensionality allows any layer to access information from previous layers if the transformer needs it.

test123asdfasdf•1mo ago
Isn't that just a higher-dimensional neural net, i.e. connections along more axes?
krackers•1mo ago
Maybe slightly related: canon layers provide direct horizontal information flow along residual streams. See this paper, which claims precisely that LLMs struggle with horizontal information flow, as "looking back a token" is fairly expensive since it can only be done via encoding in the residual stream and the attention layers.

https://openreview.net/pdf?id=kxv0M6I7Ud
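A loose sketch of direct horizontal information flow in that spirit (my illustration; the linked paper's canon layers may differ in the details): mix each position with its few preceding positions via a causal depthwise convolution added to the residual stream.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HorizontalMix(nn.Module):
        def __init__(self, dim: int = 256, window: int = 3):
            super().__init__()
            self.window = window
            # Depthwise conv over the sequence: each channel mixes the current
            # position with the previous `window - 1` positions.
            self.conv = nn.Conv1d(dim, dim, kernel_size=window, groups=dim)

        def forward(self, h: torch.Tensor) -> torch.Tensor:   # (batch, seq, dim)
            x = h.transpose(1, 2)                             # (batch, dim, seq)
            x = F.pad(x, (self.window - 1, 0))                # left-pad only => causal
            return h + self.conv(x).transpose(1, 2)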

mlpro•1mo ago
Lol. Trying to copy the Universal Weight Subspace paper's naming to get famous.
numbers_guy•1mo ago
I'm confused about ARC-AGI. I thought the point of it was that you train a foundational model. Then you test it against ARC-AGI to figure out how well it reasons. Here and in some of the other reasoning papers, they are training on ARC-AGI. How much sense does that make in practice?
whiplash451•1mo ago
ARC-AGI allows (and encourages) training on their training set. Their evaluation setup is rigorous enough to avoid leaking between training and testing (public and private).