As a fresh perspective on designing learning systems, nested learning seems very interesting.
https://abehrouz.github.io/files/NL.pdf
Hearing the clarity, creativity, and force behind his thoughts and speech, I'd give better than a 1-in-200 chance that Ali Behrouz gets himself a Turing Award. At the very least, I think he will end up making major contributions to AI.
I didn't, but only because I became personally interested in AI/ML at some point, so I actually had to learn it myself.
As an AI practitioner, I still couldn't explain eigenvectors or singular-value decomposition to you though.
> Responses to the query “Write a metaphor about time” clustered by applying PCA to reduce sentence embeddings to two dimensions. […] The responses form just two primary clusters: a dominant cluster on the left centered on the metaphor “time is a river,” and a smaller cluster on the right revolving around variations of “time is a weaver.”
I just gave Gemini 3 the same prompt and got something quite different:
> Time is a patient wind against the cliff face of memory. It does not strike with a hammer to break us; it simply breathes, grain by grain, until the sharp edges of grief are smoothed into rolling hills, and the names we thought were carved in stone are weathered into soft whispers.
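For reference, here's a minimal sketch of the clustering setup the quote describes. The embedding model, libraries, and cluster count are my assumptions, not the paper's:

```python
# Sketch of the quoted setup (my reading): embed many sampled responses,
# reduce to 2D with PCA, then cluster. Embedding model is assumed.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

responses = [
    "Time is a river that carries us forward.",
    "Time is a weaver, threading moments together.",
    # ... many more samples of "Write a metaphor about time"
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = model.encode(responses)

coords = PCA(n_components=2).fit_transform(embeddings)  # reduce to 2D
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(coords)  # two clusters, per the quote
print(list(zip(labels, responses)))
```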
These days, abstracts are so marketing/advertising-forward that it's hard to even understand the claim.
Scene_Cast2•2mo ago
About the Superposition paper - this is close to what I've been thinking about over the past week. I'm thinking that concepts or choices in a "superposition" are harder for a fully-differentiable neural net to reason about. For example, if there's a "green" vs "purple" choice to be made, it can't fully commit to either (especially if they're 50-50) and has to reason about both simultaneously (difficult given the nonlinear manifold space). Discretizing to tokens (a non-differentiable argmax) forces a choice, and that lets the model reason about a single concept separately and more easily.
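A toy version of what I mean (the vectors and probabilities are invented for illustration):

```python
# Before argmax, the state is a weighted blend of both concept vectors;
# after argmax, downstream computation sees exactly one concept.
import numpy as np

green = np.array([1.0, 0.0])
purple = np.array([0.0, 1.0])
probs = np.array([0.5, 0.5])  # the 50-50 case

soft_state = probs[0] * green + probs[1] * purple  # differentiable blend: [0.5, 0.5]
hard_state = [green, purple][np.argmax(probs)]     # non-differentiable commit: one concept

print(soft_state, hard_state)
```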
energy123•2mo ago
If we use a random number generator, then we converge to 100% correct answers under pass@n in the limit.
A random number generator will eventually match or outperform every model (for large enough n) whenever top-p is less than 1. The other models will most likely have some bias that makes certain correct CoTs mathematically impossible: the required tokens are too improbable and get filtered out by top-p. Those models therefore asymptote below 100%, while the RNG reaches 100% almost surely.
Under this paper's logic, doesn't that mean the random number generator is a superior reasoner?
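A toy calculation of the point (the per-sample probabilities are invented): once top-p zeroes out a token the correct CoT needs, pass@n plateaus, while even a vanishingly unlikely random sampler reaches 1.

```python
# If top-p filtering assigns zero probability to a required token,
# the correct CoT is unreachable and pass@n stays at 0; a uniform
# random sampler with tiny but nonzero success probability -> 1.
p_rng = 1e-6    # tiny per-sample success probability, but nonzero (invented)
p_model = 0.0   # required token cut by top-p: correct CoT unreachable

def pass_at_n(p, n):
    return 1 - (1 - p) ** n  # P(at least one of n samples succeeds)

for n in [10**3, 10**6, 10**9]:
    print(n, pass_at_n(p_rng, n), pass_at_n(p_model, n))
```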
robrenaud•2mo ago
Also, in practice, models don't have that much semantic entropy for a given prompt. With temperature-based sampling, models tend to generate very similar, though not identical, responses.
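A quick illustration of that, with invented logits: a peaked next-token distribution keeps samples near-identical even at moderate temperature.

```python
# Temperature sampling over a peaked distribution: nearly every draw
# picks the same token, so sampled responses stay similar.
import numpy as np

logits = np.array([5.0, 2.0, 1.0, 0.5])  # one dominant continuation (invented)
temperature = 0.7
rng = np.random.default_rng(0)

p = np.exp(logits / temperature)
p /= p.sum()
samples = rng.choice(len(logits), size=20, p=p)
print(samples)  # almost all 0s: low semantic entropy despite sampling
```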
tipsytoad•2mo ago
On the unstructured outputs, where you can't just ratchet up the pass@k until it's almost random, it switches the base model out for an instruct model, and in the worst case, on LiveCodeBench, it uses a qwen-r1-distill as a _base_ model (!?), which is an instruct model further fine-tuned on R1's reasoning traces. I assume that's because no matter how high the pass@k, a base model won't output correct Python.
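For context, this is the standard unbiased pass@k estimator (from OpenAI's HumanEval paper), which is what these numbers usually mean: sample n completions, count c correct, and estimate the chance that at least one of k is correct.

```python
# Standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=3, k=10))  # e.g. 100 samples, 3 correct, pass@10
```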
mountainriver•2mo ago
I believe NVIDIA's ProRL showed otherwise, right?