http://math.uchicago.edu/~shmuel/Network-course-readings/Mar...
(there's a copyright 2007 at the bottom of the linked page, which isn't explicitly "published in 2007" in my mind)
It feels like Bishop's Pattern Recognition but with a clearer tone (and a different field, of course)
The PGM book is also structured very clearly for researchers in PGMs. The book is laid out in 3 major sections: the models, inference techniques (the bulk of the book), and learning. Which means, if you follow the logic of the book, you basically have to work through 1000+ pages of content before you can actually start running even toy versions of these models. But if you do need to get into the nitty-gritty of particular inference algorithms, I don't believe there is another textbook with nearly that level of scope and detail.
Bishop's section on PGMs from Pattern Recognition and Machine Learning is probably a better place to start learning about these more advanced models, and if you become very interested then Koller+Friedman will be an invaluable text.
It's worth noting that the PGM course taught by Koller was one of the initial, and still very excellent, Coursera courses. I'm not sure if it's still free, but it was a nice way to get a deep dive into the topic in a reasonably short time frame (I do remember those homeworks as brutal though!)[0].
0. https://www.coursera.org/specializations/probabilistic-graph...
The data never fits the graph. Real-world tables are messy and full of hidden junk, so you either spend weeks arguing over structure or give up the nice causal story.
DL stole the mind-share. A transformer is a one-liner with a mature tooling stack; hard to argue with that when deadlines loom.
That said, they’re not completely dead. Reportedly there's Microsoft’s TrueSkill (Xbox ranking), a bunch of Google ops/diagnosis pipelines, and some IBM Watson healthcare diagnosis tools built on Infer.NET.
Anyone here actually shipped a PGM that beat a neural baseline? Would really love to hear your war stories.
Kind of like flow-based programming. I don't think there's any fundamental reason why it can't work; it just hasn't yet.
Could you link me to where I could learn more about this?
"Causality: Models, Reasoning and Inference", https://a.co/d/6b3TKhQ, is the technical and researcher audience book.
By the way, does anyone know which model or type of model was used in winning gold in IMO?
Might be a reference to this[1] blog post which was posted here[2] a year ago.
There has also been some academic work linking the two, like this[3] paper.
[1]: https://elijahpotter.dev/articles/markov_chains_are_the_orig...
It's not an unreasonable view, at least for decoder-only LLMs (which is what most popular LLMs are). While it may seem they violate the Markov property since they clearly do make use of their history, in practice that entire history is summarized in an embedding passed into the decoder. I.e. just like a Markov chain, their entire history is compressed into a single point, which leaves them conditionally independent of their past given their present state.
It's worth noting that this claim is NOT generally applicable to LLMs since both encoder/decoder and encoder-only LLMs do violate the Markov property and therefore cannot be properly considered Markov chains in a meaningful way.
But running inference on a decoder-only model is, at a high enough level of abstraction, conceptually the same as running a Markov chain (on steroids).
Physics models of closed systems moving under classical mechanics are deterministic, continuous Markov processes. Random walks on a graph are non-deterministic, discrete Markov processes.
You may further generalize that if a process has state X, and the prior N states contribute to predicting the next state, you can make a new process whose state is an N-vector of Xs, and the graph connecting those states reduces the evolution of the system to a random walk on a graph, and thus a Markov process.
Thus any system where the best possible model of its evolution requires you to examine at most finitely many consecutive states immediately preceding the current state is a Markov process.
For example, an LLM that will process a finite context window of tokens and then emit a weighted random token is most definitely a Markov process.
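The construction above (folding the last N observations into a single tuple-valued state) can be sketched in a few lines of Python. Function names here are mine, purely for illustration: an order-2 process becomes an ordinary first-order Markov chain once the state is a pair.

```python
import random

def make_first_order(seq, order=2):
    """Collapse an order-N process into a first-order Markov chain by
    taking the state to be the tuple of the last N observations."""
    transitions = {}
    for i in range(len(seq) - order):
        state = tuple(seq[i:i + order])   # the N-vector of recent states
        nxt = seq[i + order]
        transitions.setdefault(state, []).append(nxt)
    return transitions

def step(transitions, state):
    """One transition: the next state depends only on the current tuple."""
    nxt = random.choice(transitions[state])
    return state[1:] + (nxt,)

chain = make_first_order("abcabcabd")
print(chain[("a", "b")])  # symbols observed after 'ab': ['c', 'c', 'd']
```

Sampling from `chain` with `step` is then just a random walk on the graph of tuples, exactly as described above.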
Define a square of some known size (1x1 should be fine, I think)
Inscribe a circle inside the square
Generate random points inside the square
Look at how many fall inside the square but not the circle, versus the ones that do fall in the circle.
From that, using what you know about the area of the square and circle respectively, the ratio of "inside square but not in circle" and "inside circle" points can be used to set up an equation for the value of pi.
Somebody who's more familiar with this than me can probably fix the details I got wrong, but I think that's the general spirit of it.
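The steps above can be sketched as a minimal Python program. I've centered the square at the origin and used a radius-1 circle (so the square is 2x2), which is equivalent to the 1x1 setup; the area ratio circle/square is pi/4 either way.

```python
import random

def estimate_pi(n=100_000):
    """Monte Carlo pi: sample points uniformly in the square [-1, 1]^2
    and count how many land inside the inscribed unit circle."""
    inside = 0
    for _ in range(n):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:      # inside the circle
            inside += 1
    # area(circle) / area(square) = pi * r^2 / (2r)^2 = pi / 4
    return 4 * inside / n

print(estimate_pi())  # ~3.14, converging slowly (error shrinks like 1/sqrt(n))
```

As noted downthread, the convergence really is hilariously slow: each extra digit of precision costs roughly 100x more samples.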
For Markov Chains in general, the only thing that jumps to mind for me is generating text for old school IRC bots. :-)
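A toy version of those IRC bots is just a word-level bigram chain: record which words follow which, then walk the chain. A minimal sketch (function names made up):

```python
import random

def build_chain(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def babble(chain, start, length=8):
    """Random-walk the chain to generate IRC-bot-style nonsense."""
    out = [start]
    for _ in range(length):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the log"
print(babble(build_chain(corpus), "the"))
```

Feed it a channel's scrollback instead of this toy corpus and you get the classic markov-bot experience: locally plausible, globally unhinged.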
[1]: which is probably not the point of this essay. Sorry for muddying the waters; I have both concepts kinda 'top of mind' in my head right now after watching the Veritasium video.
[1] https://claude.ai/public/artifacts/1b921a50-897e-4d9e-8cfa-0...
Back in like 9th grade, when Wikipedia did not yet exist (but MathWorld and IRC did) I did this with TI-Basic instead of paying attention in geometry class. It's cool, but converges hilariously slowly. The in versus out formula is basically distance from origin > 1, but you end up double sampling a lot using randomness.
I told a college roommate about it and he basically invented a calculus approach, summing pixels in columns or something, as an optimization. You could probably further optimize by finding upper and lower bounds of the "frontier" of the circle, or iteratively splitting rectangle slices ad infinitum, but that's probably just reinventing the state of the art. And all this skips the cool random sampling the Monte Carlo algorithm uses.
In the sample programs there's a big red one... https://www.dangermouse.net/esoteric/piet/samples.html
There's also the IOCCC classic https://www.ioccc.org/1988/westley/index.html
Monte Carlo Value for Pi
Each successive sequence of six bytes is used as 24 bit X and Y co-ordinates within a square. If the distance of the randomly-generated point is less than the radius of a circle inscribed within the square, the six-byte sequence is considered a “hit”. The percentage of hits can be used to calculate the value of Pi. For very large streams (this approximation converges very slowly), the value will approach the correct value of Pi if the sequence is close to random. A 500000 byte file created by radioactive decay yielded:
Monte Carlo value for Pi is 3.143580574 (error 0.06 percent).
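The described procedure is easy to replicate in Python. This is a sketch under one assumption the description doesn't pin down: I read each 3-byte half of the 6-byte chunk as a big-endian 24-bit coordinate.

```python
import os

def pi_from_bytes(data):
    """Treat successive 6-byte chunks as 24-bit X and Y coordinates in a
    square, and count points inside the inscribed circle (a randomness test:
    the hit ratio approaches pi/4 only if the stream is close to uniform)."""
    r = (1 << 24) / 2          # circle radius = half the square's side
    hits = total = 0
    for i in range(0, len(data) - 5, 6):
        x = int.from_bytes(data[i:i + 3], "big") - r   # center on origin
        y = int.from_bytes(data[i + 3:i + 6], "big") - r
        total += 1
        if x * x + y * y < r * r:
            hits += 1
    return 4 * hits / total

print(pi_from_bytes(os.urandom(600_000)))  # ~3.14 for a good random source
```

Running it on a biased or structured stream instead of `os.urandom` pushes the estimate away from pi, which is exactly what makes this usable as a crude randomness check.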
It's often a matter of asking the right person what technique works. It's often a matter of making a measurement before getting lost in the math. It's often a matter of finding the right paper in the literature.
Is this still possible with the latest models being trained on synthetic data? And if it is possible, what would that one phrase be?
> We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. [0]
In practice nobody is "indiscriminately" using model output to fine-tune models, since that doesn't even make sense. Even if you're harvesting web data generated by LLMs, that data has in fact been curated: its acceptance on whatever platform you found it is itself a form of curation.
There was a very recent paper, "Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)" [1], whose content is pretty well summarized by the title. So long as the data is curated in some way, you are providing more information to the model and the results should improve somewhat.
0. https://www.nature.com/articles/s41586-024-07566-y
1. https://www.arxiv.org/pdf/2507.12856
edit: updated based on cooksnoot's comment
If you just mean its risk has been exaggerated and/or oversimplified, then yeah, you'd have a point.
Having spent quite a bit of time diving into many questionable "research" papers (the original model collapse paper is not actually one of these, it's a solid paper), there's a very common pattern of showing that something does or does not work under special conditions but casually making generalized claims about those results. It's so easy with LLMs to find a way to get the result you want that there are far too many papers out there that people quickly take as fact when the claims are much, much weaker than the papers let on. So I tend to get a bit reactionary when addressing many of these "facts" about LLMs.
But you are correct that with the model collapse paper this is much more the public misunderstanding the claims of the original paper than any fault with that paper itself.
[0]: https://mattmahoney.net/dc/dce.html#Section_421 [1]: https://mattmahoney.net/dc/text.html
For example, https://news.ycombinator.com/item?id=44574033
satvikpendem•20h ago
[0] https://youtu.be/KZeIEiBrT_w
Alifatisk•20h ago
danlitt•14h ago
mapontosevenths•20h ago
reverendsteveii•19h ago
NO SMOKING. NO SPITTING. MGMT
baxtr•18h ago
stronglikedan•17h ago
tbalsam•17h ago
https://youtu.be/hJ-rRXWhElI?si=Zdsj9i_raNLnajzi
fn-mote•15h ago
Is there any way to get a better outcome for the public here, or is “do good stuff then sell out” the way it’s always going to be?
edwardbernays•15h ago
tomrod•10h ago
In the end, incentives matter.
https://en.wikipedia.org/wiki/Crowding_out_(economics)
edwardbernays•9h ago
tomrod•7h ago
Outside software technology: there is a series of papers from Grossman (going back to the 80s!) that analyzes basic versus applied research in a macroeconomic framework. Basic research _can_ be a public good, applied research can be crowded out. Combined with microeconomic research that monopolies can be dynamically efficient (investing in applied and basic R&D, like Bell Labs) and you get several examples and theories that contradict your statement that "there is no private market entity with an incentive to provide research to the public."
Another real world example in hardware that contradicts this claim is the evolution of building control systems. Before the advent of IOT, so, circa 1980s - 2010s, you saw increasing sharing and harmonization of competing electronics standards because it turned out to be more efficient to be modular, not have to re-hire subcontractors at exorbitant rates to maintain or replace components that go haywire, etc.
edwardbernays•6h ago
Economic analysis? Another intelligence product that requires essentially no staff, no actual R&D, no equipment besides computers? Brother, you have to be kidding me.
The hardware thing is just companies evolving to a shared standard.
Do you have even a little bit of a clue how hard it is to do good pharmacological research? Toxicological? Biological? Chemical? Physical? You have mentioned intelligence products with 0 investment cost and 0 risk of failure.
This is perhaps one of the most fart-sniffing tech-centric perspectives I have ever been exposed to. Go read some actual research by actual scientists and come back when you can tell me why, for instance, Eli Lilly would ever make their data or internal R&D public.
Jonas Salk did it. He is an extremely rare exception, and his incentive was public health. Notice that his incentive was markedly not financial.
Market entities with a financial incentive, whose entire business model and success is predicated on their unique R&D results, have 0 incentive to release research to the public.
ricardobeat•1h ago
They were also forced in the 1950s to license all their innovations freely, as compensation for holding a monopoly. Which only strengthens the parent’s point that private institutions have little incentive to work for public benefit.
0xDEAFBEAD•4h ago
aarond0623•16h ago
There was also the Waymo ad and the Rods from the Gods video where he couldn't be bothered to use a guide wire to aim.
QuadmasterXLII•16h ago
deadso•16h ago
keeda•14h ago
aarond0623•13h ago
The second one takes a mathematical model for the path integral for light and portrays it like that's actually what is happening, with plenty of phrases like light "chooses" the path of least action that imply something more is going on. Also, the experiment at the end with the laser pointer is awful. The light we are seeing is scattering from the laser pointer's aperture, not some evidence that light is taking alternate paths.
ChadNauseam•4h ago
Many people said this, but he set up an experiment to test it and the light does turn on instantly as claimed: https://www.youtube.com/watch?v=oI_X2cMHNe0
> The second one takes a mathematical model for the path integral for light and portrays it like that's actually what is happening
I know nothing about this. Is there a more accurate mathematical model available than the one he uses? Otherwise, I think it seems sensible to portray our best mathematical model as "what's really going on". And I didn't get the sense that light was "choosing" anything when watching the video, I got the sense that the amplitudes of all possible paths were cancelling out except for the shortest path (or something along those lines)
ww520•16h ago
teaearlgraycold•16h ago
javier2•15h ago
edwardbernays•15h ago
See: Brian Keating licking Eric Weinstein's jock strap in public and then offering mild criticism on Piers Morgan.
pstuart•13h ago
If transparent enough (and not from an abhorrent source), I'd be ok with his product. He's even allowed to make the occasional mistake as long as he properly owns up to it.
There's been a lot of valuable learning from him and it would be a pity to dismiss it all over a single fumble.
anitil•9h ago
rowanG077•1h ago
alpaca128•1h ago
You can, actually, with a simple rule of thumb: if it's being advertised on YouTube, it's statistically low quality or a scam. The sheer number of brands that sponsor videos only to be exposed later for doing something shady is just too high.
rowanG077•30m ago
jgalt212•10h ago
> I wonder if Markov chains could predict how many times Veritasium changes the thumbnail and title of this video.
anitil•9h ago
jgalt212•8h ago
0xDEAFBEAD•9h ago
Consider the nuclear reaction metaphor. It's clearly not memoryless. Eventually you'll run out of fissile material.
The diagram for that example is bad as well. Do arrows correspond to state transitions? Or do they correspond to forks in the process where one neutron results in two?
mrlongroots•8h ago
I think no real process is memoryless: time passes/machines degrade/human behaviors evolve. It is always an approximation that is found/assumed to hold within the modeled timeframe.
0xDEAFBEAD•7h ago
mrlongroots•4h ago
https://en.wikipedia.org/wiki/Neutron_transport