
The only U.S. particle collider shuts down

https://www.sciencenews.org/article/particle-collider-shuts-down-brookhaven
1•rolph•2m ago•0 comments

Ask HN: Why do purchased B2B email lists still have such poor deliverability?

1•solarisos•2m ago•0 comments

Show HN: Remotion directory (videos and prompts)

https://www.remotion.directory/
1•rokbenko•4m ago•0 comments

Portable C Compiler

https://en.wikipedia.org/wiki/Portable_C_Compiler
1•guerrilla•6m ago•0 comments

Show HN: Kokki – A "Dual-Core" System Prompt to Reduce LLM Hallucinations

1•Ginsabo•7m ago•0 comments

Software Engineering Transformation 2026

https://mfranc.com/blog/ai-2026/
1•michal-franc•8m ago•0 comments

Microsoft purges Win11 printer drivers, devices on borrowed time

https://www.tomshardware.com/peripherals/printers/microsoft-stops-distrubitng-legacy-v3-and-v4-pr...
2•rolph•8m ago•0 comments

Lunch with the FT: Tarek Mansour

https://www.ft.com/content/a4cebf4c-c26c-48bb-82c8-5701d8256282
2•hhs•12m ago•0 comments

Old Mexico and her lost provinces (1883)

https://www.gutenberg.org/cache/epub/77881/pg77881-images.html
1•petethomas•15m ago•0 comments

'AI' is a dick move, redux

https://www.baldurbjarnason.com/notes/2026/note-on-debating-llm-fans/
2•cratermoon•16m ago•0 comments

The source code was the moat. But not anymore

https://philipotoole.com/the-source-code-was-the-moat-no-longer/
1•otoolep•16m ago•0 comments

Does anyone else feel like their inbox has become their job?

1•cfata•16m ago•0 comments

An AI model that can read and diagnose a brain MRI in seconds

https://www.michiganmedicine.org/health-lab/ai-model-can-read-and-diagnose-brain-mri-seconds
2•hhs•20m ago•0 comments

Dev with 5 years of experience switched to Rails, what should I be careful about?

1•vampiregrey•22m ago•0 comments

AlphaFace: High Fidelity and Real-Time Face Swapper Robust to Facial Pose

https://arxiv.org/abs/2601.16429
1•PaulHoule•23m ago•0 comments

Scientists discover “levitating” time crystals that you can hold in your hand

https://www.nyu.edu/about/news-publications/news/2026/february/scientists-discover--levitating--t...
2•hhs•25m ago•0 comments

Rammstein – Deutschland (C64 Cover, Real SID, 8-bit – 2019) [video]

https://www.youtube.com/watch?v=3VReIuv1GFo
1•erickhill•26m ago•0 comments

Tell HN: Yet Another Round of Zendesk Spam

2•Philpax•26m ago•0 comments

Postgres Message Queue (PGMQ)

https://github.com/pgmq/pgmq
1•Lwrless•29m ago•0 comments

Show HN: Django-rclone: Database and media backups for Django, powered by rclone

https://github.com/kjnez/django-rclone
1•cui•32m ago•1 comments

NY lawmakers proposed statewide data center moratorium

https://www.niagara-gazette.com/news/local_news/ny-lawmakers-proposed-statewide-data-center-morat...
1•geox•34m ago•0 comments

OpenClaw AI chatbots are running amok – these scientists are listening in

https://www.nature.com/articles/d41586-026-00370-w
3•EA-3167•34m ago•0 comments

Show HN: AI agent forgets user preferences every session. This fixes it

https://www.pref0.com/
6•fliellerjulian•36m ago•0 comments

Introduce the Vouch/Denouncement Contribution Model

https://github.com/ghostty-org/ghostty/pull/10559
2•DustinEchoes•38m ago•0 comments

Show HN: SSHcode – Always-On Claude Code/OpenCode over Tailscale and Hetzner

https://github.com/sultanvaliyev/sshcode
1•sultanvaliyev•38m ago•0 comments

Microsoft appointed a quality czar. He has no direct reports and no budget

https://jpcaparas.medium.com/microsoft-appointed-a-quality-czar-he-has-no-direct-reports-and-no-b...
2•RickJWagner•40m ago•0 comments

Multi-agent coordination on Claude Code: 8 production pain points and patterns

https://gist.github.com/sigalovskinick/6cc1cef061f76b7edd198e0ebc863397
1•nikolasi•41m ago•0 comments

Washington Post CEO Will Lewis Steps Down After Stormy Tenure

https://www.nytimes.com/2026/02/07/technology/washington-post-will-lewis.html
14•jbegley•41m ago•3 comments

DevXT – Building the Future with AI That Acts

https://devxt.com
2•superpecmuscles•42m ago•4 comments

A Minimal OpenClaw Built with the OpenCode SDK

https://github.com/CefBoud/MonClaw
1•cefboud•42m ago•0 comments

LLM-Deflate: Extracting LLMs into Datasets

https://www.scalarlm.com/blog/llm-deflate-extracting-llms-into-datasets/
77•gdiamos•4mo ago

Comments

gdiamos•4mo ago
What open source code do you use to pull synthetic data from LLMs?
dvh•4mo ago
Remember those programming books that were like "1000+1 tips for C++"? Making those with LLMs would be trivial now.
chmod775•4mo ago
I wonder how many cycles of train->extract->train->extract->... you can do before most of your output will be hallucinations.
gdiamos•4mo ago
It would be an expensive experiment to perform.

I wonder how you could do it more efficiently?

apwell23•4mo ago
> This compression is lossy

Is the compression really lossy? What is an example of lost knowledge?

sqeaky•4mo ago
Think about all the times an LLM gets it wrong: the fact that would have helped it get it right is something that was lost. I suppose this isn't proof it's lossy; maybe we just don't know how to get the data out.

Or look at it another way: LLMs are just text prediction machines, and whatever information doesn't help them predict the next token, or conflicts with the likelihood of the next token, is something that gets dropped.

Or look at it another way: these things are often trained on many terabytes of the internet, yet even a 200-billion-parameter network is 100 or 200 GB in size. So something is missing, and that is a far better compression ratio than the best known algorithms for lossless compression.

Or we can look at it yet another way: these things were never built to be lossless compression systems. We can tell by looking at how they are implemented that they don't retain everything they're trained on; they extract a bunch of statistics.
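A quick back-of-envelope sketch of that size argument, with purely illustrative numbers (the corpus and model sizes below are assumptions, not measurements):

    training_data_gb = 10_000   # assumed corpus size: ~10 TB of text
    weights_gb = 200            # assumed weight size of a ~100B-parameter model

    model_ratio = training_data_gb / weights_gb   # ~50:1
    lossless_ratio = 4                            # rough ratio for general-purpose text compressors

    print(f"weights vs. corpus: ~{model_ratio:.0f}:1")               # far beyond typical
    print(f"typical lossless text compression: ~{lossless_ratio}:1") # lossless ratios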

visarga•4mo ago
I think extraction from the model itself is a bad idea. But extraction from external sources, such as the deep research reports LLMs generate, or solving problems where we have validation of correctness is a good idea. The model is not validating its outputs by simply doing another inference, but consults external sources or gets feedback from code execution. Humans in chat rooms could also provide lots of learning signal, especially when actions are judged against the outcomes they cause down the line, using hindsight.

So in short, what works is a model plus a way to tell its good outputs from its bad ones.

kiicia•4mo ago
It's exactly the same as JPEG images being lossy: while you can see the image as a whole (and that is enough for 99% of people), you are obviously missing some details.

And the more you rely on those details (professional photography, scientific data), the more obvious it becomes (to the point of the image being useless in some cases).

Same with LLMs: we are currently testing how far we can go before we see obvious issues.

apwell23•4mo ago
What's an example of loss?
aewens•4mo ago
Lossy compression vs lossless compression is the difference of whether you can get a 1:1 copy of the original data if you compress and then decompress it.

A simple example of this is if you have 4 bits of data and a compression algorithm that turns it into 2 bits of data. If your dataset only contains 0000, 0011, 1100, and 1111, then this can technically be considered lossless compression, because we can always reconstruct the exact original data (e.g. 0011 compresses to 01 and can decompress back to 0011, 1100 compresses to 10 and can decompress back to 1100, etc.). However, if our dataset later included 1101 and it got compressed to 10, this is now "lossy" because it would decompress to 1100; that last bit was "lost".

An LLM is lossy compression because it lacks the capacity to 1:1 replicate all its input data 100% of the time. It can get quite close in some cases, sure, but it is not perfect every time. So it is considered “lossy”.
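A minimal sketch of the 4-bit example above (the codebook is hypothetical, just to make the round trip concrete):

    # Toy codebook: lossless as long as only these four words ever occur.
    encode = {"0000": "00", "0011": "01", "1100": "10", "1111": "11"}
    decode = {code: word for word, code in encode.items()}

    for word in ["0000", "0011", "1100", "1111"]:
        assert decode[encode[word]] == word      # exact round trip: lossless

    # Once 1101 shows up it has no code of its own; squeezing it into 1100's
    # code makes the scheme lossy: the round trip no longer returns the input.
    print(decode["10"])   # -> "1100", the last bit of 1101 was "lost"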

fxj•4mo ago
How well can you recreate an image that is described only by words? Obviously not bit by bit and pixel by pixel. You get something that resembles the original, but not an exact copy.
apwell23•4mo ago
You can recreate the original exactly with the right prompt.
sfink•4mo ago
Yes. For example, you could always say "give me a jpeg image file that is encoded as the bytes 255, 216, 255, 224, 0, 16, 74, ...". But that's just pointing out that the input to your "LLM" function includes the prompt. It's f(model, prompt) = response.

It's not straightforward to prove that models have to be lossy. Sure, the training data is much larger than the model, but there is a huge amount of redundancy in the training data. You have to compare a hypothetically optimal compression of the training data to the size of the model to prove that it must be lossy. And yet, it's intuitively obvious that even the best lossless compression (measured in Kolmogorov complexity) of the training data is going to be vastly larger than the biggest models we have today.

You can always construct toy examples where this isn't the case. For example, you could just store all of the training data in your model and train another part of the model to read it out. But that's not an LLM anymore. Similarly, you could make an LLM out of synthetic redundant data and it could achieve perfect recall. (Unless you're clever with how you generate it, though, any off-the-shelf compression algorithm is likely to produce something much, much smaller.)

kiicia•4mo ago
No, the simplest example that this is not possible in practice is in heraldry, where the same blazon (the description) yields different emblazons (the depictions), depending on who creates the crest and where.

The crest always stays true to the description, but the details are always different.

Now that I think about it, heraldry is a practical way to describe how generative algorithms work.

esafak•4mo ago
https://en.wikipedia.org/wiki/Lossy_compression
fxj•4mo ago
It is at least as lossy as JPEG compression. Details get lost and artifacts are generated.
shawntan•4mo ago
Not sure if you mean in general, but I'll answer both branches of the question.

In general: depending on the method of compression, you can have lossy or lossless compression. Using 7zip on a bunch of text files can losslessly compress that data. Briefly, you calculate the statistics of the data you want to compress (the dictionary), and then make the commonly recurring chunks describable with fewer bits (the encoding). The compressed file basically contains the dictionary and the encoding.

For LLMs: there are ways to use an LLM (or any statistical model of text) to compress text data, but the techniques use a similar setup to the above, with a dictionary and an encoding, the LLM playing the role of the dictionary. When "extracting" data from the dictionary alone, you're basically sampling from the dictionary's distribution.

Quantitatively, the "loss" in "lossy" being described is literally the number of bits used for the encoding.

I wrote a brief description here of techniques from an undergrad CS course that can be used: https://blog.wtf.sg/posts/2023-06-05-yes-its-just-doing-comp...
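As a rough sketch of that setup (the toy next-token model below is made up; a real system would use an LLM's probabilities plus an arithmetic coder): an ideal entropy coder spends about -log2 p bits on a token the model assigns probability p, so better prediction means fewer bits.

    import math

    def toy_next_token_prob(context, token):
        # Hypothetical stand-in for an LLM over a 4-word vocabulary:
        # strongly expects "dog" after "the", otherwise uniform.
        if context and context[-1] == "the":
            return 0.7 if token == "dog" else 0.1
        return 0.25

    def ideal_code_length_bits(tokens):
        total = 0.0
        for i, tok in enumerate(tokens):
            p = toy_next_token_prob(tokens[:i], tok)
            total += -math.log2(p)        # bits an ideal coder would spend here
        return total

    print(ideal_code_length_bits(["the", "dog", "ran"]))   # ~4.5 bits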

gdiamos•4mo ago
LLMs have finite entropy (it is related to their training loss) and training typically doesn’t store the residuals.

Some compression methods use LLMs internally and also store the residuals, making them lossless.

gmerc•4mo ago
The laws of thermodynamics would require it to be lossy.
omneity•4mo ago
My gripe with an approach like this is the lack of any grounding for the generated topics. Hallucination accumulates like error in this case, since every generation is conditioned on a previous one (the recursive "hierarchical topic exploration" in TFA).

I suspect most of the "leaves" are unusable.
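One way to see the worry, under a deliberately simplified assumption that each expansion step hallucinates independently with probability p and a leaf is only usable if its whole path is clean:

    p = 0.05   # assumed per-step hallucination rate (illustrative)
    for depth in (1, 3, 5, 10):
        usable = (1 - p) ** depth
        print(f"depth {depth:2d}: ~{usable:.0%} of leaves have no hallucinated ancestor")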

fxj•4mo ago
The question is: is it like JPEG compression, where the errors do not accumulate but the image converges to a self-inverse compressed image, or does the data set converge to a single point which is meaningless?
rapatel0•4mo ago
The transformation in JPEG (the DCT) is well-defined math. While lossy, most of the information is reproducible.

An LLM is layers and layers of non-linear transformations. It's hard to say exactly how information is accumulated. You can inspect activations for tokens, but it's really not clear how to define what the function is actually doing. Therefore the error is poorly understood.

ashf023•4mo ago
JPEG is similar, actually. The DCT is invertible, but the result of the DCT is quantized, which is where some of the compression happens (DCT -> quantization -> IDCT), so the end-to-end process is not truly invertible. Maybe that's an analogy to the non-linearities between the linear steps in deep learning.
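A minimal 1-D sketch of that pipeline (not real JPEG, just DCT, coefficient quantization, inverse DCT) showing both the information loss and the near-idempotence:

    import numpy as np
    from scipy.fft import dct, idct

    def jpeg_like(block, q=10.0):
        coeffs = dct(block, norm="ortho")
        quantized = np.round(coeffs / q) * q      # the lossy step
        return idct(quantized, norm="ortho")

    x = np.random.default_rng(0).uniform(0, 255, size=8)
    y = jpeg_like(x)
    print(np.allclose(x, y))               # False: detail was lost
    print(np.allclose(jpeg_like(y), y))    # True: repeating the step changes (almost) nothing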
gdiamos•4mo ago
I think it would be interesting to deflate out to a huge dataset and see where this happens.

Certainly it will occur as the generated data exceeds the original, e.g. after 1-10T tokens.

I think you could also do this faster by moving down the tree in a depth-first manner.

Typically I use this for knowledge transfer, style transfer, catastrophic forgetting mitigation, etc., and so I don't go very far. I usually manually review the data samples before using them.

estimator7292•4mo ago
Huh. I wonder what good output would look like at extremes. Hallucinations that just happen to be true or something more interesting?
gmerc•4mo ago
Not different for inference... Just saying.
imranq•4mo ago
The claims in this paper don't make sense. There is no proof that anything has been decompressed
dragonwriter•4mo ago
“Decompression” is a metaphor, not a fact claim to be proved; it is a description of an approach to generating a dataset from an LLM where most of the potential utility is still fairly explicitly speculative, a jumping off point for further work.
gmerc•4mo ago
Nope http://arxiv.org/abs/2509.11208
gmerc•4mo ago
About that http://arxiv.org/abs/2509.11208
moktonar•4mo ago
Wouldn’t this method be good if applied to humans in job interviews?
fxj•4mo ago
how long would it take to do a complete memory dump of your brain by voice stream? days? months? years?

this is more like writing one's autobiography.

gdiamos•4mo ago
There are some fun early theoretical ML papers on this topic.

They prove that it is possible to fully clone a brain based on this method.

I think one could theoretically estimate how many queries you would need to make to do it. The worst case is proportional to the number of parameters of the model, i.e. at least 10^15 for a human. At one minute per spoken sample, that comes out to about 2 billion years to clone one human.
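The arithmetic behind that estimate, for what it's worth:

    queries = 10**15                 # worst case: one query per parameter
    minutes_per_query = 1
    years = queries * minutes_per_query / (60 * 24 * 365)
    print(f"{years:.1e} years")      # ~1.9e9, i.e. roughly 2 billion years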

I suspect it is not practical without advancements in neural link to increase the bandwidth by billions of times.

I personally like playing around with empirical methods like this blog post to understand the practical efficiency of our learning algorithms like back prop on transformers.

I also try not to invest too much effort into this topic given the ethical issues.

moktonar•4mo ago
I was thinking about selective knowledge exploration to see if the candidate is fit for the offered position. No need to dump everything
dragonwriter•4mo ago
> Wouldn’t this method be good if applied on humans in job interviews?

Uhm, no? I mean, some firms do abuse job interviews to pump candidates for usable information, and some have gotten a notable bad reputation for that which impacts their funnel of candidates, but from the article: “Generating comprehensive datasets requires thousands of model calls per topic”—you aren’t going to get a candidate to hang around for that...

moktonar•4mo ago
That would be evil. No, I was thinking more about selective knowledge exploration to see whether a candidate is a fit for the position.
benob•4mo ago
My understanding was that the Alpaca data was a distillation from text-davinci-003
fxj•4mo ago
Learning == Compression of information.

It can be a description with a shorter bit length; think Shannon entropy as the measure of information content. The information is still in the weights, but it is reorganized: the reconstructed sentences (or lists of tokens) will not reproduce the exact same bits, yet the information is still there.
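A tiny sketch of the Shannon-entropy view (character-level, just for illustration): the average bits per symbol an ideal code needs is H = -Σ p·log2 p, and anything that uses fewer bits per symbol than that is necessarily dropping information.

    import math
    from collections import Counter

    def entropy_bits_per_symbol(text):
        counts = Counter(text)
        n = len(text)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    print(entropy_bits_per_symbol("abracadabra"))   # ~2.04 bits per character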

shawntan•4mo ago
The compression is lossy.