
LLMZip: Lossless Text Compression Using Large Language Models

https://arxiv.org/abs/2306.04050
1•jfantl•6h ago

Comments

hamsic•5h ago
"Lossless" does not mean that the LLM can accurately reconstruct human-written sentences. Rather, it means that the LLM generates a fully reproducible bitstream based on its own predicted probability distribution.

Reconstructing human-written sentences accurately is impossible because it requires modeling the "true source"—the human brain state (memory, emotion, etc.)—rather than the LLM itself.

Instead, a practical approach is to reconstruct the LLM output itself based on seeds or to store it in a compressible probabilistic structure.

DoctorOetker•3h ago
It's unclear what you claim lossless compression does or doesn't do, especially since you tie in storing an RNG's seed value at the end of your comment.

"LLMZip: Lossless Text Compression Using Large Language Models"

The title implies they use the LLM's next-token probability distribution to build a likelihood-sorted list of tokens. The higher the actual next token from the input stream (human-generated or not) ranks in that list, the fewer bits are needed to encode its position counting from the top. So the better the LLM predicts the true probability of the next token, the better it will compress human-generated text in general.

Do you deny LLMs can be used this way for lossless compression?

Such a system can accurately reconstruct the uncompressed original input text (say generated by a human) from its compressed form.
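The rank-coding scheme described above can be sketched minimally. Here a fixed unigram frequency table stands in for the LLM's next-token distribution (a hypothetical toy stand-in, not the paper's actual model); the encoder emits each token's rank in the probability-sorted vocabulary, and the decoder, using the same model, inverts the mapping exactly:

```python
# Toy rank-based lossless coder: a static unigram table plays the role
# of the LLM's next-token distribution. Encoder and decoder share the
# model, so the round trip is exact.
from collections import Counter

def ranked_vocab(freqs):
    # Tokens sorted by descending frequency; ties broken alphabetically
    # so encoder and decoder agree on the same ordering.
    return sorted(freqs, key=lambda t: (-freqs[t], t))

def encode(tokens, freqs):
    rank = {tok: i for i, tok in enumerate(ranked_vocab(freqs))}
    return [rank[t] for t in tokens]  # small ranks = likely tokens

def decode(ranks, freqs):
    order = ranked_vocab(freqs)
    return [order[r] for r in ranks]

text = "the cat sat on the mat".split()
freqs = Counter("the cat sat on the mat the end".split())
ranks = encode(text, freqs)
assert decode(ranks, freqs) == text  # lossless round trip
```

With a real LLM the ranking is recomputed at every position from the preceding context, so likely continuations land near rank 0, and an entropy coder over the rank stream is what actually yields the compression; the static table here just keeps the lossless round trip visible.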

hamsic•2h ago
Sure, a model-based coder can losslessly compress any token stream. I just meant that for human-written text, the model’s prediction diverges from how the text was actually produced — so the compression is formally lossless, but not semantically faithful or efficient.
DoctorOetker•3h ago
This is from 2023 (not a complaint, just observing that the result might be stale and even lower upper bounds may have been achieved).

It's quite curious to consider the connection between compression and intelligence. It's hard to quantify comprehension, i.e. how do you tell whether a system effectively comprehends some data? Lossless compression rates are very attractive here, since the task is not to lose data but to squeeze it as close as possible to its information content.

It does raise other questions though: which corpus is considered representative? A base model without finetuning might be more vulgar but also more effective at compressing the comparatively vulgar human corpus. The corpus implicitly expressed by an RLHF'd, pretty-prompted chatbot, however, will be very good at compressing its own outputs but less good at compressing the actual vile human corpus. Both the base model and the aligned model will still be relatively good at compressing each other's output, but each will excel at compressing its own implicit corpus.

Another question: as the bits-per-character upper bound falls monotonically, it will suffer diminishing returns. How does one square that with the proposal that lossless compression corresponds to intelligence? The correspondence would clearly not be linear, and it suggests one would need an exponentially larger corpus to beat the prior compression rates.

How long can it write before repeating itself?

====

It also raises lots of societal questions: at less than 1 bit per character, how many characters are in Library Genesis / Anna's Archive etc.?

In Search of the Simpsonville Massacre

https://www.nytimes.com/2025/11/04/science/archaeology-simpsonville-civil-war.html
1•quapster•1m ago•0 comments

Show HN: PivotHire – Project delivery service as easy as e-commerce platforms

https://www.pivothire.tech/
1•CLCKKKKK•6m ago•0 comments

Antarctic glacier saw the fastest retreat in history; trouble for sea levels

https://www.cnn.com/2025/11/03/climate/antarctic-glacier-hektoria-rapid-melt-sea-level
1•benkan•17m ago•0 comments

Ultra-HD televisions not noticeably better for typical viewer, scientists say

https://www.theguardian.com/technology/2025/oct/27/ultra-hd-televisions-4k-8k-not-noticeably-bett...
1•benkan•19m ago•0 comments

Ask HN: How to detect Google Play review completion via API?

1•orangepush•20m ago•0 comments

Sprout by Edera: UEFI Bootloader in Rust

https://github.com/edera-dev/sprout
1•nar001•21m ago•0 comments

Unofficial Microsoft Teams Client for Linux

https://github.com/IsmaelMartinez/teams-for-linux
1•basemi•23m ago•0 comments

The Joys and Importance of Reading

https://fromthethornveld.co.za/the-joys-and-importance-of-reading/
2•Gigamouse•24m ago•0 comments

What If Java Had Symmetric Converter Methods on Collection?

https://donraab.medium.com/what-if-java-had-symmetric-converter-methods-on-collection-cbb824885c3f
1•xkriva11•27m ago•0 comments

Show HN: Fixing a VSCode performance issue present since the first commit

https://github.com/microsoft/vscode/pull/274994
1•anticensor•27m ago•0 comments

300 Most Cited Research Books of All Time? (2018)

https://websitefinder.wordpress.com/2018/10/30/the-300-most-cited-research-books-of-all-time/
1•salkahfi•33m ago•0 comments

We launched deposit insurance in Korea–for foreign renters

https://foreignerhome.com/en/insurance/deposit
3•FOHO•33m ago•3 comments

How Airbus Took Off

https://worksinprogress.co/issue/how-airbus-took-off/
1•LaFolle•35m ago•0 comments

Cosmic version 1 to hit desktops on December 11

https://www.theregister.com/2025/11/03/cosmic_1_before_xmas/
2•gostsamo•37m ago•0 comments

Design for AI

https://www.thesigma.co/human-computer-interaction/human-ai-interaction
1•ameeromidvar•38m ago•0 comments

Lessons from 70 interviews on deploying AI Agents in production

https://mmc.vc/research/state-of-agentic-ai-founders-edition/
3•advikipedia•38m ago•1 comments

Reverse Engineering a Neural Network's Clever Solution to Binary Addition (2023)

https://cprimozic.net/blog/reverse-engineering-a-small-neural-network/
1•Ameo•42m ago•0 comments

There is no such thing as conscious artificial intelligence – Nature

https://www.nature.com/articles/s41599-025-05868-8
1•air7•44m ago•1 comments

Tell HN: The biggest startup event in Switzerland is this week, Nov 6-7

https://www.startup-nights.ch/
1•sschueller•46m ago•0 comments

Ask HN: How opiniated Is HN?

2•janikvonrotz•48m ago•3 comments

Show HN: Neustream – Multistream to all platforms from one place

https://neustream.app
1•thefarseen•54m ago•0 comments

F1 Hotlap Daily

https://www.hotlapdaily.com/
3•paprika_chan•56m ago•0 comments

Indian university's 'indigenous' PostgreSQL distribution

https://shaktidb.iitmpravartak.net/
1•tachyons•57m ago•0 comments

The AI Localhost

https://getairbook.notion.site/The-AI-Localhost-2a1d4a82803d802a8753ffbcfa985664?source=copy_link
1•Hoshang07•1h ago•1 comments

Why isn't there a universal + standard VoIP/data SMS-like message protocol?

1•abcqwerty9876•1h ago•0 comments

Europe's Decentralized Messaging Survives "Chat Control" Threat

https://www.process-one.net/blog/decentralized-messaging-survives-chat-control-threat/
1•neustradamus•1h ago•0 comments

My Experience as a SDE Intern at AWS

https://simho.xyz/blogs/my-experience-as-a-sde-intern-at-aws/
1•simho•1h ago•0 comments

Samsung Teams with Nvidia, New AI Megafactory

https://news.samsung.com/global/samsung-teams-with-nvidia-to-lead-the-transformation-of-global-in...
1•tmikaeld•1h ago•0 comments

What happens at the Planck Length? [video]

https://www.youtube.com/watch?v=f3jhbui5Cqs
1•o4c•1h ago•0 comments

Show HN: Built a soundboard for my haunted garage with AI assistance

https://theworstofboth.com/hauntedgarage2025
1•JoeOfTexas•1h ago•0 comments