frontpage.

Cowork: Claude Code for the rest of your work

https://claude.com/blog/cowork-research-preview
503•adocomplete•4h ago•259 comments

TimeCapsuleLLM: LLM trained only on data from 1800-1875

https://github.com/haykgrigo3/TimeCapsuleLLM
439•admp•7h ago•186 comments

Fabrice Bellard's TS Zip (2024)

https://www.bellard.org/ts_zip/
79•everlier•3h ago•27 comments

The chess bot on Delta Air Lines will destroy you (2024) [video]

https://www.youtube.com/watch?v=c0mLhHDcY3I
123•cjaackie•3h ago•65 comments

Postal Arbitrage

https://walzr.com/postal-arbitrage
223•The28thDuck•6h ago•111 comments

Unauthenticated remote code execution in OpenCode

https://cy.md/opencode-rce/
197•CyberShadow•1d ago•45 comments

Date is out, Temporal is in

https://piccalil.li/blog/date-is-out-and-temporal-is-in/
285•alexanderameye•8h ago•89 comments

LLVM: The bad parts

https://www.npopov.com/2026/01/11/LLVM-The-bad-parts.html
264•vitaut•9h ago•52 comments

F2 (YC S25) Is Hiring

https://www.ycombinator.com/companies/f2/jobs/cJsc7Fe-product-designer
1•arctech•1h ago

Show HN: AI in SolidWorks

https://www.trylad.com
110•WillNickols•6h ago•54 comments

Floppy disks turn out to be the greatest TV remote for kids

https://blog.smartere.dk/2026/01/floppy-disks-the-best-tv-remote-for-kids/
470•mchro•10h ago•276 comments

Show HN: Agent-of-empires: OpenCode and Claude Code session manager

https://github.com/njbrake/agent-of-empires
47•river_otter•9h ago•12 comments

Perlsecret – Perl secret operators and constants

https://metacpan.org/dist/perlsecret/view/lib/perlsecret.pod
49•mjs•6d ago•8 comments

'I rarely get outside': scientists ditch fieldwork in the age of AI

https://www.nature.com/articles/d41586-025-04150-w
11•Growtika•4d ago•3 comments

What old tennis players teach us (2017)

https://www.raphkoster.com/2017/09/22/31098/
27•surprisetalk•4d ago•17 comments

Message Queues: A Simple Guide with Analogies (2024)

https://www.cloudamqp.com/blog/message-queues-exaplined-with-analogies.html
69•byt3h3ad•6h ago•20 comments

GitHub: A case study in link maintenance and 404 pages (2013)

https://chrismorgan.info/blog/github-links-case-study/
9•roryokane•5d ago•1 comment

Apple picks Google's Gemini to power Siri

https://www.cnbc.com/2026/01/12/apple-google-ai-siri-gemini.html
593•stygiansonic•8h ago•331 comments

Non-Essential French Embassy Staff Have Left Iran

https://www.barrons.com/news/non-essential-french-embassy-staff-have-left-iran-sources-d84d1f51
19•mhb•47m ago•4 comments

Anthropic made a mistake in cutting off third-party clients

https://archaeologist.dev/artifacts/anthropic
198•codesparkle•12h ago•167 comments

Show HN: Fall asleep by watching JavaScript load

https://github.com/sarusso/bedtime
41•sarusso•5h ago•14 comments

Superhuman AI exfiltrates emails

https://www.promptarmor.com/resources/superhuman-ai-exfiltrates-emails
29•takira•5h ago•3 comments

Building a 25 Gbit/s workstation for the SCION Association

https://github.com/scionassociation/blog-25gbit-workstation
61•romshark•7h ago•23 comments

Ansible battle tested hardening for Linux, SSH, Nginx, MySQL

https://github.com/dev-sec/ansible-collection-hardening
41•walterbell•5d ago•10 comments

Ai, Japanese chimpanzee who counted and painted dies at 49

https://www.bbc.com/news/articles/cj9r3zl2ywyo
168•reconnecting•14h ago•57 comments

Zen-C: Write like a high-level language, run like C

https://github.com/z-libs/Zen-C
147•simonpure•10h ago•90 comments

Reproducing DeepSeek's MHC: When Residual Connections Explode

https://taylorkolasinski.com/notes/mhc-reproduction/
96•taykolasinski•9h ago•29 comments

Launch a Debugging Terminal into GitHub Actions

https://blog.gripdev.xyz/2026/01/10/actions-terminal-on-failure-for-debugging/
127•martinpeck•11h ago•53 comments

Personal thoughts/notes from working on Zootopia 2

https://blog.yiningkarlli.com/2025/12/zootopia-2.html
290•pantalaimon•5d ago•62 comments

Computers that used to be human

https://digitalseams.com/blog/computers-that-used-to-be-human
53•bobbiechen•8h ago•10 comments

Fabrice Bellard's TS Zip (2024)

https://www.bellard.org/ts_zip/
77•everlier•3h ago

Comments

publicdebates•2h ago
Bellard finally working with his true colleague.
dmitrygr•2h ago
"compressed size" does not seem to include the size of the model and the code to run it. According to the rules of Large Text Compression Benchmark, total size of those must be counted, otherwise a 0-byte "compressed" file with a decompressor containing the plaintext would win.
underdeserver•2h ago
Technically correct, but a better benchmark would be a known compressor with an unknown set of inputs (that come from a real-world population, e.g. coherent English text).
paufernandez•2h ago
Yeah, but the xz binary is also not counted in the byte total... Here the "program" is the LLM, much like your brain remembers things by encoding them compressed and then reconstructing them. It is a different type of compression: compression by "understanding", which requires the whole corpus of possible inputs in some representation. The comparison is not fair to classical algorithms, yet that's how you can compress a lot more (given a particular language): by having a model of it.
wrs•2h ago
“Compressors are ranked by the compressed size of enwik9 (10^9 bytes) plus the size of a zip archive containing the decompresser.” [0]

[0] https://www.mattmahoney.net/dc/text.html

FartyMcFarter•2h ago
True for competitions, but if your compression algorithm is general purpose then this matters less (within reason - no one wants to lug around a 1TB compression program).
MisterTea•2h ago
This is something I have been curious about: how does an LLM achieve compression?

I would like to know what deviations show up in the output, as this almost feels like a game of telephone where each re-compression loses data that is then incorrectly reconstructed. Sort of like misremembering a story: as you retell it over time, the details change slightly.

Scaevolus•2h ago
When LLMs predict the next token, they actually produce a probability distribution over all possible next tokens, and the sampler chooses one of them, not necessarily the most likely one!

If instead you take the actual next token of the text you want to compress and encode its position in that predicted cumulative distribution (a number in [0, 1]) using arithmetic coding, you can run the same operation in reverse to achieve lossless compression.

The tricky part is ensuring that your LLM executes absolutely deterministically, because you need to make sure that the encoder and decoder have the same probability distribution map at each step.
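A minimal sketch of the scheme described above (not ts_zip's actual code): exact arithmetic coding over a deterministic next-symbol model, using Fractions so encode and decode are exactly invertible. The names (model, interval, encode, decode) and the two-letter alphabet are illustrative stand-ins, with a toy predictor in place of the LLM.

    # Sketch only: exact arithmetic coding driven by a deterministic
    # next-symbol model. `model` is a toy stand-in for the LLM.
    from fractions import Fraction

    ALPHABET = "ab"

    def model(prefix):
        # The real scheme would return the LLM's probabilities for the
        # next token given `prefix`; this toy favors repeating symbols.
        if prefix and prefix[-1] == "a":
            return {"a": Fraction(3, 4), "b": Fraction(1, 4)}
        return {"a": Fraction(1, 4), "b": Fraction(3, 4)}

    def interval(prefix, symbol):
        # Sub-interval of [0, 1) assigned to `symbol` after `prefix`.
        dist = model(prefix)
        lo = Fraction(0)
        for s in ALPHABET:
            if s == symbol:
                return lo, lo + dist[s]
            lo += dist[s]
        raise KeyError(symbol)

    def encode(text):
        low, high = Fraction(0), Fraction(1)
        for i, sym in enumerate(text):
            s_lo, s_hi = interval(text[:i], sym)
            low, high = low + (high - low) * s_lo, low + (high - low) * s_hi
        return low  # any number in [low, high) identifies `text`

    def decode(code, length):
        out, low, high = "", Fraction(0), Fraction(1)
        for _ in range(length):
            for sym in ALPHABET:
                s_lo, s_hi = interval(out, sym)
                n_lo = low + (high - low) * s_lo
                n_hi = low + (high - low) * s_hi
                if n_lo <= code < n_hi:
                    out, low, high = out + sym, n_lo, n_hi
                    break
        return out

    msg = "aaabbbab"
    assert decode(encode(msg), len(msg)) == msg

A real implementation swaps the toy model for the language model's output distribution and has to make that evaluation bit-exact on both sides, which is exactly the determinism problem mentioned above.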

AnotherGoodName•49m ago
Yes. The secret is in understanding arithmetic coding. https://en.wikipedia.org/wiki/Arithmetic_coding

Arithmetic coding takes a prediction of the next bit and writes out exactly as many bits as needed to correct that prediction. The amazing part is that you can write out fractional bits. E.g. you predict the next bit is '1' with 75% probability? If it is indeed 1, you only need to write out about 0.42 bits (-log2(0.75)). If it's 0, you need to write out 2 bits (-log2(0.25)). It may seem strange to work with a fraction of a bit, but it works: the fractional remainder is carried forward into future corrections. You might have heard of Huffman coding, which can't deal with fractional bits; arithmetic coding is a generalization of Huffman coding that can.

Arithmetic coding is mathematically optimal at what it does. Given a prediction of the data, you will waste essentially no bits encoding it.

So modern compression techniques don't really compete on the encoding/decoding side at all. What they compete on is modelling the data seen so far and making the most accurate prediction possible for the next bit (working a byte at a time also works, but one bit at a time is easier to comprehend when learning arithmetic coding).

Incidentally, the encoder and decoder work in essentially the same way: given the data read or decoded so far, predict the next bit. That part is identical on both sides. The decoder reads the correction from the compressed file, while the encoder reads the input file and writes the correction out. The important part is "predict the next bit". This is what separates all the different compressors.

This is also where those of us experienced in this area try to correct people on the wrong track. A compression algorithm is never about the encoding side; it is always about prediction of the data. Can you build a model that accurately predicts the next piece of data? That's what you need to do to make a better file compressor. The entropy coding part is a completely solved problem; don't bother re-solving it.
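A quick check of those numbers (the -log2 cost is the standard formula; the snippet is only illustrative):

    # Cost of encoding a symbol predicted with probability p: -log2(p) bits.
    from math import log2

    print(-log2(0.75))   # ~0.415 bits when the 75% prediction is right
    print(-log2(0.25))   # exactly 2 bits when it is wrong
    # Expected cost per input bit: 0.75 * 0.415 + 0.25 * 2 ~ 0.81 bits,
    # the entropy of the predicted distribution.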

wewewedxfgdf•2h ago
>> The ts_zip utility can compress (and hopefully decompress) text files

Hopefully :-)

hamandcheese•1h ago
Reading data is overrated. I highly recommend S4:

http://www.supersimplestorageservice.com/

benatkin•2h ago
I propose the name tokables for the compressed data produced by this. A play on tokens and how wild it is.
fancyswimtime•1h ago
please pass the tokables to the left hand side
shawnz•2h ago
Another fun application of combining LLMs with arithmetic coding is steganography. Here's a project I worked on a while back which effectively uses the opposite technique of what's being done here, to construct a steganographic transformation: https://github.com/shawnz/textcoder
meisel•2h ago
Looks like it beats everything in the large text compression benchmark for enwik8, but loses to several programs for enwik9. I wonder why that is.
AnotherGoodName•1h ago
It's actually not the best at enwik8 or 9.

The results at https://www.mattmahoney.net/dc/text.html explicitly add the size of the compressor itself to the result. Note the "enwik9+prog" column. That's what it's ranked on.

The reason to do this is that it's trivial to create a compressor that 'compresses' a file to 0 bytes: just ship an executable that embeds enwik9 and writes it out given any input. So we always measure what is effectively the Kolmogorov complexity: the program plus data as a whole that produces the result we want.

So those results add in the compressor size. The programs there generally have no dictionary built in, or, in the case of LLM-based compressors, no pre-trained weights. They effectively build the model as they process the data, compressing very little at the start and getting better and better as they go. This is why these programs do better with larger data sets: they start with zero knowledge, and after a GB or so they have very good knowledge of the corpus of human language.

This program, however, is pre-trained and ships with a model that is about 150MB. That means it has roughly 150MB of extra starting knowledge over the entries in that list. The top entries there are the better compressors; they'll quickly out-learn and overtake this one, they just don't have that head start.

Of course, to measure fairly, that ~150MB program size should be added to the results when doing a comparison.

srcreigh•1h ago
As an aside, I wonder how to account for the information content embedded in the hardware itself.

A Turing Machine compressor program would likely have more bytes than the amd64 binary. So how do you evaluate KolmogorovComplexity(amd64)?

The laws of physics somehow need to be accounted for too, probably.

d_burfoot•52m ago
Kolmogorov complexity is only defined up to a constant, which represents the length of a translator between Turing machines.
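For reference, the standard statement of that fact (the invariance theorem), in the usual notation rather than anything specific to this thread:

    % For universal machines U and V there is a constant c_{U,V}
    % (roughly, the length of a translator from V-programs to U-programs),
    % independent of x, such that
    K_U(x) \le K_V(x) + c_{U,V} \quad \text{for every string } x.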
rurban•1h ago
So he did finally beat his own leading program from 2019, nncp.
egl2020•1h ago
When Jeff Dean gets stuck, he asks Bellard for help...
SnowProblem•1h ago
I love this because it gets to the heart of information theory. Shannon's foundational insight was that information is surprise. A random sequence is incompressible by definition. But what counts as surprise depends on context, and for text, we know a large amount of it is predictable slop. I suspect there's a lot of room to go further with this style of compression. For example, maybe you could store an upfront summary that makes prediction more accurate. Or perhaps you could encode larger sequences, or use some kind of hierarchical encoding. But this is great.
bambax•1h ago
Yes! Information is surprise, and that's why a measure of intelligence is the ability to predict.
oxag3n•1h ago
Compression and intelligence reminded me of the https://www.hutter1.net/prize

I encountered it more than 10 years ago, and it felt novel that compression is related to intelligence and even AGI.

omoikane•1h ago
Current leader of the Large Text Compression Benchmark is NNCP (compression using neural networks), also by Fabrice Bellard:

https://bellard.org/nncp/

Also, nncp-2024-06-05.tar.gz is just 1180969 bytes, unlike ts_zip-2024-03-02.tar.gz (159228453 bytes, which is bigger than uncompressed enwik8).

gmuslera•1h ago
Reminded me of the pi filesystem (https://github.com/philipl/pifs): with enough digits of pi precalculated, you might be able to build a decent compression program. The question is how many digits you'd reasonably need, and whether that is smaller or bigger than the trained LLM.
GuB-42•19m ago
I suspect that the length of the offset of your input data in pi is about equal to the length of the input data itself, plus or minus a few bytes at most, regardless of the size of the input: on average, a specific n-digit string first appears somewhere around position 10^n, so writing down the offset takes about as many digits as the data it points to.

That is: no compression, but it won't make things worse either.

Unless the input data is the digits of pi, obviously, or the result of some computation involving pi.
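A rough sanity check of that intuition, assuming mpmath is installed; the digit counts and the script itself are only illustrative:

    # Finding a random k-digit string in pi typically needs an offset of
    # roughly k digits to write down, so storing the offset saves nothing.
    import random
    from mpmath import mp

    mp.dps = 100_002              # work with ~100k decimal digits of pi
    pi_digits = str(+mp.pi)[2:]   # "3.1415..." -> just the digits after "3."

    for k in (3, 4, 5):
        target = "".join(random.choice("0123456789") for _ in range(k))
        pos = pi_digits.find(target)
        print(k, target, pos if pos >= 0 else "not in the first 100k digits")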

jokoon•48m ago
so barely 2 or 3 times better than xz

not really worth it