Om Malik has died

https://om.co/2026/06/24/1966-2026/
689•minimaxir•8h ago•66 comments

An entire Herculaneum scroll has been read for the first time

https://scrollprize.org/firstscroll
1114•verditelabs•13h ago•235 comments

Libre Barcode Project

https://graphicore.github.io/librebarcode/
46•luu•2h ago•1 comment

Framework's 10G Ethernet module exposes USB-C's complexity

https://www.jeffgeerling.com/blog/2026/framework-10g-ethernet-module-usb-c-complexity/
99•Alupis•4h ago•36 comments

Apple to skip high-end M6 Mac chips in favor of AI-focused M7 line

https://www.bloomberg.com/news/articles/2026-06-25/apple-to-skip-high-end-m6-mac-chips-to-launch-...
172•scrlk•11h ago•133 comments

What happened after 2k people tried to hack my AI assistant

https://www.fernandoi.cl/posts/hackmyclaw/
35•cuchoi•2h ago•5 comments

The 'papers, please' era of the internet will decimate your privacy

https://expression.fire.org/p/the-papers-please-era-of-the-internet
542•bilsbie•7h ago•244 comments

Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB

65•spidy__•2d ago•31 comments

The Garbage Collection Handbook: The Art of Automatic Memory Management (2nd Ed) (2023)

https://gchandbook.org/
86•teleforce•6h ago•12 comments

Oxide computer 3D rack guided tour

https://explorer.oxide.computer/
335•darthcloud•3d ago•128 comments

A game where you're an OS and have to manage processes, memory and I/O events

https://github.com/plbrault/youre-the-os
141•exploraz•2d ago•26 comments

Un-0: Generating Images with Coupled Oscillators

https://unconv.ai/blog/introducing-un-0-generating-images-with-coupled-oscillators/
130•babelfish•8h ago•32 comments

IBM debuts sub-1 nanometer chip technology

https://newsroom.ibm.com/2026-06-25-ibm-debuts-worlds-first-sub-1-nanometer-chip-technology
289•porridgeraisin•13h ago•159 comments

Show HN: OpenKnowledge – open source AI-first alternative to Obsidian/Notion

https://github.com/inkeep/open-knowledge
235•engomez•13h ago•112 comments

Doing a masters while working in Spain

https://jan-herlyn.com/blog/doing-a-masters-while-working/
13•MHard•3d ago•1 comment

Eyewitness at the Triangle (1911)

http://trianglefire.ilr.cornell.edu/index.html
16•NaOH•3d ago•1 comment

Show HN: Chess-Inspired Roguelike

https://princechazz.com
248•cowboy_henk•4d ago•82 comments

An oral history of Bank Python (2021)

https://calpaterson.com/bank-python.html
91•tosh•9h ago•27 comments

The Doorman's Fallacy in action

https://rozumem.xyz/posts/17
85•rozumem•9h ago•113 comments

Parallel Parentheses Matching

https://williamdue.github.io/blog/parallel-parentheses-matching
73•Athas•9h ago•9 comments

OS9Map

https://yllan.org/software/OS9Map/
208•LaSombra•14h ago•39 comments

Zig's new bitCast semantics and LLVM back end improvements

https://ziglang.org/devlog/2026/#2026-06-25
223•kouosi•14h ago•104 comments

Apple raises prices of MacBooks, iPads

https://www.reuters.com/world/asia-pacific/apple-raises-prices-macbooks-ipads-memory-costs-skyroc...
673•virgildotcodes•16h ago•965 comments

Experiments in Sports Seismology for the World Cup

https://pnsn.org/blog/experiments-in-sports-seismology-for-the-world-cup
17•jmward01•4d ago•0 comments

Record type inference for dummies

http://haskellforall.com/2026/06/record-type-inference-for-dummies
27•g0xA52A2A•2d ago•0 comments

The last Romans are still around

https://signoregalilei.com/2026/06/20/the-last-romans-are-still-around/
61•surprisetalk•3d ago•79 comments

A data race that doesn't compile

https://corentin-core.github.io/posts/ruxe-type-level-disjointness/
23•stmw•3h ago•6 comments

Besimple AI (YC P25) Is Hiring

https://www.ycombinator.com/companies/besimple-ai/jobs/yWfhhOR-strategic-projects-lead-audio-data
1•yzhong94•12h ago

You can't unit test for taste

https://dev.karltryggvason.com/you-cant-unit-test-for-taste/
261•kalli•1d ago•118 comments

Hey Nico, you didn't vibe code your data room but stole it from Papermark

https://twitter.com/mfts0/status/2070080422482977095
239•mmunj•16h ago•95 comments

Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB

64•spidy__•2d ago
I built an experiment that uses an overfitted transformer and arithmetic coding to compress individual files.

Instead of training the model to generalize, I train a 900KB transformer to memorize a single file and predict the next byte. Those predictions are fed into an arithmetic coder to produce the compressed output.
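
The pipeline described above (a model predicts the next byte, an arithmetic coder turns those predictions into bits) can be sketched in a few lines. This is a minimal sketch, not the repo's implementation: `AdaptiveByteModel` is a hypothetical order-0 frequency counter standing in for the 900KB transformer, and exact `Fraction` arithmetic replaces the fixed-precision integer coder a real implementation would use, so it is only practical for tiny inputs.

```python
from fractions import Fraction
from math import ceil


class AdaptiveByteModel:
    """Hypothetical stand-in for the transformer: adaptive byte frequencies."""

    def __init__(self):
        self.counts = [1] * 256  # start uniform so every byte has nonzero probability
        self.total = 256

    def interval(self, symbol):
        """Cumulative probability range [lo, hi) currently assigned to `symbol`."""
        lo = sum(self.counts[:symbol])
        return Fraction(lo, self.total), Fraction(lo + self.counts[symbol], self.total)

    def find(self, point):
        """Return the symbol whose range contains `point` (0 <= point < 1)."""
        target = point * self.total
        cum = 0
        for symbol, count in enumerate(self.counts):
            if cum + count > target:
                return symbol
            cum += count
        raise ValueError("point outside [0, 1)")

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def encode(data: bytes):
    """Narrow [low, low + width) once per byte, then name a point inside it."""
    model = AdaptiveByteModel()
    low, width = Fraction(0), Fraction(1)
    for byte in data:
        lo, hi = model.interval(byte)
        low, width = low + width * lo, width * (hi - lo)
        model.update(byte)
    nbits = 0
    while Fraction(1, 2 ** nbits) > width:  # need 2**-nbits <= width
        nbits += 1
    code = ceil(low * 2 ** nbits)  # code / 2**nbits lies in [low, low + width)
    return code, nbits


def decode(code: int, nbits: int, length: int) -> bytes:
    """Replay the same model updates the encoder made to recover the bytes."""
    model = AdaptiveByteModel()
    value = Fraction(code, 2 ** nbits)
    low, width = Fraction(0), Fraction(1)
    out = bytearray()
    for _ in range(length):
        symbol = model.find((value - low) / width)
        lo, hi = model.interval(symbol)
        low, width = low + width * lo, width * (hi - lo)
        model.update(symbol)
        out.append(symbol)
    return bytes(out)


message = b"abracadabra" * 5
code, nbits = encode(message)
print(nbits, "bits compressed vs", 8 * len(message), "bits raw")
```

The better the model's predictions, the wider each symbol's interval and the fewer bits the final point needs, which is why the overfitting happens in the model rather than in the coder. The decoder has to rebuild exactly the same predictions, so the model (here implicit, in the post a 900KB file) must ship with the output.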

On a 100MB NYC taxi CSV, it compresses to about 7MB (~0.5 bits/byte). On a 100MB slice of enwik9, it compresses to about 21MB (~1.68 bits/byte).
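
For readers checking the bits/byte figures, the conversion is just eight times the size ratio. A minimal sketch, assuming decimal megabytes and the rounded sizes quoted in the post (the helper name is illustrative):

```python
def bits_per_byte(compressed_bytes: int, original_bytes: int) -> float:
    """Average number of output bits spent per input byte."""
    return 8 * compressed_bytes / original_bytes


MB = 1_000_000
print(round(bits_per_byte(7 * MB, 100 * MB), 2))   # 0.56
print(round(bits_per_byte(21 * MB, 100 * MB), 2))  # 1.68
```

An exact 7MB output works out to 0.56 bits/byte, so the "~0.5" in the post presumably reflects rounding or an output slightly under 7MB.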

It's pretty slow right now (roughly 20–30 minutes of training and 45 minutes each for compression and decompression on my AMD 7800XT).

Check out the repo - https://github.com/samyak112/pym-particles

Comments

7373737373•2d ago
What does it compress the full 1GB file to? http://prize.hutter1.net/
spidy__•2d ago
I tried it on a 100 MB slice of enwik9 and was able to compress it to 20 MB, plus the 900 KB transformer, so about 21 MB.

I know the top submission was able to get it to 13 MB.

Still trying some ideas to get better compression.

gravypod•1h ago
Since you know the size of the file beforehand you may be able to overfit some kind of text diffusion model instead of a transformer? May allow you to partially correct the model output using some other method and then fill in the blanks that were wrong from previous generations.
purple-leafy•7h ago
Thanks for the link!
cellular•1h ago
Maybe everyone should compress the 1st 100MB worth of digits of pi, for an apples-to-apples comparison?

Edit: oh wait, that's too easy. Need to generate/publish random digits so everyone can use them.

saulpw•1h ago
random digits aren't compressible though?
SV_BubbleTime•1h ago
Random digits are compressible though.

Random data does not mean it does not match a pattern in your dictionary for example.

gnabgib•1h ago
No.. they're not. Do you understand random (the apparent or actual lack of definite patterns or predictability[0]) or compression (reduces bits by identifying and eliminating statistical redundancy[1])?

[0]: https://en.wikipedia.org/wiki/Randomness

[1]: https://en.wikipedia.org/wiki/Data_compression
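
The disagreement above is easy to check empirically. A minimal sketch using Python's standard `zlib` (a general-purpose DEFLATE compressor, not the transformer from the post); the sample size and the seed are arbitrary choices:

```python
import os
import random
import zlib

n = 100_000
random_bytes = os.urandom(n)              # uniformly random bytes
repetitive = b"abcdefghij" * (n // 10)    # obvious statistical redundancy
rng = random.Random(0)
ascii_digits = "".join(rng.choices("0123456789", k=n)).encode()


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (below 1.0 means it shrank)."""
    return len(zlib.compress(data, 9)) / len(data)


print("random bytes:", round(ratio(random_bytes), 3))
print("repetitive:  ", round(ratio(repetitive), 3))
print("ASCII digits:", round(ratio(ascii_digits), 3))
```

Uniformly random bytes do not shrink. Random decimal digits stored as ASCII text do shrink, but only because each 8-bit character carries about 3.32 bits of information (log2 of 10); that is redundancy in the encoding, not a pattern in the digits.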

thin_carapace•53m ago
By this definition, a random dataset could present no apparent patterns while still containing non-apparent ones.
ufocia•
purple-leafy•1d ago
That’s so awesome! I want to try something similar. I’ve been going crazy with compression work. I reckon I can beat that prize link
spidy__•21h ago
Really?? So have you published something so far? Can I read something? Sounds like you've got some interesting ideas.
purple-leafy•10h ago
I will be showcasing something on hackernews soon! Basically I found a way to “compress” a multiplayer game state from ~100KB+ to ~1KB

But it’s only for the game I’m building and it’s not pure compression work, I had to do some tricky things

purple-leafy•7h ago
And just for comparison, my absolute best compression method managed to get down to 10s of KB, but the real unlock got to the ~1KB figures. Note these numbers are ALL post-compression numbers. This is not raw data vs compressed data. The ~100KB figure IS POST COMPRESSION.

For context these numbers are for a grid based game where players can perform 4 actions per second, and the numbers I’m sharing are for 30 minutes of gameplay with anywhere from 2-1024+ players (human players) playing simultaneously

So if you do the math, my compression feat is effectively ~99% compression against the naive best case. Compared to the raw data it’s even higher: I haven’t done the math exactly, but the raw data is another factor of 10 greater than ~100KB, so the “compression” versus raw data is ~99.9%

It sounds absolutely bullshit I know :D

But I will be posting a blog post soon once I release the game.

I put compression in quotes because it’s not a pure compression feat; the 99%+ figure comes from being clever about what actually requires compression to achieve the same outcome

tae0086•1d ago
Neat approach. Since the 900KB model ships with the compressed file, is there a file size below which the model overhead just eats the gains? Curious where the crossover is.
spidy__•20h ago
For the model overhead to become significant enough to eat into the gains, the file size would need to be fairly small, right? I assumed nobody would use this for compressing anything below 100 MB.

I tested with 100 MB files because anything larger takes a long time to evaluate. The actual target was at least 1 GB, and in that case I would use a 100 MB model (Shannon entropy rules).

I also tried it on a 100 MB Photoshop file and was able to compress it down to 45 MB, whereas ZIP could only get it down to 60 MB. So yeah still not losing gains.
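
The crossover tae0086 asks about can be estimated with a back-of-the-envelope formula: the model pays for itself once `model_bytes + model_ratio * N < baseline_ratio * N`, i.e. `N > model_bytes / (baseline_ratio - model_ratio)`. A rough sketch, assuming both ratios stay constant as the file shrinks (they will not in practice) and treating the 45 MB Photoshop result as payload only, which the thread does not confirm:

```python
def crossover_bytes(model_bytes: float, model_ratio: float, baseline_ratio: float) -> float:
    """File size above which 'model + payload' beats the baseline compressor.

    model_ratio and baseline_ratio are compressed/original size ratios,
    assumed (unrealistically) to be independent of file size.
    """
    if model_ratio >= baseline_ratio:
        return float("inf")  # the model never catches up
    return model_bytes / (baseline_ratio - model_ratio)


# Figures from the thread: 900 KB model, 100 MB -> 45 MB, ZIP 100 MB -> 60 MB.
estimate = crossover_bytes(900_000, 0.45, 0.60)
print(f"crossover at roughly {estimate / 1e6:.1f} MB")
```

Under those assumptions the break-even point lands in the single-digit megabytes, consistent with the reply that the overhead only matters for fairly small files.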

userbinator•1h ago
Fabrice Bellard may have been the first to do this, 7 years ago: https://news.ycombinator.com/item?id=27244004
SubiculumCode•1h ago
What do those compress to with conventional approaches? For comparison.

I am curious. A classic machine learning ensemble approach is to overfit a collection of small models then bag them (e.g. voting) allowing the models to generalize.

I'm sure someone's tried to overfit a bunch of transformers for compression like this, then bag them to see how well it does?

gwern•28m ago
Ensembling is not compute- or parameter-efficient, so compression per se is a terrible application. (This is related to why people train ever-larger LLMs, like one 10T-parameter LLM, rather than 100 GPT-3-scale LLMs.)
wildstrawberry•1h ago
Three questions:

1. How much was AI used to generate documentation for this project?

2. The 100MB CSV data sources are not provided in the repo so it doesn't seem possible to reproduce your results. The enwik9 dataset says it is a "slice" of the larger data set, and there are many NYC taxi trip record datasets that exist. Can you provide the datasets used to generate your results?

3. I am surprised to see performance comparisons only between your transformer and WinZIP. What were your results when comparing your transformer to more modern approaches like LZMA2 (level 9), BZIP2 and ZPAQ (max effort)?

rtpg•47m ago
I've had this idea of building a codec that would similarly overfit to specific images. But the codec itself would not be a fixed size transformer... instead you could just mess around with the sizing to get better quality/smaller size.

So the codec would be something like: <header describing image size + transformer layer shape> <transformer data itself>

I've seen experiments where people have a "fixed" pipeline but I think having something more dynamic would work quite well.
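
The container layout rtpg describes (a header with the image size and the layer shapes, followed by the weights) could be serialized along these lines. A minimal sketch only: the magic bytes, field widths, and function names are all hypothetical, and the weights are treated as an opaque byte string.

```python
import struct

MAGIC = b"OFIT"            # hypothetical file signature
HEADER_FMT = "<4sIIH"      # magic, width, height, number of layers
SHAPE_FMT = "<II"          # rows, cols for one layer


def pack_codec(width, height, layer_shapes, weights: bytes) -> bytes:
    """Header (image size + layer shapes) followed by the raw weight bytes."""
    blob = struct.pack(HEADER_FMT, MAGIC, width, height, len(layer_shapes))
    for rows, cols in layer_shapes:
        blob += struct.pack(SHAPE_FMT, rows, cols)
    return blob + weights


def unpack_codec(blob: bytes):
    """Inverse of pack_codec: returns (width, height, layer_shapes, weights)."""
    magic, width, height, n_layers = struct.unpack_from(HEADER_FMT, blob, 0)
    if magic != MAGIC:
        raise ValueError("not an overfit-codec file")
    offset = struct.calcsize(HEADER_FMT)
    shapes = []
    for _ in range(n_layers):
        shapes.append(struct.unpack_from(SHAPE_FMT, blob, offset))
        offset += struct.calcsize(SHAPE_FMT)
    return width, height, shapes, blob[offset:]


blob = pack_codec(640, 480, [(64, 64), (64, 3)], b"\x00" * 16)
print(unpack_codec(blob)[:3])
```

Because the layer shapes live in the header, the encoder is free to pick a bigger or smaller network per image, which is the quality/size knob the comment is after.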

IncreasePosts•19m ago
Isn't this what autoencoders are for?
jxmorris12•18m ago
Lo and behold, a nice arithmetic coding implementation that wasn't written by an LLM! A sight for sore eyes – a treat, even. Looks like it was written by someone else though.

Check it out: https://github.com/samyak112/pym-particles/blob/main/arithme...

jmspring•16m ago
The model is the important part; a Huffman code, adaptive Huffman, or other sorts of encoders would do much better on a dataset when driven by the model. You also need the model to decode. And on a dataset of sufficient size, the cost of embedding the model can be offset by the benefit of its memorization of the file.

A non-general compression algorithm (model - I don't mean a distinct llm, but "modeling data") targeted at a specific dataset will always do better than a general algorithm.

The reason I said the "encoder" doesn't matter: arithmetic coding, for the data it is presented, will beat Huffman/adaptive Huffman every day, but the model is where the real "compression" comes into play.

I've implemented enough "coders" over the years, including arithmetic for both commercial and research purposes (was a student of Glen Langdon).
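
The gap between the two coders shows up most clearly when the model is very confident. A small sketch with a toy two-symbol distribution (the probabilities are illustrative, not taken from the project): Huffman must spend a whole number of bits per symbol, while arithmetic coding can approach the entropy.

```python
import heapq
import math


def entropy_bits(probs):
    """Shannon entropy: the per-symbol cost arithmetic coding can approach."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_lengths(probs):
    """Code length in bits that Huffman assigns to each symbol."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (probability, unique tiebreaker, symbols in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for symbol in s1 + s2:  # every merge adds one bit to these codes
            lengths[symbol] += 1
        heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


probs = [0.99, 0.01]  # a model that is almost certain of the next symbol
avg_huffman = sum(p * l for p, l in zip(probs, huffman_lengths(probs)))
print(f"entropy: {entropy_bits(probs):.3f} bits/symbol")
print(f"huffman: {avg_huffman:.3f} bits/symbol")
```

With these probabilities Huffman is stuck at 1 bit per symbol while the entropy is under 0.1 bits, so a sharply peaked model only pays off when paired with a coder that can spend fractional bits.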

purple-leafy•16m ago
Dumb question: can you train a model to predict the next byte of ANOTHER MODEL

So apply this same logic to compressing a bigger model within a smaller model

I know this is absolutely regarded, but humour me please

anyg•9m ago
Not dumb at all. It's a whole field of active research - Speculative Decoding. A recent paper goes one level deeper with Speculative Speculative Decoding - https://arxiv.org/abs/2603.03251
purple-leafy•7m ago
Oh man awesome! I’m so S-M-R-T

Compression is such an interesting field

17m ago
Sounds like presenting no patterns, apparently or otherwise, would be a pattern in itself.
spidy__•3h ago
Sounds interesting, man. I may be a bit confused, but can you run this on enwik9?
purple-leafy•1h ago
Probably not lol, it’s very specific to PvP multiplayer games, tested on my own game. Maybe I can apply the core concept to enwik9, but I doubt it
andai•1h ago
I was working on a multiplayer game a while ago, and one of the iterations of the netcode was "thin client" where clients just sent input, server simulated the game, and it dumped world state onto the pipe at 60 Hz. I didn't ship that version but I estimated a $3000 bandwidth bill with that approach!

I started looking into diffing the state, compression, etc... until I realized, wait a minute! My player movement is linear so I only need a packet for start and stop! And so I achieved near infinite efficiency improvement :)

I think the word is... a specialized solution can beat a general one.

Also, "remembering what the program actually needs to do, and just making it do that"... I de-pessimized the netcode: https://youtube.com/watch?v=pgoetgxecw8
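
The start/stop idea from this comment can be sketched as follows. This is an illustration under stated assumptions, not andai's actual netcode: the `MoveStart` packet and the helper names are hypothetical, and movement is assumed to be perfectly linear between a start and a stop event.

```python
from dataclasses import dataclass


@dataclass
class MoveStart:
    """Hypothetical 'start moving' packet: where, when, and how fast."""
    t0: float
    x0: float
    y0: float
    vx: float
    vy: float


def position_at(move: MoveStart, t: float):
    """Client-side reconstruction: linear motion needs no per-tick updates."""
    dt = t - move.t0
    return move.x0 + move.vx * dt, move.y0 + move.vy * dt


def packets_full_state(duration_s: float, tick_hz: int = 60) -> int:
    """Packets sent when the server dumps world state every tick."""
    return int(duration_s * tick_hz)


def packets_start_stop() -> int:
    """Packets sent when only the start and stop events go on the wire."""
    return 2


move = MoveStart(t0=0.0, x0=0.0, y0=0.0, vx=2.0, vy=0.0)
print(position_at(move, 1.5))
print(packets_full_state(5.0), "packets vs", packets_start_stop())
```

The saving grows with how long a player keeps moving in a straight line, which is the sense in which a specialized solution beats diffing or compressing the full state.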

purple-leafy•30m ago
$3000 bill wow!!

Clever insight :) yes a specialised solution usually wins! Good effort

Did you end up publishing your game?