Removing newlines in FASTA file increases ZSTD compression ratio by 10x

https://log.bede.im/2025/09/12/zstandard-long-range-genomes.html
65•bede•2d ago

Comments

rini17•2d ago
This might in general be a good preprocessing step: check for punctuation repeating at fixed intervals, remove it, and restore it after decompression.
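
A minimal sketch of that idea in Python (the helper names and the naive stride detection are mine, not an existing tool; trailing-newline handling is omitted for brevity):

    def find_fixed_stride(data: bytes, byte: bytes = b"\n") -> int | None:
        # Accept only a byte that recurs at one fixed interval throughout.
        first = data.find(byte)
        second = data.find(byte, first + 1)
        if first == -1 or second == -1:
            return None
        stride = second - first
        if all(data[i:i + 1] == byte for i in range(first, len(data), stride)):
            return stride
        return None

    def strip_newlines(data: bytes) -> tuple[bytes, int | None]:
        stride = find_fixed_stride(data)
        return (data.replace(b"\n", b""), stride) if stride else (data, None)

    def restore_newlines(data: bytes, stride: int) -> bytes:
        width = stride - 1  # payload bytes between two stripped newlines
        return b"\n".join(data[i:i + width] for i in range(0, len(data), width))
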
bede•2d ago
Yes, it sounds like 7-Zip/LZMA can do this using custom filters, among other more exotic (and slow) statistical compression approaches.
vintermann•1h ago
That turns it into specialized compression, which DNA already has plenty of. Many forms of specialized compression even allow string-related queries directly on the compressed data.
Kim_Bruning•2h ago
Now I'm wondering why this works. DNA clearly has some interesting redundancy strategies. (it might also depend on genome?)
dwattttt•2h ago
The FASTA format stores nucleotides in text form... compression is used to make this tractable at genome sizes, but it's by no means perfect.

Depending on what you need to represent, you can get a 4x reduction in data size without compression at all, just by representing each of G, A, T, and C with 2 bits rather than 8.

Compression on top of that "should" result in the same compressed size as the original text (after all, the "information" being compressed is the same), except that compression isn't perfect.

Newlines are an example of something that's "information" in the text format that isn't relevant, yet the compression scheme didn't know that.
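
A minimal sketch of that 2-bit packing (assumes a sequence containing only A, C, G, and T, so no N or ambiguity codes):

    CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def pack(seq: str) -> bytes:
        # Four bases per byte; the tail is padded with A (0b00).
        out = bytearray()
        for i in range(0, len(seq), 4):
            chunk = seq[i:i + 4].ljust(4, "A")
            out.append((CODE[chunk[0]] << 6) | (CODE[chunk[1]] << 4)
                       | (CODE[chunk[2]] << 2) | CODE[chunk[3]])
        return bytes(out)

    print(len("GATTACA" * 1000), len(pack("GATTACA" * 1000)))  # 7000 vs 1750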

hyghjiyhu•1h ago
I think one important factor you've missed is frameshifting. Compression algorithms work on bytes - 8 bits. Imagine that the exact same sequence occurs twice, but at different offsets mod 4. Then your encoding will give completely different results, and the compression algorithm will be unable to make use of the repetition.
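
A self-contained illustration: the same run of bases produces entirely different bytes once its offset shifts by a single base.

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack(seq: str) -> bytes:
        seq = seq[: len(seq) // 4 * 4]  # drop any incomplete trailing group
        return bytes(
            (CODE[seq[i]] << 6) | (CODE[seq[i + 1]] << 4)
            | (CODE[seq[i + 2]] << 2) | CODE[seq[i + 3]]
            for i in range(0, len(seq), 4)
        )

    run = "GATTACAT" * 4
    print(pack(run).hex())        # the run aligned to a byte boundary
    print(pack("C" + run).hex())  # shifted by one base: no byte in common
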
vintermann•1h ago
This is a dataset of bacterial DNA. Any two related bacteria will have long strings of the same letters, but these won't be neatly aligned, so the line breaks will mess up pattern matching.
bede•1h ago
Exactly. The line breaks break the runs of otherwise identical bytes in identical sequences. Unless two identical subsequences are exactly in phase with respect to their line breaks, the hashes used for long-range matching differ, even though the underlying sequences are identical.
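
A sketch of that effect using the python-zstandard bindings (the sequence is synthetic; the point is the gap between the two numbers, not their exact values):

    import random
    import zstandard

    random.seed(0)
    seq = "".join(random.choice("ACGT") for _ in range(1_000_000))

    def wrap(s: str) -> str:
        return "\n".join(s[i:i + 60] for i in range(0, len(s), 60))

    # Two copies of one sequence: line breaks in phase vs. shifted by 4 bases.
    in_phase = (wrap(seq) + "\n" + wrap(seq)).encode()
    out_of_phase = (wrap(seq) + "\n" + wrap("ACGT" + seq)).encode()

    cctx = zstandard.ZstdCompressor(level=19)
    print(len(cctx.compress(in_phase)), len(cctx.compress(out_of_phase)))
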
leobuskin•1h ago
What about a specialized dict for FASTA? Shouldn't it increase ZSTD compression significantly?
bede•1h ago
Yes, I'd expect a dict-based approach to do better here. That's probably how it should be done. But --long is compelling for me because using it requires almost no effort, it's still very fast, and yet it can dramatically improve compression ratio.
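
For reference, a sketch of what the dict route could look like with python-zstandard (the overlapping toy records stand in for real FASTA samples; training can fail on degenerate input):

    import random
    import zstandard

    random.seed(0)
    genome = "".join(random.choice("ACGT") for _ in range(100_000))
    samples = [  # overlapping slices mimic a set of related records
        f">seq{i}\n".encode() + genome[i * 400:i * 400 + 2000].encode()
        for i in range(200)
    ]

    dict_data = zstandard.train_dictionary(16_384, samples)
    cctx = zstandard.ZstdCompressor(level=19, dict_data=dict_data)
    dctx = zstandard.ZstdDecompressor(dict_data=dict_data)
    blob = cctx.compress(samples[0])
    assert dctx.decompress(blob) == samples[0]
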
mfld•1h ago

    Using larger-than-default window sizes has the drawback of requiring that the same --long=xx argument be passed during decompression reducing compatibility somewhat.
Interesting. Any idea why this can't be stored in the metadata of the compressed file?
nolist_policy•1h ago
It uses more memory (up to +2 GB) during decompression as well -> potential DoS.
Aachen•54m ago
Sending a .zip filled with all zeroes, so that it compresses extremely well, is a historically well-known DoS (the zip bomb: the server runs out of space trying to read the archive).

You always need resource limits when dealing with untrusted data, and RAM is one of the obvious ones. They could introduce a memory limit parameter; require passing --long with a value equal to or greater than what the stream requires to successfully decompress; require seeking support for the input stream so they can look back that way (TMTO); fall back to using temp files; or interactively prompt the user if there's a terminal attached. Lots of options, each with pros and cons of course, that would all allow the required information for the decoder to be stored in the compressed data file.
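
For what it's worth, the python-zstandard bindings already expose a decoder-side cap along these lines; a sketch (sizes chosen so the frame's window exceeds the cap, which should trigger rejection):

    import zstandard

    params = zstandard.ZstdCompressionParameters.from_level(
        3, window_log=27, enable_ldm=True)  # big window + long-distance matching
    blob = zstandard.ZstdCompressor(compression_params=params).compress(
        bytes(16 * 1024 * 1024))  # 16 MiB of zeros

    # Cap decoder window memory at 8 MiB; this frame should be refused.
    dctx = zstandard.ZstdDecompressor(max_window_size=8 * 1024 * 1024)
    try:
        dctx.decompress(blob)
    except zstandard.ZstdError as exc:
        print("rejected:", exc)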

lifthrasiir•47m ago
It is stored in the metadata [1], but anything larger than 8 MiB is not guaranteed to be supported. So there has to be an out-of-band agreement between compressor and decompressor.

[1] https://datatracker.ietf.org/doc/html/rfc8878#name-window-de...
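
For example, python-zstandard can read that descriptor back out of a frame header:

    import zstandard

    blob = zstandard.ZstdCompressor(level=3).compress(b"ACGT" * 100)
    params = zstandard.get_frame_parameters(blob)
    print(params.window_size)  # parsed from the frame header, per RFC 8878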

pbronez•20m ago
Seems useful for games marketplaces like Steam and Xbox. You control the CDN and client, so you can use tricky but effective compression settings all day long.
ashvardanian•44m ago
Nice observation!

Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)

I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: unzip headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics — most pipelines stick to plain text even when modern data tooling would make things much easier :(
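
A minimal sketch of that conversion (assumes pyarrow; the column names are arbitrary):

    import pyarrow as pa
    import pyarrow.parquet as pq

    def fasta_to_parquet(path: str, out: str) -> None:
        # Split headers from sequences, then persist as a two-column table.
        ids, seqs, cur = [], [], []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if cur:
                        seqs.append("".join(cur))
                        cur = []
                    ids.append(line[1:])
                else:
                    cur.append(line)
        if cur:
            seqs.append("".join(cur))
        pq.write_table(pa.table({"id": ids, "sequence": seqs}), out)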

Aachen•42m ago
I've also noticed this. Zstandard doesn't see some very common patterns.

For me it was an increasing number (think of unix timestamps in a data logger that stores one entry per second, so you are just counting up until there's a gap in your data); in the article it's a fixed value every 60 bytes.

Of course, our brains are exceedingly good at finding patterns (to the point where we often find phantom ones). I was just expecting some basic checks like "does it make sense to store the difference instead of the absolute value for some of these bytes here". Seeing as the difference is 0 between every 60th byte in the submitted article, that should fix both our issues
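
A quick sketch of the delta idea (python-zstandard again; the exact numbers will vary):

    import struct
    import zstandard

    stamps = range(1_700_000_000, 1_700_100_000)  # one entry per second
    raw = b"".join(struct.pack(">I", t) for t in stamps)
    delta = struct.pack(">I", stamps[0]) + b"".join(
        struct.pack(">I", b - a) for a, b in zip(stamps, stamps[1:]))

    cctx = zstandard.ZstdCompressor(level=19)
    print(len(cctx.compress(raw)), len(cctx.compress(delta)))  # delta wins big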

Bzip2 performed much better for me, but it's also incredibly slow. If it were only the compressor, that might be fine for many applications, but decompressing too is an exercise in patience, so I've moved to Zstandard as the standard thing to use.

pajko•32m ago
Bzip2 performs better precisely because it rearranges the input to achieve better pattern matches: https://en.m.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_tran...
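
A toy version of the transform makes the effect visible (naive rotation sort; real bzip2 uses much faster suffix sorting):

    def bwt(s: str, eos: str = "\x00") -> str:
        # Sort all rotations of s + eos and take the last column.
        s += eos
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    print(repr(bwt("GATTACA" * 3)))  # shared contexts cluster letters into runs
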
semiinfinitely•41m ago
FASTA is a candidate for the stupidest file format ever invented and a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.
semiinfinitely•34m ago
other file formats that rival fasta in stupidity include fastq, pdb, bed, sam, cram, and vcf. further reading [1]

> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"

1. https://madhadron.com/science/farewell_to_bioinformatics.htm...

Fraterkes•32m ago
I’ll do you the immense favor of taking the bait. What’s so bad about it?
jefftk•39m ago
The FASTA format looks like:

    > title
    bases with optional newlines
    > title
    bases with optional newlines
    ...
The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file.

It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
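
A minimal sketch of exactly that unwrapping (headers keep their newlines, so the output stays valid FASTA):

    def unwrap_fasta(text: str) -> str:
        # Join wrapped sequence lines; keep each header on its own line.
        out, seq = [], []
        for line in text.splitlines():
            if line.startswith(">"):
                if seq:
                    out.append("".join(seq))
                    seq = []
                out.append(line)
            else:
                seq.append(line)
        if seq:
            out.append("".join(seq))
        return "\n".join(out) + "\n"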

AndrewOMartin•12m ago
The compression ratio would likely skyrocket if you sorted the list of bases.
IshKebab•34m ago
Damn, surely you stop using ASCII formats before your dataset gets to 2 TB??
FL33TW00D•18m ago
Looking forward to the relegation of FASTQ and FASTA to the depths of hell where they belong. Incredibly inefficient and poorly designed formats.

RustGPT: A pure-Rust transformer LLM built from scratch

https://github.com/tekaratzas/RustGPT
135•amazonhut•2h ago•49 comments

Removing newlines in FASTA file increases ZSTD compression ratio by 10x

https://log.bede.im/2025/09/12/zstandard-long-range-genomes.html
65•bede•2d ago•25 comments

Folks, we have the best π

https://lcamtuf.substack.com/p/folks-we-have-the-best
130•fratellobigio•5h ago•43 comments

Language Models Pack Billions of Concepts into 12k Dimensions

https://nickyoder.com/johnson-lindenstrauss/
225•lawrenceyan•8h ago•73 comments

Betty Crocker broke recipes by shrinking boxes

https://www.cubbyathome.com/boxed-cake-mix-sizes-have-shrunk-80045058
424•Avshalom•14h ago•493 comments

PythonBPF – Writing eBPF Programs in Pure Python

https://xeon.me/gnome/pythonbpf/
74•JNRowe•3d ago•13 comments

Human writers have always used the em dash

https://www.theringer.com/2025/08/20/pop-culture/em-dash-use-ai-artificial-intelligence-chatgpt-g...
25•FromTheArchives•2d ago•6 comments

Celestia – Real-time 3D visualization of space

https://celestiaproject.space/
84•LordNibbler•6h ago•17 comments

Grapevine canes can be converted into plastic-like material that will decompose

https://www.sdstate.edu/news/2025/08/can-grapevines-help-slow-plastic-waste-problem
344•westurner•14h ago•265 comments

Which colours dominate movie posters and why?

https://stephenfollows.com/p/which-colours-dominate-movie-posters-and-why
135•FromTheArchives•2d ago•21 comments

NASA's Guardian Tsunami Detection Tech Catches Wave in Real Time

https://www.jpl.nasa.gov/news/nasas-guardian-tsunami-detection-tech-catches-wave-in-real-time/
41•geox•2d ago•6 comments

Sandboxing Browser AI Agents

https://www.earlence.com/blog.html#/post/cellmate
25•earlence•3d ago•0 comments

The $10 Payment That Cost Me $43.95 – The Madness of SaaS Chargebacks

https://medium.com/@citizenblr/the-10-payment-that-cost-me-43-95-the-madness-of-saas-chargebacks-...
21•evermike•46m ago•23 comments

For Good First Issue – A repository of social impact and open source projects

https://forgoodfirstissue.github.com/
65•Brysonbw•10h ago•11 comments

Which NPM package has the largest version number?

https://adamhl.dev/blog/largest-number-in-npm-package/
108•genshii•9h ago•46 comments

Omarchy on CachyOS

https://github.com/mroboff/omarchy-on-cachyos
49•theYipster•7h ago•31 comments

Analyzing the memory ordering models of the Apple M1

https://www.sciencedirect.com/science/article/pii/S1383762124000390
97•charles_irl•3d ago•40 comments

A qualitative analysis of pig-butchering scams

https://arxiv.org/abs/2503.20821
122•stmw•8h ago•55 comments

You’re a slow thinker. Now what?

https://chillphysicsenjoyer.substack.com/p/youre-a-slow-thinker-now-what
445•sebg•4d ago•174 comments

Death to Type Classes

https://jappie.me/death-to-type-classes.html
40•zeepthee•3d ago•25 comments

Titania Programming Language

https://github.com/gingerBill/titania
93•MaximilianEmel•13h ago•41 comments

Learning Lens Blur Fields

https://blur-fields.github.io/
47•bookofjoe•3d ago•11 comments

A set of smooth, fzf-powered shell aliases & functions for systemctl

https://silverrainz.me/blog/2025-09-systemd-fzf-aliases.html
12•SilverRainZ•2d ago•5 comments

Why We Spiral

https://behavioralscientist.org/why-we-spiral/
320•gmays•21h ago•87 comments

OCSP Service Has Reached End of Life

https://letsencrypt.org/2025/08/06/ocsp-service-has-reached-end-of-life
195•pfexec•16h ago•61 comments

Writing an operating system kernel from scratch

https://popovicu.com/posts/writing-an-operating-system-kernel-from-scratch/
304•Bogdanp•20h ago•58 comments

Page Object (2013)

https://martinfowler.com/bliki/PageObject.html
30•adityaathalye•4d ago•20 comments

Introduction to GrapheneOS

https://dataswamp.org/~solene/2025-01-12-intro-to-grapheneos.html
205•renehsz•4d ago•198 comments

AMD Turin PSP binaries analysis from open-source firmware perspective

https://blog.3mdeb.com/2025/2025-09-11-gigabyte-mz33-ar1-blob-analysis/
66•pietrushnic•14h ago•12 comments

Trigger Crossbar

https://serd.es/2025/09/14/Trigger-crossbar.html
77•zdw•14h ago•11 comments