frontpage.

NASA now allowing astronauts to bring their smartphones on space missions

https://twitter.com/NASAAdmin/status/2019259382962307393
2•gbugniot•4m ago•0 comments

Claude Code Is the Inflection Point

https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point
1•throwaw12•6m ago•0 comments

MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

https://github.com/microclaw/microclaw
1•everettjf•6m ago•2 comments

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

https://github.com/AleatorAI/OMNI-BLAS
1•LowSpecEng•7m ago•1 comments

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

https://codemanship.wordpress.com/2026/01/05/the-ai-ready-software-developer-conclusion-same-game...
1•lifeisstillgood•9m ago•0 comments

AI Agent Automates Google Stock Analysis from Financial Reports

https://pardusai.org/view/54c6646b9e273bbe103b76256a91a7f30da624062a8a6eeb16febfe403efd078
1•JasonHEIN•12m ago•0 comments

Voxtral Realtime 4B Pure C Implementation

https://github.com/antirez/voxtral.c
1•andreabat•14m ago•0 comments

I Was Trapped in Chinese Mafia Crypto Slavery [video]

https://www.youtube.com/watch?v=zOcNaWmmn0A
1•mgh2•20m ago•0 comments

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

https://www.cbp.gov/newsroom/stats/reported-employee-arrests
1•ludicrousdispla•22m ago•0 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•27m ago•1 comments

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

https://github.com/thealidev/VectorVision-SVGV
1•thealidev•29m ago•0 comments

Study of 150 developers shows AI generated code no harder to maintain long term

https://www.youtube.com/watch?v=b9EbCb5A408
1•lifeisstillgood•29m ago•0 comments

Spotify now requires premium accounts for developer mode API access

https://www.neowin.net/news/spotify-now-requires-premium-accounts-for-developer-mode-api-access/
1•bundie•32m ago•0 comments

When Albert Einstein Moved to Princeton

https://twitter.com/Math_files/status/2020017485815456224
1•keepamovin•33m ago•0 comments

Agents.md as a Dark Signal

https://joshmock.com/post/2026-agents-md-as-a-dark-signal/
2•birdculture•35m ago•0 comments

System time, clocks, and their syncing in macOS

https://eclecticlight.co/2025/05/21/system-time-clocks-and-their-syncing-in-macos/
1•fanf2•37m ago•0 comments

McCLIM and 7GUIs – Part 1: The Counter

https://turtleware.eu/posts/McCLIM-and-7GUIs---Part-1-The-Counter.html
2•ramenbytes•39m ago•0 comments

So whats the next word, then? Almost-no-math intro to transformer models

https://matthias-kainer.de/blog/posts/so-whats-the-next-word-then-/
1•oesimania•41m ago•0 comments

Ed Zitron: The Hater's Guide to Microsoft

https://bsky.app/profile/edzitron.com/post/3me7ibeym2c2n
2•vintagedave•44m ago•1 comments

UK infants ill after drinking contaminated baby formula of Nestle and Danone

https://www.bbc.com/news/articles/c931rxnwn3lo
1•__natty__•44m ago•0 comments

Show HN: Android-based audio player for seniors – Homer Audio Player

https://homeraudioplayer.app
3•cinusek•45m ago•1 comments

Starter Template for Ory Kratos

https://github.com/Samuelk0nrad/docker-ory
1•samuel_0xK•46m ago•0 comments

LLMs are powerful, but enterprises are deterministic by nature

2•prateekdalal•50m ago•0 comments

Make your iPad 3 a touchscreen for your computer

https://github.com/lemonjesus/ipad-touch-screen
2•0y•55m ago•1 comments

Internationalization and Localization in the Age of Agents

https://myblog.ru/internationalization-and-localization-in-the-age-of-agents
1•xenator•55m ago•0 comments

Building a Custom Clawdbot Workflow to Automate Website Creation

https://seedance2api.org/
1•pekingzcc•58m ago•1 comments

Why the "Taiwan Dome" won't survive a Chinese attack

https://www.lowyinstitute.org/the-interpreter/why-taiwan-dome-won-t-survive-chinese-attack
2•ryan_j_naughton•58m ago•0 comments

Xkcd: Game AIs

https://xkcd.com/1002/
2•ravenical•1h ago•0 comments

Windows 11 is finally killing off legacy printer drivers in 2026

https://www.windowscentral.com/microsoft/windows-11/windows-11-finally-pulls-the-plug-on-legacy-p...
2•ValdikSS•1h ago•0 comments

From Offloading to Engagement (Study on Generative AI)

https://www.mdpi.com/2306-5729/10/11/172
1•boshomi•1h ago•1 comments

OpenZL: An open source format-aware compression framework

https://engineering.fb.com/2025/10/06/developer-tools/openzl-open-source-format-aware-compression-framework/
434•terrelln•4mo ago
https://github.com/facebook/openzl

https://arxiv.org/abs/2510.03203

https://openzl.org/

Comments

felixhandte•4mo ago
In addition to the blog post, here are the other things we've published today:

Code: https://github.com/facebook/openzl

Documentation: https://openzl.org/

White Paper: https://arxiv.org/abs/2510.03203

dang•4mo ago
We'll put those links in the toptext above.
unsigner•4mo ago
Congrats on the release. I was wondering what the zstd team has been up to lately.

You mentioned something about grid structured data being in the plans - can you give more details?

Have you done experiments with compressing BCn GPU texture formats? They have a peculiar branched structure, with multiple sub formats packed tightly in bitfields of 64- or 128-bit blocks; due to the requirement of fixed ratio and random access by the GPU they still leave some potential compression on the table.

waustin•4mo ago
This is such a leap forward it's hard to believe it's anything but magic.
gmuslera•4mo ago
I used to see it as magic that the old original compression algorithms worked so well on generic text, without worrying about format, file type, structure, or other things that could give hints of additional redundancy.
wmf•4mo ago
Compared to columnar databases this is more of an incremental improvement.
kingstnap•4mo ago
Wow this sounds nuts. I want to try this on some large csvs later today.
felixhandte•4mo ago
Let us know how it goes!

We developed OpenZL initially for our own consumption at Meta. More recently we've been putting a lot of effort into making this a usable tool for people who, you know, didn't develop OpenZL. Your feedback is welcome!

ionelaipatioaei•4mo ago
I must be doing something wrong, but I couldn't manage to compress a file using a custom-trained profile since I was getting this error:

```
src/openzl/codecs/dispatch_string/encode_dispatch_string_binding.c:74: EI_dispatch_string: splitting 48000001 strings into 14 outputs
OpenZL Library Exception:
OpenZL error code: 55
OpenZL error string: Input does not respect conditions for this node
OpenZL error context:
  Code: Input does not respect conditions for this node
  Message: Check `eltWidth != 2' failed where: lhs = (unsigned long) 4 rhs = (unsigned long) 2
  Graph ID: 5
  Stack Trace:
  #0 doEntropyConversion (src/openzl/codecs/entropy/encode_entropy_binding.c:788): Check `eltWidth != 2' failed where: lhs = (unsigned long) 4 rhs = (unsigned long) 2
  #1 EI_entropyDynamicGraph (src/openzl/codecs/entropy/encode_entropy_binding.c:860): Forwarding error:
  #2 CCTX_runGraph_internal (src/openzl/compress/cctx.c:770): Forwarding error:
  #3 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
  #4 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
  #5 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
  #6 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
  #7 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
  #8 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
  #9 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
  #10 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
  #11 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
  #12 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
  #13 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
  #14 CCTX_startCompression (src/openzl/compress/cctx.c:1276): Forwarding error:
  #15 CCTX_compressInputs_withGraphSet_stage2 (src/openzl/compress/compress2.c:116): Forwarding error:
```

On the other hand, the default CSV profile didn't seem that great either: the CSV file was 349 MB and it compressed down to 119 MB, while a ZIP of the same CSV is 105 MB.

TheKaibosh•4mo ago
This is unexpected... I'm interested in seeing what's happening here. Do you mind creating a Github issue with as much info as you're comfortable sharing? https://github.com/facebook/openzl/issues
zzulus•4mo ago
Meta's Nimble is natively integrated with OpenZL (pre-OSS version), and is benefiting from it immensely.
terrelln•4mo ago
Yeah, backend compression in columnar data formats is a natural fit for OpenZL. Knowing the data it is compressing is numeric, e.g. a column of i64 or float, allows for immediate wins over Zstandard.
felixhandte•4mo ago
It was really hard to resist spilling the beans about OpenZL on this recent HN post about compressing genomic sequence data [0]. It's a great example of the really simple transformations you can perform on data that can unlock significant compression improvements. OpenZL can perform that transformation internally (quite easily with SDDL!).

[0] https://news.ycombinator.com/item?id=45223827
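
As an illustrative sketch (not OpenZL code, and not necessarily the exact transform used in the linked thread): one plausible "really simple transformation" for sequence data is packing an ACGT-only stream into 2 bits per base before any generic entropy coding. Everything below is made up for illustration; real FASTA data would also need headers and ambiguity codes handled.

```
# Toy illustration only (not OpenZL code): pack an ACGT-only sequence into
# 2 bits per base before handing the result to a generic entropy coder.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for j, base in enumerate(seq[i:i + 4]):
            b |= CODE[base] << (2 * j)   # 4 bases per output byte
        out.append(b)
    return bytes(out)

print(pack_2bit("ACGTACGT").hex())  # 'e4e4': 8 bases stored in 2 bytes
```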

bede•4mo ago
Author of [0] here. Congratulations and well done for resisting. Eager to try it!

Edit: Have you any specific advice for training a fasta compressor beyond that given in e.g. "Using OpenZL" (https://openzl.org/getting-started/using-openzl/)?

perching_aix•4mo ago
That post immediately came to my mind too! Do you maybe have a comparison to share with respect to the specialized compressor mentioned in the OP there?

> Grace Blackwell’s 2.6Tbp 661k dataset is a classic choice for benchmarking methods in microbial genomics. (...) Karel Břinda’s specialist MiniPhy approach takes this dataset from 2.46TiB to just 27GiB (CR: 91) by clustering and compressing similar genomes together.

Gethsemane•4mo ago
I'd love to see some benchmarks for this on some common genomic formats (fa, fq, sam, vcf). Will be doubly interesting to see its applicability to nanopore data - lots of useful data is lost because storing FAST5/POD5 is a pain.
jayknight•4mo ago
And a comparison between CRAM and openzl on a sam/bam file. Is openzl indexable, where you can just extract and decompress the data you need from a file if you know where it is?
terrelln•4mo ago
> Is openzl indexable

Not today. However, we are considering this as we are continuing to evolve the frame format, and it is likely we will add this feature in the future.

jltsiren•4mo ago
OpenZL compressed SAM/BAM vs. CRAM is the interesting comparison. It would really test the flexibility of the framework. Can OpenZL reach the same level of compression, and how much effort does it take?

I would not expect much improvement in compressing nanopore data. If you have a useful model of the data, creating a custom compressor is not that difficult. It takes some effort, but those formats are popular enough that compressors using the known models should already exist.

terrelln•4mo ago
Do you happen to have a pointer to a good open source dataset to look at?

Naively, and knowing little about CRAM, I would expect that OpenZL would beat Zstd handily out of the box, but would need additional capabilities to match the performance of CRAM, since genomics hasn't been a focus as of yet. But it would be interesting to see how much of what we need to add is generic to all compression (but useful for genomics), vs. techniques that are specific only to genomics.

We're planning on setting up a blog on our website to highlight use cases of OpenZL. I'd love to make a post about this.

bede•4mo ago
For BAM this could be a good place to start: https://www.htslib.org/benchmarks/CRAM.html

Happy to discuss further

terrelln•4mo ago
Amazing, thank you!

I will take a look as soon as I get a chance. Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting.

fwip•4mo ago
Another format that might be worth looking at in the bioinformatics world is hdf5. It's sort of a generic file format, often used for storing multiple related large tables. It has some built-in compression (gzip IIRC) but supports plugins. There may be an opportunity to integrate the self-describing nature of the hdf5 format with the self-describing decompression routines of openZL.
felixhandte•4mo ago
Wanna hop over to https://github.com/facebook/openzl/issues/76?
felixhandte•4mo ago
Update: let's continue discussing genomic sequence compression on https://github.com/facebook/openzl/issues/76.
fnands•4mo ago
Cool, but what's the Weissman Score?
fnands•4mo ago
Alright, Silicon Valley references are not popular on HN it seems.
slmkbh•4mo ago
Lack of self-irony... I was also looking for this :)

Having just re-watched the show, it is remarkable how little has changed for the better...

bigwheels•4mo ago
How do you use it to compress a directory (or .tar file)? Not seeing any example usages in the repo, `zli compress -o dir.tar.zl dir.tar` ->

  Invalid argument(s):
    No compressor profile or serialized compressor specified.
Same thing for the `train` command.

Edit: @terrelln Got it, thank you!

terrelln•4mo ago
There's a Quick Start guide here:

https://openzl.org/getting-started/quick-start/

However, OpenZL is different in that you need to tell the compressor how to compress your data. The CLI tool has a few builtin "profiles" which you can specify with the `--profile` argument. E.g. csv, parquet, or le-u64. They can be listed with `./zli list-profiles`.

You can always use the `serial` profile, but because you haven't told OpenZL anything about your data, it will just use Zstandard under the hood. Training can learn a compressor, but it won't be able to learn a format like `.tar` today.

If you have raw numeric data you want to throw at it, or Parquets or large CSV files, that's where I would expect OpenZL to perform really well.

ttoinou•4mo ago
Is this similar to Basis? https://github.com/BinomialLLC/basis_universal
modeless•4mo ago
No, not really. They are both cool but solve different problems. The problem Basis solves is that GPUs don't agree on which compressed texture formats to support in hardware. Basis is a single compressed format that can be transcoded to almost any of the formats GPUs support, which is faster and higher quality than e.g. decoding a JPEG and then re-encoding to a GPU format.
ttoinou•4mo ago
Thanks. I thought basis also had specific encoders depending on the typical average / nature of the data input, like this OpenZL project
modeless•4mo ago
It probably does have different modes that it selects based on the input data. I don't know that much about the implementation of image compression, but I know that PNG for example has several preprocessing modes that can be selected based on the image contents, which transform the data before entropy encoding for better results.

The difference with OpenZL IIUC seems to be that it has some language that can flexibly describe a family of transformations, which can be serialized and included with the compressed data for the decoder to use. So instead of choosing between a fixed set of transformations built into the decoder ahead of time, as in PNG, you can apply arbitrary transformations (as long as they can be represented in their format).
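
As a toy sketch of the kind of preprocessing transform described above (not PNG's or OpenZL's actual code): a delta filter that replaces each byte with its difference from the previous one, so slowly varying data collapses into a few repeated residual values before entropy coding.

```
# Toy sketch of a previous-value delta pre-filter, similar in spirit to
# PNG's "Sub" filter; illustration only.
def delta_filter(data: bytes) -> bytes:
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) & 0xFF)  # residual, wrapped to a byte
        prev = b
    return bytes(out)

def delta_unfilter(residuals: bytes) -> bytes:
    prev = 0
    out = bytearray()
    for r in residuals:
        prev = (prev + r) & 0xFF
        out.append(prev)
    return bytes(out)

ramp = bytes(range(100, 200))                # slowly varying input
assert delta_unfilter(delta_filter(ramp)) == ramp
print(set(delta_filter(ramp)[1:]))           # {1}: residuals collapse to one symbol
```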

ttoinou•4mo ago
Thank you for the explanation!
nunobrito•4mo ago
Well, well. Kind of surprised a tool this good wasn't made available long ago, since the approach is quite sound.

When the data container is understood, the deduplication is far more efficient because now it is targeted.

Licensed as BSD-3-Clause, solid C++ implementation, well documented.

Looking forward to seeing new developments as more file formats are contributed.

mappu•4mo ago
Specialization for file formats is not novel (e.g. 7-Zip uses BCJ2 prefiltering to convert x86 opcodes from absolute to relative JMP instructions), nor is embedding specialized decoder bytecode in the archive (e.g. ZPAQ did this and won a lot of Matt Mahoney's benchmarks), but I think OpenZL's execution here, along with the data description and training system, is really fantastic.
nunobrito•4mo ago
Thanks. I've enjoyed reading more about ZPAQ, but its main focus seems to be versioning (which is quite a useful feature too; I'll try it later), and it doesn't include specialized compression per context.

Like you mention, the expandability is quite something. In a few years we might see a very capable compressor.

maeln•4mo ago
So, as I understand, you describe the structure of your data in an SDL and then the compressor can plan a strategy for how to best compress the various parts of the data?

Honestly looks incredible. Could be amazing to provide a general framework for compressing custom formats.

terrelln•4mo ago
Exactly! SDDL [0] provides a toolkit to do this all with no code, but today it is pretty limited. We will be expanding its feature set, but in the meantime you can also write code in C++ or Python to parse your format. And this code is compression-side only, so the decompressor is agnostic to your format.

[0] https://openzl.org/api/c/graphs/sddl/

maeln•4mo ago
Now I cannot stop thinking about how I can fit this somewhere in my work hehe. ZStandard already blew me away when it was released, and this is just another crazy work. And being able to access this kind of state-of-the-art algo' for free and open-source is the oh so sweet cherry on top
touisteur•4mo ago
How happy I am to have all written/read data going through a DSL. On to generating the code to make OpenZL happy...
d33•4mo ago
I've recently been wondering: could you re-compress gzip to a better compression format, while keeping all instructions that would let you recover a byte-exact copy of the original file? I often work with huge gzip files and they're a pain to work with, because decompression is slow even with zlib-ng.
artemisart•4mo ago
I may be misunderstanding the question, but that should be just decompressing the gzip and compressing with something better like zstd (and saving the gzip options to compress it back); however, it won't avoid compressing and decompressing gzip.
mappu•4mo ago
precomp/antix/... are tools that can bruteforce the original gzip parameters and let you recreate the byte-identical gzip archive.

The output is something like {precomp header}{gzip parameters}{original uncompressed data} which you can then feed to a stronger compressor.

A major use case is if you have a lot of individually gzipped archives with similar internal content, you can precomp them and then use long-range solid compression over all your archives together for massive space savings.

Dylan16807•4mo ago
> A major use case is if you have a lot of individually gzipped archives with similar internal content, you can precomp them and then use long-range solid compression over all your archives together for massive space savings.

Or even a single gzipped archive with similar pieces of content that are more than 32KB apart.

o11c•4mo ago
That's called `pristine-gz`, part of the `pristine-tar` project.
d33•4mo ago
Thank you! It seems to be what I'm looking for.
dist-epoch•4mo ago
Is this useful for highly repetitive JSON data? Something like stock prices for example, one JSON per line.

Unclear if this has enough "structure" for OpenZL.

wmf•4mo ago
Maybe convert to BSON first then compress it.
terrelln•4mo ago
You'd have to tell OpenZL what your format looks like by writing a tokenizer for it and annotating which parts are which. We aim to make this easier with SDDL [0], but today it is not powerful enough to parse JSON. However, you can do that in C++ or Python.

Additionally, it works well on numeric data in native format. But JSON stores it in ASCII. We can transform ASCII integers into int64 data losslessly, but it is very hard to transform ASCII floats into doubles losslessly and reliably.

However, given the work to parse the data (and/or massage it to a more friendly format), I would expect that OpenZL would work very well. Highly repetitive, numeric data with a lot of structure is where OpenZL excels.

[0] https://openzl.org/api/c/graphs/sddl/
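
A minimal sketch of that tokenization idea (not the OpenZL API; the field name "qty" and the records are hypothetical): pull an ASCII integer field out of JSON-lines records into a native little-endian int64 stream, which is the kind of homogeneous numeric data a numeric codec can exploit.

```
import json
import struct

# Toy tokenizer sketch, not the OpenZL API. Hypothetical JSON-lines records
# with an integer field "qty"; assumes every record carries it.
records = [
    '{"symbol": "ABC", "qty": 100}',
    '{"symbol": "ABC", "qty": 105}',
    '{"symbol": "XYZ", "qty": 100}',
]

qty = [json.loads(line)["qty"] for line in records]   # ASCII ints -> Python ints
column = struct.pack(f"<{len(qty)}q", *qty)           # native int64 column
rest = [{k: v for k, v in json.loads(line).items() if k != "qty"}
        for line in records]                          # the non-numeric remainder

print(len(column))  # 24: three int64 values, ready for a numeric codec
```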

kstenerud•4mo ago
I've done a binary representation of JSON-structured data that uses unary coding for variable length length fields: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...

This tends to confuse generic compressors, even though the sub-byte data itself usually clusters around the smaller lengths for most data and thus can be quite repetitive (plus it's super efficient to encode/decode). Could this be described such that OpenZL can capitalize on it?

michalsustr•4mo ago
Are you thinking about adding stream support? I.e. something along the lines of (i) building an efficient vocabulary up front for the whole dataset, and then (ii) compressing by chunks, so it can be decompressed by chunks as well. This is important for seeking in data and stream processing.
felixhandte•4mo ago
Yes, definitely! Chunking support is currently in development. Streaming and seeking and so on are features we will certainly pursue as we mature towards an eventual v1.0.0.
michalsustr•4mo ago
Great! I find Apache Arrow IPC to be the most sensible format I've found for organising stream data: headers first, so you learn what data you're working with; columnar for good SIMD and compression; deeply nested data structures supported. Might serve as an inspiration.
TheMode•4mo ago
I understand it cannot work well on random text files, but would it support structured text? Like .c, .java or even JSON
jmakov•4mo ago
Wonder how it compares to zstd-9 since they only mention zstd-3
terrelln•4mo ago
The charts in the "Results With OpenZL" section compare against all levels of zstd, xz, and zlib.

On highly structured data where OpenZL is able to understand the format, it blows Zstandard and Xz out of the water. However, not all data fits this bill.

jmakov•4mo ago
Couldn't the input be automatically described/guessed using a few rows of data and an LLM?
terrelln•4mo ago
You could have an LLM generate the SDDL description [0] for you, or even have it write a C++ or Python tokenizer. If compression succeeds, then it is guaranteed to round trip, as the LLM-generated logic lives only on the compression side, and the decompressor is agnostic to it.

It could be a problem that is well-suited to machine learning, as there is a clear objective function: did compression succeed, and if so, what is the compressed size?

[0] https://openzl.org/api/c/graphs/sddl/

stepanhruda•4mo ago
Is there a way to use this with blosc?
yubblegum•4mo ago
Couldn't find in the paper a description of how the DAG itself is encoded. Any ideas?
terrelln•4mo ago
We left it out of the paper because it is an implementation detail that is absolutely going to change as we evolve the format. This is the function that actually does it [0], but there really isn't anything special here. There are some bit-packing tricks to save some bits, but nothing crazy.

Down the line, we expect to improve this representation to shrink it further, which is important for small data. And to allow to move this representation, or parts of it, into a dictionary, for tiny data.

[0] https://github.com/facebook/openzl/blob/d1f05d0aa7b8d80627e5...

yubblegum•4mo ago
Thanks! (Super cool idea btw.)
viraptor•4mo ago
I wonder, given the docs, how well could AI translate imhex and Kaitai descriptions into SDDL. We could get a few good schemas quickly that way.
felixhandte•4mo ago
Ooh, thanks for mentioning these! I wasn't aware of the existence of these tools but yes it seems very possible that you could transform these other spec formats into SDDL descriptions. I'll check them out.
pabs3•4mo ago
There are a ton of these. GNU Poke comes to mind too.
dloss•4mo ago
Here's a list: https://github.com/dloss/binary-parsing
Havoc•4mo ago
That looks great

Are the compression speed charts all like-for-like in terms of what is hardware-accelerated vs. not?

felixhandte•4mo ago
Yes. None of the algorithms under test used any hardware acceleration in the benchmarks we ran.
fitzn•4mo ago
Non-Linear Compression! We had a tiny idea back in the day in this space but never got too far with it (https://www.usenix.org/conference/hotstorage12/workshop-prog...).

I am pumped to see this. Thanks for sharing.

magicalhippo•4mo ago
On a semi-related note, there was recently a discussion[1] on the F3 file format, which also allows for format-aware compression by embedding the decompressor code as WASM. Though the main motivation for F3 was future compatibility, it does allow for bespoke compression algorithms.

This takes a very different approach and wouldn't require a full WASM runtime. It does have the SDDL compiler and runtime, though I assume that's a lighter dependency.

[1]: https://news.ycombinator.com/item?id=45437759 F3: Open-source data file format for the future [pdf] (125 comments)

snapplebobapple•4mo ago
Isn't that a huge vector for viruses if executable code is included in the compressed archive?
themerone•4mo ago
Wasm can be sandboxed. It's as safe as visiting a website with JavaScript.
orangeboats•4mo ago
Can't the decompressor still produce a malicious uncompressed file?
tlb•4mo ago
Any decompressor can produce a malicious file. Just feed a malicious file to the compressor.
orangeboats•4mo ago
Yes, but currently the decompressors we use (so things like zstd, zlib, 7z) come from a mostly-verifiable source -- either you downloaded it straight from the official site, or you got it from your distro repo.

However, we are talking about an arbitrary decompressor here. The decompressor WASM is sandboxed from the outside world and it can't wreak havoc on your system, true, but nothing stops it from producing a malicious uncompressed file from a known good compressed file.

yorwba•4mo ago
If the decompressor is included in the compressed file and it's malicious, the file can hardly be called known good.
tecleandor•4mo ago
But I also guess the logic of the decompressor could output different files on different occasions, for example if it detects a victim, making it difficult to verify.
viraptor•4mo ago
If it can "detect a victim", then the sandbox is faulty. The decompressor shouldn't see any system details. Only the input and output streams.
mort96•4mo ago
The format-specific decompressor is part of the compressed file. Nothing here crosses a security boundary. Either the compressed file is trustworthy and therefore decompresses into a trustworthy file, or the compressed file is not trustworthy and therefore decompresses into a non-trustworthy file.

If the compressed file is malicious, it doesn't matter whether it's malicious because it originated from a malicious uncompressed file, or is malicious because it originated from a benign uncompressed file and the transformation into a compressed file introduces the malicious parts due to the bundled custom decompressor.

jo-m•4mo ago
So, not very safe.
snapplebobapple•4mo ago
I think this is the first time a genuine technical question of mine, rather than a social view, has been downvoted here. That's sad.
TiredOfLife•4mo ago
And no mention of ZPAQ, which has had an embeddable-decompressor feature for 15 years.
blank_state•4mo ago
you did not read the white paper then
lifthrasiir•4mo ago
As someone seriously trying to develop a compressed archive format with WebAssembly, sandboxing is actually easy and that's indeed why WebAssembly was chosen. The real problem is determinism, which WebAssembly does technically support but actual implementations may vary significantly. And even when WebAssembly can be made fully deterministic, function calls made to those WebAssembly modules may still be non-deterministic! I tried very hard to avoid such pitfalls in my design, and it is entirely reasonable to avoid WebAssembly due to these issues.
bangaladore•4mo ago
I'm confused why determinism is a problem here? You write an algorithm that should produce the same output for a given input. How does WASM make that not deterministic?
lifthrasiir•4mo ago
Assume that I have 120 MB of data to process. Since this is quite large, implementations may want to process them in chunks (say, 50 MB). Now those implementations would call the WebAssembly module multiple times with different arguments, and input sizes would depend on the chunk size. Even though each call is deterministic, if you vary arguments non-deterministically then you lose any benefit of determinism: any bug in the WebAssembly module will corrupt data.
bangaladore•4mo ago
But that is the case in any language and runtime? There is nothing unique about WASM here.
lifthrasiir•4mo ago
Yes and that's exactly my point. It is not enough to make the execution deterministic.

Thinking about that, you may have been confused why I said it's reasonable to avoid WebAssembly for that. I meant that a full Turing-complete execution might not be necessary if that makes it easier to ensure the correctness; OpenZL graphs are not even close to a Turing-complete language for example.

porridgeraisin•4mo ago
This method reminds me of how deep learning models get compressed for deployment on accelerators. You take advantage of different redundancies of different data structures and compress each of them using a unique method.

Specifically the dictionary + delta-encoded + huffman'd index lists method mentioned in TFA, is commonly used for compressing weights. Weights tend to be sparse, but clustered, meaning most offsets are small numbers with the occasional jump, which is great for huffman.
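
A toy sketch of the index-list half of that scheme (an illustration only, not code from any deployment pipeline): delta-encode the positions of nonzero weights so that clustered indices become mostly tiny gaps, exactly the skewed distribution a Huffman coder likes.

```
# Toy sketch: delta-encode sparse nonzero indices before entropy coding.
indices = [3, 4, 5, 9, 10, 11, 12, 500, 501, 502]   # clustered nonzero positions

deltas = [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]
print(deltas)  # [3, 1, 1, 4, 1, 1, 1, 488, 1, 1]: mostly tiny, one big jump

# Decoding just re-accumulates the running sum.
decoded, acc = [], 0
for d in deltas:
    acc += d
    decoded.append(acc)
assert decoded == indices
```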

piterrro•4mo ago
Is it beneficial for log compression, assuming you log to JSON but don't know the schema upfront? I'm working on a log compression tool [0] and I'm wondering whether OpenZL fits there.

[0] https://logdy.dev/logdy-pro

ohnoesjmr•4mo ago
Does this support seekable compression?
xyzzy3000•4mo ago
What's the patent encumberment status of this algorithm?
squirrellous•4mo ago
One of the mentioned examples sounds like the compressor is taking advantage of the SDDL by treating row-oriented data as stripes of column-oriented data, and then compressing that. This makes me curious - for data that’s already column-oriented like Parquet, what’s the advantage of OpenZL over zstd?
felixhandte•4mo ago
SDDL (and the front-end task of reshaping data in general) is only one component of OpenZL. Once you have the streams, you can do all sorts of transformations to them that Zstd doesn't.
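
A minimal sketch of that reshaping step (not OpenZL internals; the record layout is hypothetical): split fixed-width row-oriented records into one stripe per field, so each stripe is a homogeneous stream that can get its own transform.

```
import struct

# Toy reshaping sketch, not OpenZL code. Hypothetical fixed-width records:
# (u32 id, f32 price), packed back to back, little-endian.
rows = b"".join(struct.pack("<If", i, 9.99) for i in range(4))

ids, prices = bytearray(), bytearray()
for off in range(0, len(rows), 8):
    ids += rows[off:off + 4]         # column of u32 ids
    prices += rows[off + 4:off + 8]  # column of f32 prices

# Each stripe is now homogeneous (e.g. delta-encode ids; every price
# byte pattern repeats), which generic byte-oriented LZ can't see as easily.
print(len(ids), len(prices))  # 16 16
```
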
eyegor•4mo ago
Plans for language bindings? Should be trivial to whip up simpler ones like python or dotnet but I didn't see any official bindings yet.
telendram•4mo ago
Python is among the official bindings provided in the repository.
hokkos•4mo ago
It reminds me of EXI compression for XML, which can be heavily optimized with an XSD schema (schema-aware compression) that also uses the schema graph for optimal compression: https://www.w3.org/TR/exi-primer/
p1mrx•4mo ago
I tried compressing some CD quality PCM audio: wav=54MB, zstd=51MB, zl=42MB, flac=39MB.

So OpenZL is significantly better than zstd, but worse than flac.

altcognito•4mo ago
Is that with training or without?
p1mrx•4mo ago
I think training is mandatory. These are the commands I used:

https://gist.github.com/pmarks-net/64c17aff45e7741f07eeb5dd0...

terrelln•4mo ago
Out of curiosity, what was the input file format?

We actually worked on a demo WAV compressor a while back. We are currently missing codecs to run the types of predictors that FLAC runs. We expect to add this kind of functionality in the future, in a generic way that isn't specific to audio, and can be used across a variety of domains.

But generally we wouldn't expect to beat FLAC. Rather, we'd be able to offer specialized compressors for many types of data that previously weren't important enough to spawn a whole field of specialized compressors, by significantly lowering the bar for entry.

p1mrx•4mo ago
The input was just CD audio, "One More Time" by Daft Punk.

test.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz

adrianmonk•4mo ago
This is great stuff!

Any plans to make it so one format can reference another format? Sometimes data of one type occurs within another format, especially with archive files, media container files, and disk images.

So, for example, suppose someone adds a JSON format to OpenZL. Then someone else adds a tar format. While parsing a tar file, if it contains foo.json, there could be some way of saying to OpenZL, "The next 1234 bytes are in the JSON format." (Maybe OpenZL's frames would allow making context shifts like this?)

A related thing that would also be nice is non-contiguous data. Some formats include another format but break up the inner data into blocks. For example, a network capture of a TCP stream would include TCP/IP headers, but the payloads of all the packets together constitute another stream of data in a certain format. (This might get memory intensive, though, since there's multiplexing, so you may need to maintain many streams/contexts.)

felixhandte•4mo ago
The OpenZL core supports arbitrary composition of graphs. So you can do this now via the compressor construction APIs. We just have to figure out how to make it easy to do.
yinnovator•3mo ago
I am trying to compress a file that is a lot larger than 2 GB, but I am getting the error "Unhandled Exception: Chunking support is required for compressing inputs larger than 2 GiB." Can't we compress big files with OpenZL? I can't find anything about this error in the documentation.