Embedding user-defined indexes in Apache Parquet

https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
144•jasim•6mo ago

Comments

jasim•6mo ago
I think this post is a response to some new file format initiatives, based on the criticism that the Parquet file format is showing its age.

One of the arguments is that there is no standardized way to extend Parquet with new kinds of metadata (like statistical summaries, HyperLogLog etc.)

This post was written by the DataFusion folks, who have shown a clever way to do this without breaking backward compatibility with existing readers.

They have inserted arbitrary data between the footer and the data pages, which other readers will ignore but which query engines like DataFusion can exploit. They embed a new index in the .parquet file and use it to improve query performance.

In this specific instance, they add an index with all the distinct values of a column. Then they extend the DataFusion query engine to exploit that so that queries like `WHERE nation = 'Singapore'` can use that index to figure out whether the value exists in that .parquet file without having to scan the data pages (which is already optimized because there is a min-max filter to avoid scanning the entire dataset).
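
To make the difference between the two kinds of pruning concrete, here is a minimal sketch. The `FileSummary` class, the file names, and the country values are all made up for illustration; this is not DataFusion's actual API or data structures, just the idea of checking a per-file set of distinct values before touching any data pages.

```python
from dataclasses import dataclass, field


@dataclass
class FileSummary:
    """Hypothetical per-file summary; not DataFusion's real structures."""
    name: str
    min_value: str
    max_value: str
    distinct_values: set = field(default_factory=set)


def may_contain_minmax(f: FileSummary, value: str) -> bool:
    # Min-max pruning: the value can only be present if it lies in [min, max].
    return f.min_value <= value <= f.max_value


def may_contain_distinct(f: FileSummary, value: str) -> bool:
    # Distinct-value pruning: exact membership check against the index.
    return value in f.distinct_values


files = [
    FileSummary("a.parquet", "Brazil", "Vietnam", {"Brazil", "India", "Vietnam"}),
    FileSummary("b.parquet", "Canada", "Thailand", {"Canada", "Singapore", "Thailand"}),
]

target = "Singapore"
# Both files pass the min-max check, because "Singapore" sorts inside both ranges.
minmax_survivors = [f.name for f in files if may_contain_minmax(f, target)]
# Only the file whose distinct-value set really holds the value survives.
distinct_survivors = [f.name for f in files if may_contain_distinct(f, target)]
```

In this toy setup the min-max filter cannot rule out `a.parquet`, while the distinct-value index can, which is the kind of extra pruning the post is after.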

Also in general this is a really good deep dive into columnar data storage.

jjtheblunt•6mo ago
nice summary!
dmvinson•6mo ago
What are the new file format initiatives you're referencing here?

This solution seems clever overall, and finding a way to bolt on features of the latest-and-greatest new hotness without breaking backwards compatibility is a testament to the DataFusion team. Supporting legacy systems is crucial work, even if things need a ground-up rewrite periodically.

dkdcio•6mo ago
Lance (from LanceDB folks), Nimble (from Meta folks, formerly known as Alpha); I think there are a few others

https://github.com/lancedb/lance

https://github.com/facebookincubator/nimble

kernelsanderz•6mo ago
I’ve been excited about lancedb and its ability to support vector indexes and efficient row level lookups. I wonder if this approach would work for their design goals and still allow broader backwards compatibility with the parquet ecosystem. Have been intrigued by Ducklake, and they’ve leaned into parquet. Perhaps this approach will allow more flexible indexing approaches with support for the broader parquet ecosystem which is significant.
MasterIdiot•6mo ago
Off the top of my head:

- Vortex https://github.com/vortex-data/vortex

- Lance https://github.com/lancedb/lance

- Nimble https://github.com/facebookincubator/nimble

There are also a bunch of ideas coming out of academia, but I don't know how many of them have a sustained effort behind them and not just a couple of papers

lmeyerov•6mo ago
Yeah I'm happy to see this, we have been curious as part of figuring out cloud native storage extensions to GFQL (graph dataframe-native query lang), and my intuition was parquet was pluggable here... And this is the first I'm seeing a cogent writeup.

Likewise, this means, afaict, it's pretty straightforward to do novel indexing schemes within Iceberg as well just by reusing this.

The other aspect I've been curious about is the happy path for pluggable types for custom columns. This shows one way, but I'm unclear if it's the same thing.

jasim•6mo ago
I'm not sure if this is what you're looking for, but there is a proposal in DataFusion to allow user defined types. https://github.com/apache/datafusion/issues/12644
lmeyerov•6mo ago
Thank you, looking forward to reading!
alamb•6mo ago
We are actively working on supporting extension types. The mechanism is likely to be using the Arrow extension type mechanism (a logical annotation on top of existing Arrow types https://arrow.apache.org/docs/format/Columnar.html#format-me...)

I expect this to be used to support Variant https://github.com/apache/datafusion/issues/16116 and geometry types

(note I am an author)

hodgesrm•6mo ago
One question that the article does not cover: compaction. Adding custom indexes means you have to have knowledge of the indexes to compact Parquet files, since you'll want to reindex each time compaction occurs. Otherwise the indexes will at best be discarded. At worst they would even be corrupted.

So it looks as if adopting custom indexes means you are adopting not just a particular engine for reading but also a particular engine for compaction. That in turn means you can't use generic mechanisms like the compaction mechanism in S3 table buckets. Am I missing something?
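
A toy sketch of the concern, assuming files are modelled as plain dicts with a made-up `distinct_index` key (nothing here is a real Parquet or S3 API): a compactor that only knows about rows silently drops the index, while one that knows about it has to rebuild it from the merged data.

```python
def generic_compact(files: list) -> dict:
    # Index-unaware compactor: copies rows and knows nothing about "distinct_index".
    rows = [row for f in files for row in f["rows"]]
    return {"rows": rows}


def index_aware_compact(files: list) -> dict:
    # Index-aware compactor: rebuilds the index from the merged rows.
    merged = generic_compact(files)
    merged["distinct_index"] = sorted(set(merged["rows"]))
    return merged


small_files = [
    {"rows": ["India", "Brazil"], "distinct_index": ["Brazil", "India"]},
    {"rows": ["Singapore", "India"], "distinct_index": ["India", "Singapore"]},
]

plain = generic_compact(small_files)        # rows survive, index is gone
rebuilt = index_aware_compact(small_files)  # rows survive, index is regenerated
```

The generic path corresponds to the "at best discarded" case; the data stays readable, but the index has to be recreated by something that understands it.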

deepsun•6mo ago
My main problem with Parquet format is that it depends on Facebook's Thrift (competitor to gRPC).
Nelkins•6mo ago
Cool, but this is very specific to DataFusion, no? Is there any chance this would be standardized so other Parquet readers could leverage the same technique?
gdubya•6mo ago
The technique can be applied by any engine, not just DataFusion. Each engine would have to know about the indexes in order to make use of them, but the fallback to parquet standard defaults means that the data is still readable by all.
aerzen•6mo ago
But does DataFusion publish a specification of how this metadata can be read, along with a test suite for verifying implementations? Because if they don't, this cannot be reliably used by any other impl.
jasim•6mo ago
Parquet files include a field called key_value_metadata in the FileMetadata structure; it sits in the footer of the file. See: https://github.com/apache/parquet-format/blob/master/src/mai...

The technique described in the article seems to use this key-value pair to store pointers to the additional metadata (in this case a distinct index) embedded in the file. Note that we can embed arbitrary binary data in the Parquet file between each data page. This is perfectly valid since all Parquet readers rely on the exact offsets to the data pages specified in the footer.

This means that DataFusion does not need to specify how the metadata is interpreted. It is already well specified as part of the Parquet file format itself. DataFusion is an independent project -- it is a query execution engine for OLAP / columnar data, which can take in SQL statements, build query plans, optimize them, and execute them. It is an embeddable runtime with numerous ways to extend it by the host program. Parquet is a file format supported by DataFusion because it is one of the most popular ways of storing data in a columnar way in object storages like S3.

Note that the readers of Parquet need to be aware of any metadata to exploit it. But if not, nothing changes - as long as we're embedding only supplementary information like indices or bloom filters, a reader can still continue working with the columnar data in Parquet as it used to; it is just that it won't be able to take advantage of the additional metadata.
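
Here is a rough sketch of why this works, using a toy container rather than real Parquet. The layout loosely mirrors the idea (pages, then a footer, then the footer length and a magic), but the `TOY1` magic, the JSON footer, and the `toy_index_offset` / `toy_index_length` keys are all assumptions for illustration; real Parquet encodes its footer with Thrift.

```python
import io
import json
import struct
from typing import Optional

MAGIC = b"TOY1"  # stand-in magic for this toy format, not Parquet's


def write_toy_file(pages: list, index_blob: bytes) -> bytes:
    buf = io.BytesIO()
    buf.write(MAGIC)
    page_locations = []
    for page in pages:
        page_locations.append({"offset": buf.tell(), "length": len(page)})
        buf.write(page)
    # Extra bytes that no page offset points at: the embedded index.
    index_offset = buf.tell()
    buf.write(index_blob)
    footer = {
        "pages": page_locations,
        # Hypothetical key names; only readers that know them will look here.
        "key_value_metadata": {
            "toy_index_offset": str(index_offset),
            "toy_index_length": str(len(index_blob)),
        },
    }
    footer_bytes = json.dumps(footer).encode("utf-8")
    buf.write(footer_bytes)
    buf.write(struct.pack("<I", len(footer_bytes)))  # 4-byte little-endian length
    buf.write(MAGIC)
    return buf.getvalue()


def read_footer(data: bytes) -> dict:
    assert data[:4] == MAGIC and data[-4:] == MAGIC
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    return json.loads(data[-8 - footer_len:-8])


def read_pages(data: bytes) -> list:
    # Index-unaware reader: follows page offsets only and never sees the index.
    footer = read_footer(data)
    return [data[p["offset"]:p["offset"] + p["length"]] for p in footer["pages"]]


def read_index(data: bytes) -> Optional[bytes]:
    # Index-aware reader: looks up the pointer in the key-value metadata.
    kv = read_footer(data)["key_value_metadata"]
    if "toy_index_offset" not in kv:
        return None
    start = int(kv["toy_index_offset"])
    return data[start:start + int(kv["toy_index_length"])]


blob = write_toy_file([b"page-0", b"page-1"], b"Singapore\nThailand")
pages_seen = read_pages(blob)
index_seen = read_index(blob)
```

The reader that only follows page offsets gets the same pages whether or not the index bytes are present, and only a reader that knows the metadata keys goes looking for the extra blob.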

SiempreViernes•6mo ago
So, can we take that as a "no"?
gazpacho•6mo ago
There is no spec. Personally I hope that the existing indexes (bloom filters, zone maps) get re-designed to fit into a paradigm where parquet itself has more first class support for multiple levels of indexes embedded in the file and conventions for those common types. That is, start with Wild West and define specs as needed.
alamb•6mo ago
> That is, start with Wild West and define specs as needed

Yes this is my personal hope as well -- if there are new index types that are widespread, they can be incorporated formally into the spec

However, changing the spec is a non trivial process and requires significant consensus and engineering

Thus the methods described in the blog can be used to add indexes prior to any spec change, and potentially as a way to prototype / prove out new potential indexes

(note I am an author)

alamb•6mo ago
> Note that the readers of Parquet need to be aware of any metadata to exploit it. But if not, nothing changes

The one downside of this approach, which is likely obvious but which I haven't seen mentioned, is that the resulting parquet files are larger than they would be otherwise, and the increased size only benefits engines that know how to interpret the new index

(I am an author)

DAlperin•6mo ago
The story here isn't that they've invented a new format for user defined indexes (the one proposed here is sort of contrived and I probably wouldn't recommend it in production) but rather that they've demonstrated how the user defined metadata space of the parquet format can be used for application specific purposes.

I work on a database engine that uses parquet as our on-storage file format and we make liberal use of the custom metadata area for things specific to our product that any other parquet readers would just ignore.

ethan_smith•6mo ago
The Arrow/Parquet community is already discussing standardization via the Parquet format GitHub - this approach intentionally uses existing extension points in the format specification to remain compatible while the standardization discussions progress.
gregw2•6mo ago
Note that there are "Puffin files" associated with Iceberg which have some overlap with this functionality: https://iceberg.apache.org/puffin-spec/#file-structure
DonHopkins•6mo ago
Speaking of Puffin files, Apache Parquet always makes me think of this 1978 SNL intro with Bill Murray and SNL bass player Buddy Williams:

https://snltranscripts.jt.org/77/77sparaquat.phtml