frontpage.

LIGO detects most massive black hole merger to date

https://www.caltech.edu/about/news/ligo-detects-most-massive-black-hole-merger-to-date
128•Eduard•3h ago•58 comments

RFC: PHP license update

https://wiki.php.net/rfc/php_license_update
85•josephwegner•1h ago•23 comments

Apple's MLX adding CUDA support

https://github.com/ml-explore/mlx/pull/1983
60•nsagent•1h ago•28 comments

DEWLine Museum – The Distant Early Warning Radar Line

https://dewlinemuseum.com/
9•reaperducer•51m ago•0 comments

Kiro: A new agentic IDE

https://kiro.dev/blog/introducing-kiro/
623•QuinnyPig•8h ago•272 comments

NeuralOS: An operating system powered by neural networks

https://neural-os.com/
56•yuntian•3h ago•20 comments

Show HN: Bedrock – An 8-bit computing system for running programs anywhere

https://benbridle.com/projects/bedrock.html
37•benbridle•4d ago•7 comments

Cognition (Devin AI) to Acquire Windsurf

https://cognition.ai/blog/windsurf
317•alazsengul•5h ago•253 comments

Replicube: 3D shader puzzle game, online demo

https://replicube.xyz/staging/
64•inktype•3d ago•11 comments

Context Rot: How increasing input tokens impacts LLM performance

https://research.trychroma.com/context-rot
40•kellyhongsn•3h ago•8 comments

Anthropic, Google, OpenAI and XAI Granted Up to $200M from Defense Department

https://www.cnbc.com/2025/07/14/anthropic-google-openai-xai-granted-up-to-200-million-from-dod.html
80•ChrisArchitect•2h ago•53 comments

Building Modular Rails Applications: A Deep Dive into Rails Engines

https://www.panasiti.me/blog/modular-rails-applications-rails-engines-active-storage-dashboard/
111•giovapanasiti•7h ago•26 comments

SQLite async connection pool for high-performance

https://github.com/slaily/aiosqlitepool
32•slaily•3d ago•17 comments

Show HN: The HTML Maze – Escape an eerie labyrinth built with HTML pages

https://htmlmaze.com/
20•kyrylo•2h ago•2 comments

Cidco MailStation as a Z80 Development Platform (2019)

https://jcs.org/2019/05/03/mailstation
36•robin_reala•5h ago•3 comments

Embedding user-defined indexes in Apache Parquet

https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
81•jasim•6h ago•10 comments

Strategies for Fast Lexers

https://xnacly.me/posts/2025/fast-lexer-strategies/
116•xnacly•8h ago•41 comments

Japanese grandparents create life-size Totoro with bus stop for grandkids (2020)

https://mymodernmet.com/totoro-sculpture-bus-stop/
222•NaOH•7h ago•54 comments

Meticulous (YC S21) is hiring in UK to redefine software dev

https://tinyurl.com/join-meticulous
1•Gabriel_h•6h ago

Lightning Detector Circuits

https://techlib.com/electronics/lightningnew.htm
64•nateb2022•8h ago•35 comments

Data brokers are selling flight information to CBP and ICE

https://www.eff.org/deeplinks/2025/07/data-brokers-are-selling-your-flight-information-cbp-and-ice
382•exiguus•7h ago•184 comments

Tandy Corporation, Part 3 Becoming IBM Compatible

https://www.abortretry.fail/p/tandy-corporation-part-3
50•klelatti•3d ago•13 comments

East Asian aerosol cleanup has likely contributed to global warming

https://www.nature.com/articles/s43247-025-02527-3
144•defrost•13h ago•153 comments

Two guys hated using Comcast, so they built their own fiber ISP

https://arstechnica.com/tech-policy/2025/07/two-guys-hated-using-comcast-so-they-built-their-own-fiber-isp/
258•LorenDB•7h ago•166 comments

Impacts of adding PV solar system to internal combustion engine vehicles

https://www.jstor.org/stable/26169128
97•red369•12h ago•208 comments

The Corset X-Rays of Dr Ludovic O'Followell (1908)

https://publicdomainreview.org/collection/the-corset-x-rays-of-dr-ludovic-o-followell-1908/
21•healsdata•3d ago•1 comment

It took 45 years, but spreadsheet legend Mitch Kapor finally got his MIT degree

https://www.bostonglobe.com/2025/06/24/business/mitch-kapor-mit-degree-bill-aulet/
151•bookofjoe•3d ago•14 comments

Lossless Float Image Compression

https://aras-p.info/blog/2025/07/08/Lossless-Float-Image-Compression/
85•ingve•4d ago•10 comments

Why random selection is necessary to create stable meritocratic institutions

https://assemblingamerica.substack.com/p/there-is-no-meritocracy-without-lottocracy
193•namlem•7h ago•175 comments

A Century of Quantum Mechanics

https://home.cern/news/news/physics/century-quantum-mechanics
100•bookofjoe•4d ago•77 comments

Embedding user-defined indexes in Apache Parquet

https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
81•jasim•6h ago

Comments

jasim•5h ago
I think this post is a response to some new file format initiatives, based on the criticism that the Parquet file format is showing its age.

One of the arguments is that there is no standardized way to extend Parquet with new kinds of metadata (like statistical summaries, HyperLogLog sketches, etc.)

This post was written by the DataFusion folks, who have shown a clever way to do this without breaking backward compatibility with existing readers.

They insert arbitrary data between the data pages and the footer, which other readers simply ignore but which query engines like DataFusion can exploit. They embed a new index into the .parquet file and use it to improve query performance.
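Why inserted bytes are invisible to existing readers: Parquet readers locate the footer from the tail of the file, then jump to data pages at offsets the footer records. A minimal sketch of that tail layout (byte contents here are placeholders, not real Thrift-encoded metadata):

```python
import struct

# Hypothetical stand-ins for a Parquet file's pieces. A real footer is
# Thrift-encoded FileMetaData; plain marker bytes are enough to show the layout.
MAGIC = b"PAR1"
data_pages = b"<data pages>"
footer = b"<file metadata>"

def build_file(extra=b""):
    # `extra` sits between the data pages and the footer -- exactly where
    # the article embeds its user-defined index.
    return MAGIC + data_pages + extra + footer + struct.pack("<I", len(footer)) + MAGIC

def read_footer(buf):
    # Readers work backwards from the end: last 4 bytes are the magic,
    # the 4 bytes before that are the footer length.
    assert buf[-4:] == MAGIC
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return buf[-8 - footer_len:-8]

plain = build_file()
with_index = build_file(extra=b"<user-defined index bytes>")
# The footer reads back identically whether or not extra bytes were embedded.
assert read_footer(plain) == read_footer(with_index) == footer
```

Since a reader only ever touches the footer and the page offsets it lists, any bytes outside those ranges are dead space to it.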

In this specific instance, they add an index containing all the distinct values of a column. They then extend the DataFusion query engine so that queries like `WHERE nation = 'Singapore'` can consult that index to determine whether the value exists in the .parquet file at all, without scanning the data pages (which are already partially skipped thanks to min-max filters over the dataset).
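The pruning logic itself is simple set membership. A toy sketch (file names and the in-memory dict are hypothetical; in the article the distinct set lives inside each .parquet file):

```python
# Hypothetical map from file to the distinct values of its `nation` column,
# as a distinct-values index would record them.
distinct_index = {
    "sales_1.parquet": {"Singapore", "Japan"},
    "sales_2.parquet": {"France", "Germany"},
}

def files_to_scan(predicate_value):
    # For WHERE nation = <value>: skip any file whose distinct set
    # provably lacks the value, without touching its data pages.
    return [f for f, values in distinct_index.items() if predicate_value in values]

assert files_to_scan("Singapore") == ["sales_1.parquet"]
assert files_to_scan("Brazil") == []
```

Unlike a min-max filter, which can only rule out values falling outside a range, a distinct-values index gives an exact yes/no answer for equality predicates.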

Also in general this is a really good deep dive into columnar data storage.

jjtheblunt•1h ago
nice summary!
dmvinson•1h ago
What are the new file format initiatives you're referencing here?

This solution seems clever overall, and finding a way to bolt on features of the latest-and-greatest new hotness without breaking backwards compatibility is a testament to the DataFusion team. Supporting legacy systems is crucial work, even if things need a ground-up rewrite periodically.

Nelkins•4h ago
Cool, but this is very specific to DataFusion, no? Is there any chance this would be standardized so other Parquet readers could leverage the same technique?
gdubya•4h ago
The technique can be applied by any engine, not just DataFusion. Each engine would have to know about the indexes in order to make use of them, but the fallback to parquet standard defaults means that the data is still readable by all.
aerzen•3h ago
But does DataFusion publish a specification of how this metadata can be read, along with a test suite for verifying implementations? Because if they don't, this can't be reliably used by any other impl.
jasim•3h ago
Parquet files include a field called key_value_metadata in the FileMetadata structure; it sits in the footer of the file. See: https://github.com/apache/parquet-format/blob/master/src/mai...

The technique described in the article seems to use this key-value metadata to store pointers to the additional metadata (in this case a distinct-values index) embedded in the file. Note that we can embed arbitrary binary data in the Parquet file between the data pages. This is perfectly valid since all Parquet readers rely on the exact offsets to the data pages specified in the footer.

This means that DataFusion does not need to specify how the metadata is interpreted. It is already well specified as part of the Parquet file format itself. DataFusion is an independent project -- it is a query execution engine for OLAP / columnar data, which can take in SQL statements, build query plans, optimize them, and execute them. It is an embeddable runtime with numerous ways for the host program to extend it. Parquet is a file format supported by DataFusion because it is one of the most popular ways of storing columnar data in object storages like S3.

Note that readers of Parquet need to be aware of any metadata to exploit it. But if not, nothing changes: as long as we're embedding only supplementary information like indices or bloom filters, a reader can continue working with the columnar data in Parquet as it used to; it just won't be able to take advantage of the additional metadata.
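The key names and the pointer scheme below are made up for illustration; the Parquet spec only defines the (key, value) string-pair container, not what goes in it:

```python
# Hypothetical footer entries, standing in for Parquet's key_value_metadata
# list of (key, value) string pairs. An engine that wrote an embedded index
# records where it lives; every other reader never looks these keys up.
key_value_metadata = {
    "distinct_index_offset": "1024",
    "distinct_index_length": "256",
}

def load_index(metadata, read_at):
    # An index-aware reader resolves the pointer via a byte-range read;
    # `read_at(offset, length)` is a placeholder for that I/O call.
    if "distinct_index_offset" not in metadata:
        return None  # plain Parquet file: fall back to normal scanning
    offset = int(metadata["distinct_index_offset"])
    length = int(metadata["distinct_index_length"])
    return read_at(offset, length)
```

The graceful degradation jasim describes falls out of the lookup: a file without the keys simply returns `None`, and the reader proceeds as if the index never existed.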

SiempreViernes•2h ago
So, can we take that as a "no"?
DAlperin•2h ago
The story here isn't that they've invented a new format for user-defined indexes (the one proposed here is sort of contrived and I probably wouldn't recommend it in production) but rather a demonstration of how the user-defined metadata space of the Parquet format can be used for application-specific purposes.

I work on a database engine that uses parquet as our on-storage file format and we make liberal use of the custom metadata area for things specific to our product that any other parquet readers would just ignore.

gregw2•40m ago
Note that there are "Puffin files" associated with Iceberg which have some overlap with this functionality: https://iceberg.apache.org/puffin-spec/#file-structure