
Embedding user-defined indexes in Apache Parquet

https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
52•jasim•3h ago

Comments

jasim•2h ago
I think this post is a response to some new file format initiatives, based on the criticism that the Parquet file format is showing its age.

One of the arguments is that there is no standardized way to extend Parquet with new kinds of metadata (like statistical summaries, HyperLogLog etc.)

This post was written by the DataFusion folks, who have shown a clever way to do this without breaking backward compatibility with existing readers.

They have inserted arbitrary data between the footer and the data pages, which other readers will ignore, but which query engines like DataFusion can exploit. They embed a new index into the .parquet file and use it to improve query performance.

In this specific instance, they add an index with all the distinct values of a column. They then extend the DataFusion query engine so that queries like `WHERE nation = 'Singapore'` can use that index to figure out whether the value exists in that .parquet file without having to scan the data pages (a scan that is already optimized, because a min-max filter avoids reading the entire dataset).
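To make the difference between the two filters concrete, here is a minimal sketch in Python. It is not DataFusion code; `FileStats`, `may_match_min_max`, and `may_match_distinct` are hypothetical names, and the per-file summary is kept in memory instead of being read from a Parquet file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file summary for a single column (illustration only)."""
    min_value: str
    max_value: str
    distinct: frozenset


def may_match_min_max(stats: FileStats, value: str) -> bool:
    # A min-max filter can only rule out values outside the [min, max] range.
    return stats.min_value <= value <= stats.max_value


def may_match_distinct(stats: FileStats, value: str) -> bool:
    # A distinct-value index knows exactly which values occur in the file.
    return value in stats.distinct


stats = FileStats(
    min_value="Argentina",
    max_value="Vietnam",
    distinct=frozenset({"Argentina", "Brazil", "Vietnam"}),
)

# 'Singapore' sorts between 'Argentina' and 'Vietnam', so min-max cannot prune.
print(may_match_min_max(stats, "Singapore"))   # True
# The distinct index shows the value is absent, so the file can be skipped.
print(may_match_distinct(stats, "Singapore"))  # False
```

The min-max check is cheap but coarse, while the distinct index gives an exact answer at the cost of storing every distinct value.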

Also in general this is a really good deep dive into columnar data storage.

Nelkins•1h ago
Cool, but this is very specific to DataFusion, no? Is there any chance this would be standardized so other Parquet readers could leverage the same technique?

gdubya•49m ago
The technique can be applied by any engine, not just DataFusion. Each engine would have to know about the indexes in order to make use of them, but the fallback to Parquet standard defaults means that the data is still readable by all.

aerzen•39m ago
But does DataFusion publish a specification of how this metadata can be read, along with a test suite for verifying implementations? Because if they don't, this cannot be reliably used by any other implementation.

jasim•20m ago
Parquet files include a field called key_value_metadata in the FileMetadata structure; it sits in the footer of the file. See: https://github.com/apache/parquet-format/blob/master/src/mai...

The technique described in the article seems to use these key-value pairs to store pointers to the additional metadata (in this case a distinct index) embedded in the file. Note that we can embed arbitrary binary data in the Parquet file between each data page. This is perfectly valid, since all Parquet readers rely on the exact offsets to the data pages specified in the footer.
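A toy sketch of that layout in Python may help. It assumes a simplified stand-in for the real format: the footer is JSON, whereas real Parquet uses Thrift, and the key names `distinct_index_offset` and `distinct_index_length` are made up for illustration and are not the keys used in the article. Only the overall shape (magic bytes, data, extra bytes, footer, footer length, magic bytes) mirrors Parquet.

```python
import json
import struct

MAGIC = b"PAR1"
# Hypothetical metadata keys, chosen for this sketch only.
OFFSET_KEY = "distinct_index_offset"
LENGTH_KEY = "distinct_index_length"


def write_toy_file(data_pages: bytes, distinct_values: list) -> bytes:
    """Lay out: magic | data pages | index blob | footer | footer length | magic."""
    index_blob = json.dumps(sorted(distinct_values)).encode("utf-8")
    body = MAGIC + data_pages
    index_offset = len(body)  # the index starts right after the data pages
    footer = json.dumps({
        "data_offset": len(MAGIC),
        "data_length": len(data_pages),
        # Offsets are stored as strings in a key-value map.
        "key_value_metadata": {
            OFFSET_KEY: str(index_offset),
            LENGTH_KEY: str(len(index_blob)),
        },
    }).encode("utf-8")
    return body + index_blob + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(buf: bytes) -> dict:
    if buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("not a toy file")
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    footer_start = len(buf) - 8 - footer_len
    return json.loads(buf[footer_start:len(buf) - 8])


def read_data_pages(buf: bytes) -> bytes:
    """What any reader does: follow the footer's offsets, ignore everything else."""
    footer = read_footer(buf)
    start = footer["data_offset"]
    return buf[start:start + footer["data_length"]]


def read_distinct_index(buf: bytes):
    """What an index-aware reader does: follow the pointer in the metadata."""
    kv = read_footer(buf)["key_value_metadata"]
    if OFFSET_KEY not in kv:
        return None
    offset, length = int(kv[OFFSET_KEY]), int(kv[LENGTH_KEY])
    return set(json.loads(buf[offset:offset + length]))


buf = write_toy_file(b"...column data...", ["Singapore", "Brazil"])
print(read_data_pages(buf))
print(read_distinct_index(buf))
```

Because `read_data_pages` never looks outside the byte range named in the footer, the extra index bytes are invisible to it.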

This means that DataFusion does not need to specify how the metadata is interpreted. It is already well specified as part of the Parquet file format itself. DataFusion is an independent project -- it is a query execution engine for OLAP / columnar data, which can take in SQL statements, build query plans, optimize them, and execute them. It is an embeddable runtime with numerous ways for the host program to extend it. Parquet is a file format supported by DataFusion because it is one of the most popular ways of storing columnar data in object stores like S3.

Note that the readers of Parquet need to be aware of any metadata to exploit it. But if not, nothing changes - as long as we're embedding only supplementary information like indices or bloom filters, a reader can still continue working with the columnar data in Parquet as it used to; it is just that it won't be able to take advantage of the additional metadata.
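The fallback behaviour can be sketched as a small decision function in Python. This is only an illustration: `file_may_contain`, the `load_index` callback, and the `distinct_index_offset` key are hypothetical names, not part of Parquet or DataFusion.

```python
from typing import Callable, Mapping, Set


def file_may_contain(
    key_value_metadata: Mapping[str, str],
    load_index: Callable[[Mapping[str, str]], Set[str]],
    value: str,
) -> bool:
    """Return False only when an embedded index proves the value is absent."""
    if "distinct_index_offset" not in key_value_metadata:
        # No supplementary index (or a reader that does not know about it):
        # nothing can be ruled out, so the file must be scanned as usual.
        return True
    return value in load_index(key_value_metadata)


# Hypothetical loader standing in for "seek to the offset and decode the index".
def fake_loader(kv: Mapping[str, str]) -> Set[str]:
    return {"Brazil", "Vietnam"}


# File without the extra metadata: the reader falls back to scanning.
print(file_may_contain({}, fake_loader, "Singapore"))  # True
# File with the extra metadata: the index lets the reader skip the file.
print(file_may_contain({"distinct_index_offset": "9"}, fake_loader, "Singapore"))  # False
```

The index is only ever used to skip work, never to produce results, so ignoring it changes performance but not query answers.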