Why is choral music harder to appreciate?

https://marginalrevolution.com/marginalrevolution/2025/08/why-is-choral-music-harder-to-appreciat...
11•surprisetalk•2d ago•2 comments

Git-Annex

https://git-annex.branchable.com/
5•keepamovin•54m ago•0 comments

Show HN: Sping – An HTTP/TCP latency tool that's easy on the eye

https://dseltzer.gitlab.io/sping/docs/
81•zorlack•5h ago•5 comments

Busy beaver hunters reach numbers that overwhelm ordinary math

https://www.quantamagazine.org/busy-beaver-hunters-reach-numbers-that-overwhelm-ordinary-math-202...
70•defrost•2d ago•13 comments

From Hackathon to YC

https://www.producthunt.com/p/april-yc-s25/from-hackathon-to-yc
12•rmason•7h ago•9 comments

The two versions of Parquet

https://www.jeronimo.dev/the-two-versions-of-parquet/
147•tanelpoder•3d ago•34 comments

We put a coding agent in a while loop

https://github.com/repomirrorhq/repomirror/blob/main/repomirror.md
168•sfarshid•12h ago•114 comments

Is 4chan the perfect Pirate Bay poster child to justify wider UK site-blocking?

https://torrentfreak.com/uk-govt-finds-ideal-pirate-bay-poster-boy-to-sell-blocking-of-non-pirate...
199•gloxkiqcza•12h ago•174 comments

German contest to live in depopulated Soviet-era city proves global hit

https://www.theguardian.com/world/2025/aug/21/german-contest-to-live-in-depopulated-soviet-era-ci...
37•c420•3d ago•36 comments

Y Combinator files brief supporting Epic Games, says store fees stifle startups

https://www.macrumors.com/2025/08/21/y-combinator-epic-games-amicus-brief/
128•greenburger•3d ago•114 comments

The Unix-Haters Handbook (1994) [pdf]

https://simson.net/ref/ugh.pdf
15•oliverkwebb•4h ago•2 comments

Ghrc.io appears to be malicious

https://bmitch.net/blog/2025-08-22-ghrc-appears-malicious/
280•todsacerdoti•5h ago•36 comments

Trees on city streets cope with drought by drinking from leaky pipes

https://www.newscientist.com/article/2487804-trees-on-city-streets-cope-with-drought-by-drinking-...
160•bookofjoe•2d ago•85 comments

Burner Phone 101

https://rebeccawilliams.info/burner-phone-101/
308•CharlesW•4d ago•124 comments

Making games in Go: 3 months without LLMs vs. 3 days with LLMs

https://marianogappa.github.io/software/2025/08/24/i-made-two-card-games-in-go/
270•maloga•14h ago•190 comments

Uncle Sam shouldn't own Intel stock

https://www.wsj.com/opinion/uncle-sam-shouldnt-own-intel-stock-ccd6986d
103•aspenmayer•7h ago•111 comments

A Brilliant and Nearby One-off Fast Radio Burst Localized to 13 pc Precision

https://iopscience.iop.org/article/10.3847/2041-8213/adf62f
55•gnabgib•9h ago•8 comments

Show HN: Decentralized Bitcoin Incentives via QR Codes

https://github.com/DT7QR/Bitcoin-Rewards-System-Proposal
8•Yodan2025•3h ago•0 comments

Everything I know about good API design

https://www.seangoedecke.com/good-api-design/
229•ahamez•10h ago•84 comments

Bash Strict Mode (2014)

http://redsymbol.net/articles/unofficial-bash-strict-mode/
32•dcminter•2d ago•26 comments

Cloudflare incident on August 21, 2025

https://blog.cloudflare.com/cloudflare-incident-on-august-21-2025/
154•achalshah•3d ago•32 comments

Show HN: Clearcam – Add AI object detection to your IP CCTV cameras

https://github.com/roryclear/clearcam
170•roryclear•17h ago•47 comments

How many paths of length K are there between A and B? (2021)

https://horace.io/walks
22•jxmorris12•9h ago•4 comments

Halt and Catch Fire Syllabus (2021)

https://bits.ashleyblewer.com/halt-and-catch-fire-syllabus/
121•Kye•8h ago•34 comments

My ZIP isn't your ZIP: Identifying and exploiting semantic gaps between parsers

https://www.usenix.org/conference/usenixsecurity25/presentation/you
48•layer8•3d ago•19 comments

Claim: GPT-5-pro can prove new interesting mathematics

https://twitter.com/SebastienBubeck/status/1958198661139009862
129•marcuschong•4d ago•86 comments

How to check if your Apple Silicon Mac is booting securely

https://eclecticlight.co/2025/08/21/how-to-check-if-your-apple-silicon-mac-is-booting-securely/
63•shorden•5h ago•13 comments

Show HN: I Built a XSLT Blog Framework

https://vgr.land/content/posts/20250821.xml
41•vgr-land•11h ago•16 comments

Comet AI browser can get prompt injected from any site, drain your bank account

https://twitter.com/zack_overflow/status/1959308058200551721
506•helloplanets•13h ago•177 comments

NASA's Juno mission leaves legacy of science at Jupiter

https://www.scientificamerican.com/article/how-nasas-juno-probe-changed-everything-we-know-about-...
68•apress•3d ago•29 comments

The two versions of Parquet

https://www.jeronimo.dev/the-two-versions-of-parquet/
147•tanelpoder•3d ago

Comments

crmd•8h ago
I am saying this as a lifelong supporter and user of open source software: issues like this are why governments and enterprises still run on Oracle and SQL Server.

The author was able to rollback his changes, but in some industries an unplanned enterprise-wide data unavailability event means the end of your career at that firm, if you don’t have a CYA email from the vendor confirming you were good to go. That CYA email, and the throat to choke, is why Oracle does 7 and 8 figure licensing deals with enterprises selling inferior software solutions versus open source options.

It seems that Linux, through Linus’ leadership, has been able to solve this risk issue and fully displace commercial UNIX operating systems. I hope many other projects up and down the stack can have the same success.

duncanfwalker•8h ago
At the start of your comment I thought the 'issues like this' were going to be the 4 year discussions about what is and isn't core.
crmd•7h ago
So did I :-) but I think the concepts are related: Linus’ ability to shift into autocratic leadership mode when necessary seems to prevent issues like the 4 year indecisiveness on v2/core from compromising product quality to the point where Linux is trusted in a way that rivals commercial software.
moelf•7h ago
and that's why CERN is rolling their own file format, again, in 2025: https://cds.cern.ch/record/2923186
3eb7988a1663•7h ago
To be fair, CERN's needs do seem fairly niche: petabyte numeric datasets with all sorts of access patterns from researchers, for which they want to maintain compatible software forever.
moelf•7h ago
yeah except this new RNTuple thing is really really similar to Apache Arrow
forinti•7h ago
People keep using Oracle because they have a ton of code and migration would be too costly.

Oracle is not immune to software issues. In fact, this year I lost two weekends because of a buggy upgrade on the cloud that left my production cluster in a failed state.

chrismustcode•6h ago
A lot of these have business logic literally in the database built up over years.

It’s a mammoth task for them to migrate

reactordev•6h ago
Oracle Consulting gladly built it all as stored procs with a UI.
jtbaker•6h ago
> built

billed

reactordev•5h ago
annually
taneq•6h ago
It’s not about being immune to software issues. It’s about having a vendor to cop the blame if something goes wrong.
1a527dd5•6h ago
Polite disagree; governments and enterprises remain on Oracle / SQL Server because migrating off is borderline Sisyphean. It can be done (we are doing it) but it requires a team doing it non-stop. It's horrible work.
atombender•6h ago
Sorry, I think you misunderstood this article.

When the author talks about rolling back his changes, he's referring not to a database but to a version of his library. If someone tried to use his new version, I assume the only thing that would have gone wrong is that their code wouldn't work because Pandas didn't support the format.

This article is about how a new version of the Parquet format hasn't been widely adopted, and so now the Parquet community is in a split state where different forces are pulling the format in two directions, and this happens to be caused by two different areas of focus that don't need to be tightly coupled together.

I don't see how the problems the article discusses relate to the reliability of software.

viccis•8h ago
Yeah, I had to wait years to really use Parquet effectively in Python code back in the 2010s because there were two main implementations (Pyarrow and Fastparquet), and they were neither compatible with each other nor with Spark. Parquet support is much like JavaScript support in browsers: you only get to use the more advanced features when they are supported compatibly on every platform you expect them to be used on.
lowbloodsugar•8h ago
When working with your own datasets, v2 is a must. If you are willing to make trade offs you can get insane compression and speed.
ted_dunning•4h ago
Why doesn't this show in the examples in the article? Do you have examples?
sighansen•8h ago
As long as Iceberg and Delta Lake don't support v2, adoption will be really hard. I'm working a lot with parquet and wasn't even aware that there is a version 2.0.
lolive•7h ago
Why wouldn't they adopt the v2.0?
mr_toad•3h ago
Version 1 took about ten years before it became de rigueur. Version 2 is hot off the press.
sbassi•7h ago
Shameless plug: made a parquet conversion utility: pip install parquetconv

It is a command-line wrapper that generates a Pandas DataFrame and saves it as CSV (or the other way around)

1a527dd5•6h ago
https://www.jeronimo.dev/the-two-versions-of-parquet/#perfor...

The first paragraph under that heading has a markdown error:

    which I hadn’t considered in [my previous post on compression algorithms]](/compression-algorithms-parquet/).
adrian17•6h ago
I was quite confused when I learned that the spec technically supports metadata about whether the data is already pre-sorted by some column(s); in my eyes it seemed like it would allow some no-brainer optimizations. And yet, last I checked, it looked like pretty much nothing actually uses it, and some libraries don't even read this field at all.
mgaunard•6h ago
Arrow defaults to v2.6, and I've seen a few places downgrade to 2.4 for compatibility.

Never seen any v1 in the wild.

ayhanfuat•3m ago
They are mostly talking about new encodings and those are controlled by data_page_version which still defaults to v1. The one you are talking about is about schema and types. I guess that is easier to handle.
blntechie•6h ago
What does DuckDB mean by query engines here? Something like Trino?
dkdcio•3h ago
some sources:

- https://voltrondata.com/codex/a-new-frontier (links out to others)
- https://wesmckinney.com/blog/looking-back-15-years/

in short you can think of a DB as at least 3 decoupled subsystems: UI, compute (query engine), storage. DuckDB has a query engine and storage format, and several UIs (SQL, Python, etc.). Trino is only a query engine (and UIs, everything has UIs). Polars has a query engine. DataFusion is a query engine (and other things). Spark is a query engine. pandas has a query engine

typically query engines are tightly coupled with the overall “product”, but increasingly compute, data (and even more recently via DuckLake metadata), and UI are decoupled allowing you to mix and match parts for a “decomposed database” architecture

quick disclaimer: I worked at Voltron Data but it’s a dead company walking, not trying to advertise for them by any means but the doc I linked is very well written with good information IMO

nly•6h ago
This sort of problem is common with file formats that reach popularity at (or before) v1 and then don't iterate quickly.

"Simple Binary Encoding" v2 has been stuck at release candidate stage for over 5 years and has flaws that mean it'll probably never be worth adopting.

arecurrence•5h ago
Sounds similar to HDMI allowing varying levels of specification completeness to all be called the same thing.
mr_toad•3h ago
The Hitchhiker's Guide to the Galaxy defines the marketing division of the Sirius Cybernetics Corporation as "a bunch of mindless jerks who'll be the first against the wall when the revolution comes".
quotemstr•3h ago
> Although this post might seem like a critique of Parquet, that is not my intention. I am simply documenting what I have learned and explaining the challenges maintainers of an open format face when evolving it. All the benefits and utilities that a format like Parquet has far outweigh these inconveniences.

Yes, it is a critique (or at least its user community). It's a critique that's 100% justified too.

Have we all been so conditioned by corporate training that we've lost the ability to say "hey, this sucks" when it _does_ in fact suck?

We all lose when people communicate unclearly. Here, the people holding back evolution of the format do need to be critiqued, and named, and shamed, and the author shouldn't have been so shy about doing it.

willtemperley•2h ago
The reference implementation for Parquet is a gigantic Java library. I'm unconvinced this is a good idea.

Take the RLE encoding which switches between run-length encoding and bit-packing. The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length.

I just can't believe this is optimal except in maybe very specific CPU-only cases (e.g. Parquet-Java running on a giant cluster somewhere).

If it were just bit-packing I could easily offload a whole data page to a GPU and not care about having per-bitwidth optimised implementations, but having to switch encodings at random intervals just makes this a headache.

It would be really nice if actual design documents exist that specify why this is a good idea based on real-world data patterns.
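For illustration, a simplified Python sketch of the hybrid decoding being described (assumes well-formed input and LSB-first bit packing per the Parquet spec; the real Parquet-Java implementation, with its generated per-bitwidth readers, looks nothing like this):

```python
def decode_rle_bitpack_hybrid(buf: bytes, bit_width: int, count: int) -> list:
    """Decode Parquet's RLE / bit-packing hybrid encoding (simplified)."""
    out, i = [], 0
    while len(out) < count:
        # each run starts with a ULEB128 header; the low bit selects the
        # run type, which is why runs "switch at random intervals"
        header, shift = 0, 0
        while True:
            b = buf[i]
            i += 1
            header |= (b & 0x7F) << shift
            if not b & 0x80:
                break
            shift += 7
        if header & 1:
            # bit-packed run: (header >> 1) groups of 8 values,
            # each group occupying bit_width bytes, packed LSB-first
            groups = header >> 1
            nbytes = groups * bit_width
            bits = int.from_bytes(buf[i:i + nbytes], "little")
            i += nbytes
            mask = (1 << bit_width) - 1
            for n in range(groups * 8):
                out.append((bits >> (n * bit_width)) & mask)
        else:
            # RLE run: one fixed-width value repeated (header >> 1) times
            run_len = header >> 1
            nbytes = (bit_width + 7) // 8
            value = int.from_bytes(buf[i:i + nbytes], "little")
            i += nbytes
            out.extend([value] * run_len)
    return out[:count]


# RLE run: header 20 = (10 << 1), i.e. the value 5 repeated 10 times
assert decode_rle_bitpack_hybrid(bytes([20, 5]), 3, 10) == [5] * 10

# bit-packed run: header 3 = (1 << 1) | 1, one group of 8 one-bit values
assert decode_rle_bitpack_hybrid(bytes([3, 0xAA]), 1, 8) == [0, 1] * 4
```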

willtemperley•1h ago
Addendum: if something is actually decoded by RunLengthBitPackingHybridDecoder but you call the encoding RLE, that's probably because it was a bad idea in the first place. Plus it makes it really hard to search for.
Uehreka•1h ago
Only one can win. Only one can be allowed to live. To decide which one, I hereby convene…

(slams gavel)

Parquet Court.