
Reputation Scores for GitHub Accounts

https://shkspr.mobi/blog/2026/02/reputation-scores-for-github-accounts/
1•edent•34s ago•0 comments

A BSOD for All Seasons – Send Bad News via a Kernel Panic

https://bsod-fas.pages.dev/
1•keepamovin•4m ago•0 comments

Show HN: I got tired of copy-pasting between Claude windows, so I built Orcha

https://orcha.nl
1•buildingwdavid•4m ago•0 comments

Omarchy First Impressions

https://brianlovin.com/writing/omarchy-first-impressions-CEEstJk
1•tosh•9m ago•0 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
2•onurkanbkrc•10m ago•0 comments

Show HN: Versor – The "Unbending" Paradigm for Geometric Deep Learning

https://github.com/Concode0/Versor
1•concode0•10m ago•1 comments

Show HN: HypothesisHub – An open API where AI agents collaborate on medical res

https://medresearch-ai.org/hypotheses-hub/
1•panossk•13m ago•0 comments

Big Tech vs. OpenClaw

https://www.jakequist.com/thoughts/big-tech-vs-openclaw/
1•headalgorithm•16m ago•0 comments

Anofox Forecast

https://anofox.com/docs/forecast/
1•marklit•16m ago•0 comments

Ask HN: How do you figure out where data lives across 100 microservices?

1•doodledood•16m ago•0 comments

Motus: A Unified Latent Action World Model

https://arxiv.org/abs/2512.13030
1•mnming•16m ago•0 comments

Rotten Tomatoes Desperately Claims 'Impossible' Rating for 'Melania' Is Real

https://www.thedailybeast.com/obsessed/rotten-tomatoes-desperately-claims-impossible-rating-for-m...
3•juujian•18m ago•2 comments

The protein denitrosylase SCoR2 regulates lipogenesis and fat storage [pdf]

https://www.science.org/doi/10.1126/scisignal.adv0660
1•thunderbong•20m ago•0 comments

Los Alamos Primer

https://blog.szczepan.org/blog/los-alamos-primer/
1•alkyon•22m ago•0 comments

NewASM Virtual Machine

https://github.com/bracesoftware/newasm
2•DEntisT_•25m ago•0 comments

Terminal-Bench 2.0 Leaderboard

https://www.tbench.ai/leaderboard/terminal-bench/2.0
2•tosh•25m ago•0 comments

I vibe coded a BBS bank with a real working ledger

https://mini-ledger.exe.xyz/
1•simonvc•25m ago•1 comments

The Path to Mojo 1.0

https://www.modular.com/blog/the-path-to-mojo-1-0
1•tosh•28m ago•0 comments

Show HN: I'm 75, building an OSS Virtual Protest Protocol for digital activism

https://github.com/voice-of-japan/Virtual-Protest-Protocol/blob/main/README.md
5•sakanakana00•31m ago•1 comments

Show HN: I built Divvy to split restaurant bills from a photo

https://divvyai.app/
3•pieterdy•34m ago•0 comments

Hot Reloading in Rust? Subsecond and Dioxus to the Rescue

https://codethoughts.io/posts/2026-02-07-rust-hot-reloading/
3•Tehnix•34m ago•1 comments

Skim – vibe review your PRs

https://github.com/Haizzz/skim
2•haizzz•36m ago•1 comments

Show HN: Open-source AI assistant for interview reasoning

https://github.com/evinjohnn/natively-cluely-ai-assistant
4•Nive11•36m ago•6 comments

Tech Edge: A Living Playbook for America's Technology Long Game

https://csis-website-prod.s3.amazonaws.com/s3fs-public/2026-01/260120_EST_Tech_Edge_0.pdf?Version...
2•hunglee2•40m ago•0 comments

Golden Cross vs. Death Cross: Crypto Trading Guide

https://chartscout.io/golden-cross-vs-death-cross-crypto-trading-guide
3•chartscout•42m ago•1 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
3•AlexeyBrin•45m ago•0 comments

What the longevity experts don't tell you

https://machielreyneke.com/blog/longevity-lessons/
2•machielrey•46m ago•1 comments

Monzo wrongly denied refunds to fraud and scam victims

https://www.theguardian.com/money/2026/feb/07/monzo-natwest-hsbc-refunds-fraud-scam-fos-ombudsman
3•tablets•51m ago•1 comments

They were drawn to Korea with dreams of K-pop stardom – but then let down

https://www.bbc.com/news/articles/cvgnq9rwyqno
2•breve•53m ago•0 comments

Show HN: AI-Powered Merchant Intelligence

https://nodee.co
1•jjkirsch•56m ago•0 comments

The two versions of Parquet

https://www.jeronimo.dev/the-two-versions-of-parquet/
206•tanelpoder•5mo ago

Comments

crmd•5mo ago
I am saying this as a lifelong supporter and user of open source software: issues like this are why governments and enterprises still run on Oracle and SQL Server.

The author was able to roll back his changes, but in some industries an unplanned enterprise-wide data unavailability event means the end of your career at that firm if you don’t have a CYA email from the vendor confirming you were good to go. That CYA email, and the throat to choke, is why Oracle does 7- and 8-figure licensing deals with enterprises, selling software solutions that are inferior to the open source options.

It seems that Linux, through Linus’ leadership, has been able to solve this risk issue and fully displace commercial UNIX operating systems. I hope many other projects up and down the stack can have the same success.

duncanfwalker•5mo ago
At the start of your comment I thought the 'issues like this' were going to be the four-year discussions about what is and isn't core.
crmd•5mo ago
So did I :-) but I think the concepts are related: Linus’ ability to shift into autocratic leadership mode when necessary seems to prevent issues like the four-year indecisiveness on v2/core from compromising product quality, which is why Linux is trusted in a way that rivals commercial software.
duncanfwalker•5mo ago
+1 you're paying for the governance as much as you're paying for the code.
crmd•5mo ago
Well said, thank you
moelf•5mo ago
and it's why CERN is rolling their own file format, again, in 2025: https://cds.cern.ch/record/2923186
3eb7988a1663•5mo ago
To be fair, CERN's needs do seem fairly niche: petabyte-scale numeric datasets with all sorts of access patterns from researchers, all of which they want to keep software-compatible forever.
moelf•5mo ago
yeah except this new RNTuple thing is really really similar to Apache Arrow
forinti•5mo ago
People keep using Oracle because they have a ton of code and migration would be too costly.

Oracle is not immune to software issues. In fact, this year I lost two weekends because of a buggy upgrade on the cloud that left my production cluster in a failed state.

chrismustcode•5mo ago
A lot of these have business logic literally in the database built up over years.

It’s a mammoth task for them to migrate

reactordev•5mo ago
Oracle Consulting gladly built it all as stored procs with a UI.
jtbaker•5mo ago
> built

billed

reactordev•5mo ago
annually
taneq•5mo ago
It’s not about being immune to software issues. It’s about having a vendor to cop the blame if something goes wrong.
forinti•5mo ago
It doesn't do me any good. All I can get from Oracle is the possibility of opening a support ticket and then having to send them tons of log files. If I use max severity, they expect me to be available 24/7; any other severity means they'll take weeks to look at it.

Most times I prefer to wade through the knowledge base until I find a solution.

1a527dd5•5mo ago
Polite disagree; governments and enterprises remain on Oracle / SQL Server because migrating off them is borderline Sisyphean. It can be done (we are doing it) but it requires a team doing it non-stop. It's horrible work.
atombender•5mo ago
Sorry, I think you misunderstood this article.

When the author talks about rolling back his changes, he is not referring to a database but to a version of his library. If someone had tried to use his new version, I assume the only thing that would have gone wrong is that their code wouldn't work because Pandas didn't support the format.

This article is about how a new version of the Parquet format hasn't been widely adopted, so the Parquet community is now in a split state where different forces are pulling the format in two directions, and this happens to be caused by two different areas of focus that don't need to be tightly coupled together.

I don't see how the problems the article discusses relate to the reliability of software.

kristianp•5mo ago
I think the GP understood the article. They are talking about people's software breaking when the author switched his library to v2 of Parquet.
atombender•5mo ago
This is a small Java library used for data science/engineering purposes, and the upgrade would stop it from being able to read Parquet 2 files. If that causes an "unplanned enterprise-wide data unavailability event", that is the fault of the application developer that chose to upgrade their dependencies, not the library author. Furthermore, you could say the same things about any third-party library in the world, so drawing the connection to big vendors like Oracle is a non sequitur at best.
rbanffy•5mo ago
> The author was able to rollback his changes, but in some industries an unplanned enterprise-wide data unavailability event means the end of your career at that firm

If a (major) software update causes you an outage, you shouldn’t blame the software but insufficient testing and validation. Large companies (I have worked for many) are slow to adopt new technologies precisely because they are extremely cautious and want to make sure everything has been properly tested before they roll it out. That’s also why they still use Oracle and SQL Server (and HP-UX, and IBM i) - these products work and have been working for generations of employees. The grass needs to be significantly greener for them to consider the move to the other side of their fence.

viccis•5mo ago
Yeah, I had to wait years to really use Parquet effectively in Python code back in the 2010s, because there were two main implementations (PyArrow and fastparquet), and they were neither compatible with each other nor with Spark. Parquet support is much like JavaScript support in browsers: you only get to use the more advanced features when they are supported compatibly on every platform where you expect them to be used.
lowbloodsugar•5mo ago
When working with your own datasets, v2 is a must. If you are willing to make trade-offs, you can get insane compression and speed.
ted_dunning•5mo ago
Why doesn't this show in the examples in the article? Do you have examples?
lowbloodsugar•5mo ago
A couple of examples I can think of off the top of my head, from recording logs for analysis: 1. It might be better to buffer up logs, then sort on a different column than time; you may benefit from delta encoding or prefix encoding. 2. If you have tracking info, which is usually something random like a UUID, then ditch it. You're not debugging with this dataset, so don't waste the space on a crazy high-noise column. Shit like that.
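
A rough pyarrow sketch of those two tips (the file and column names below are made up, not from the commenter's setup; dropping the random ID column and sorting on a low-cardinality column is what lets the v2 encodings pay off):

    import pyarrow.parquet as pq

    # Sketch only: "logs.parquet", "trace_id" and "status_code" are
    # illustrative names. Drop the high-entropy tracking column, sort on
    # something other than time, then write with v2 data pages.
    logs = pq.read_table("logs.parquet")
    logs = logs.drop(["trace_id"])
    logs = logs.sort_by("status_code")
    pq.write_table(logs, "logs_v2.parquet", data_page_version="2.0")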
sighansen•5mo ago
As long as Iceberg and Delta Lake don't support v2, adoption will be really hard. I work a lot with Parquet and wasn't even aware that there is a version 2.0.
lolive•5mo ago
Why wouldn't they adopt the v2.0?
mr_toad•5mo ago
Version 1 took about ten years before it became de rigueur. Version 2 is hot off the press.
lolive•5mo ago
From what I remember, when Unicode arrived [i.e. ages ago], I bet $10 it would never succeed. Now that it is reasonably supported everywhere [and I lost my $10], I am more confident that sometimes good ideas eventually win.

#callMeOptimist

sbassi•5mo ago
Shameless plug: made a parquet conversion utility: pip install parquetconv

It is a command-line wrapper that reads a Parquet file into a Pandas DataFrame and saves it as CSV (or the other way around).
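
Not parquetconv's actual code, but the core idea of such a wrapper is roughly this (a minimal sketch; paths come from the command line, error handling and options left out):

    import sys
    import pandas as pd

    # Round-trip a file through a pandas DataFrame:
    # Parquet -> CSV, or CSV -> Parquet.
    src, dst = sys.argv[1], sys.argv[2]
    if src.endswith(".parquet"):
        pd.read_parquet(src).to_csv(dst, index=False)
    else:
        pd.read_csv(src).to_parquet(dst, index=False)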

1a527dd5•5mo ago
https://www.jeronimo.dev/the-two-versions-of-parquet/#perfor...

The first paragraph under that heading has a markdown error:

    which I hadn’t considered in [my previous post on compression algorithms]](/compression-algorithms-parquet/).
adrian17•5mo ago
I was quite confused when I learned that the spec technically supports metadata about whether the data is already pre-sorted by some column(s); in my eyes it seemed like it would allow some no-brainer optimizations. And yet, last I checked, it looked like pretty much nothing actually uses it, and some libraries don't even read this field at all.
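
For reference, writing that sortedness metadata looks roughly like this in pyarrow (a sketch assuming a recent pyarrow that exposes SortingColumn; the file and column names are illustrative):

    import pyarrow.parquet as pq

    # Sort the data, then record which column it is sorted by in the
    # column-chunk metadata. Readers that ignore the field still work.
    table = pq.read_table("events.parquet").sort_by("timestamp")
    ts_index = table.schema.get_field_index("timestamp")
    pq.write_table(
        table,
        "events_sorted.parquet",
        sorting_columns=[pq.SortingColumn(ts_index)],
    )
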
mgaunard•5mo ago
Arrow defaults to v2.6, and I've seen a few places downgrade to 2.4 for compatibility.

Never seen any v1 in the wild.

ayhanfuat•5mo ago
The article is mostly talking about new encodings, and those are controlled by data_page_version, which still defaults to v1. The one you are talking about is about schema and types; I guess that is easier to handle.
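
The two knobs are separate parameters in pyarrow; a throwaway example showing both (the defaults noted in the comments reflect recent pyarrow releases):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # `version` is the format/schema level (logical types etc.) and
    # defaults to "2.6"; `data_page_version` is the data-page/encoding
    # level and still defaults to "1.0".
    table = pa.table({"x": [1, 2, 3]})
    pq.write_table(table, "v2_pages.parquet", version="2.6", data_page_version="2.0")
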
blntechie•5mo ago
What does DuckDB mean by query engines here? Something like Trino?
dkdcio•5mo ago
some sources:

- https://voltrondata.com/codex/a-new-frontier (links out to others)
- https://wesmckinney.com/blog/looking-back-15-years/

In short, you can think of a DB as at least three decoupled subsystems: UI, compute (query engine), and storage. DuckDB has a query engine and a storage format, and several UIs (SQL, Python, etc.). Trino is only a query engine (and UIs, everything has UIs). Polars has a query engine. DataFusion is a query engine (and other things). Spark is a query engine. pandas has a query engine.

Typically, query engines are tightly coupled with the overall “product”, but increasingly compute, data (and, even more recently via DuckLake, metadata), and UI are decoupled, allowing you to mix and match parts for a “decomposed database” architecture.

Quick disclaimer: I worked at Voltron Data, but it’s a dead company walking; I’m not trying to advertise for them by any means, but the doc I linked is very well written, with good information IMO.
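
To make the mix-and-match point concrete, a small sketch (file and column names are made up) with Parquet as the storage layer, DuckDB as the query engine, and pandas as the consuming "UI":

    import duckdb
    import pyarrow.parquet as pq

    # Storage layer: read Parquet into an Arrow table.
    events = pq.read_table("events.parquet")

    # Compute layer: DuckDB queries the Arrow table in place (its
    # replacement scan picks up the local variable by name).
    result = duckdb.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")

    # Hand the result to pandas, a different "UI" over the same data.
    df = result.df()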

nly•5mo ago
This sort of problem is common with file formats that reach popularity at or before v1 and then don't iterate quickly.

"Simple Binary Encoding" v2 has been stuck at release candidate stage for over 5 years and has flaws that mean it'll probably never be worth adopting.

arecurrence•5mo ago
Sounds similar to HDMI allowing varying levels of specification completeness to all be called the same thing.
mr_toad•5mo ago
The Hitchhiker's Guide to the Galaxy defines the marketing division of the Sirius Cybernetics Corporation as "a bunch of mindless jerks who'll be the first against the wall when the revolution comes."
quotemstr•5mo ago
> Although this post might seem like a critique of Parquet, that is not my intention. I am simply documenting what I have learned and explaining the challenges maintainers of an open format face when evolving it. All the benefits and utilities that a format like Parquet has far outweigh these inconveniences.

Yes, it is a critique of Parquet (or at least of its user community). It's a critique that's 100% justified, too.

Have we all been so conditioned by corporate training that we've lost the ability to say "hey, this sucks" when it _does_ in fact suck?

We all lose when people communicate unclearly. Here, the people holding back evolution of the format do need to be critiqued, and named, and shamed, and the author shouldn't have been so shy about doing it.

willtemperley•5mo ago
The reference implementation for Parquet is a gigantic Java library. I'm unconvinced this is a good idea.

Take the RLE encoding, which switches between run-length encoding and bit-packing. The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bit width, endianness and value length.

I just can't believe this is optimal except in maybe very specific CPU-only cases (e.g. Parquet-Java running on a giant cluster somewhere).

If it were just bit-packing I could easily offload a whole data page to a GPU and not care about having per-bitwidth optimised implementations, but having to switch encodings at random intervals just makes this a headache.

It would be really nice if actual design documents existed that explained why this is a good idea based on real-world data patterns.
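
For reference, the hybrid scheme itself is small; a rough Python sketch of a decoder based on the format spec (names are illustrative, this is not parquet-java's code, and the optional length/bit-width prefix that precedes the runs in some contexts is not handled):

    def read_uleb128(buf, pos):
        # Read a ULEB128 varint starting at `pos`; return (value, new_pos).
        result, shift = 0, 0
        while True:
            byte = buf[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result, pos
            shift += 7

    def decode_rle_bitpacked_hybrid(buf, bit_width, num_values):
        # Each run starts with a varint header: low bit 0 = RLE run of
        # (header >> 1) values, low bit 1 = bit-packed run of
        # (header >> 1) * 8 values packed LSB-first.
        values, pos = [], 0
        while len(values) < num_values and pos < len(buf):
            header, pos = read_uleb128(buf, pos)
            if header & 1 == 0:                      # RLE run
                run_len = header >> 1
                width_bytes = (bit_width + 7) // 8
                value = int.from_bytes(buf[pos:pos + width_bytes], "little")
                pos += width_bytes
                values.extend([value] * run_len)
            else:                                    # bit-packed run
                groups = header >> 1
                n_bytes = groups * bit_width         # 8 values per group
                bits = int.from_bytes(buf[pos:pos + n_bytes], "little")
                pos += n_bytes
                mask = (1 << bit_width) - 1
                for i in range(groups * 8):
                    values.append((bits >> (i * bit_width)) & mask)
        return values[:num_values]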

willtemperley•5mo ago
Addendum: if something is actually decoded by RunLengthBitPackingHybridDecoder but you call the encoding RLE, that is probably because it was a bad idea in the first place. Plus it makes it really hard to search for.
ignoreusernames•5mo ago
> The reference implementation for Parquet is a gigantic Java library. I'm unconvinced this is a good idea.

I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized “service-like” process that you run alongside your engine, using Arrow to share zero-copy buffers between the engine and the Parquet service. Parquet predates Arrow by quite a few years, and Java was (unfortunately) the standard for big data stuff back then, so they simply stuck with it.

> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length

I think they did this to avoid dynamic dispatch in Java. With C++ or Rust something very similar would happen, but at the compiler level, which is a much saner way of doing this kind of thing.

willtemperley•5mo ago
Actually, looking at the DuckDB source, I think they re-use a single uint64 and push bits onto it a byte at a time until the bit width is reached, then right-shift the bit-width bits back off once a single value has been produced. Very neat and presumably quick.

I've just had so many issues with total lack of clarity with this format. They tell you a total_compressed_size for a page then it turns out the _uncompressed_ page header is included in this - but the documentation barely give any clues to the layout [1].

The reality:

Each column chunk contains a list of pages written back-to-back, with an optional dictionary page first. Each of these, including the dictionary page, is prepended with an uncompressed PageHeader in Thrift format.

It wasn't too hard to write a paragraph about it. It was quite hard looking for magic compression bytes in hex dumps.

Maybe there should be a "minimum workable reference implementation" or something that is slow but easy to understand.

[1] https://parquet.apache.org/docs/file-format/data-pages/colum...
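
Those offsets and sizes are at least visible through pyarrow's metadata objects; a small inspection sketch (the file name is a placeholder, and dictionary_page_offset is None when there is no dictionary page):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    col = pf.metadata.row_group(0).column(0)

    print(col.dictionary_page_offset)   # optional dictionary page comes first
    print(col.data_page_offset)         # first data page after it
    print(col.total_compressed_size)    # spans the uncompressed PageHeaders too
    print(col.encodings, col.compression)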

quotemstr•5mo ago
If you're doing IPC to a sidecar for purely numeric computation that you could just as easily do in-process, something has gone terribly wrong with your software engineering methodology.
nerdponx•5mo ago
I'd rather have this file format with an incomplete reference and confusing implementation, than not have this file format at all. Parquet was such a tremendous improvement in quality of life over the prior status quo for anyone that needs to move even moderate amounts of data between systems, or anyone who cares about correctness and bug prevention when working with even the tiniest data sets. Maybe HDF5 and ORC would have filled the niche if Parquet hadn't, but I think realistically we would just be stuck with fragile CSV/TSV.
quotemstr•5mo ago
74 KLOC for a decoder? That's ridiculous. Use invokedynamic. Yes, people more typically associate invokedynamic with interpreter implementations or whatever, but it's actually perfect for this use case. Generate the right code on demand and let the JVM cache it so that subsequent invocations are just as fast as if you'd written it by hand.

Jesus Christ, this isn't 2005 anymore, and people need to learn to use the real power of the JVM. It's stuff like this that sets it apart.

Uehreka•5mo ago
Only one can win. Only one can be allowed to live. To decide which one, I hereby convene…

(slams gavel)

Parquet Court.

atbpaca•5mo ago
Similarly with Apache Spark and Scala versions: Spark ran on Scala 2.12 for a long time before eventually supporting 2.13, and to this day there are no plans to support Scala 3.x. Databricks started supporting 2.13 only in May this year...
sonium•5mo ago
TLDR: There are two versions of the Parquet file format, but adoption of Version 2 is slow due to limited compatibility in major engines and tools. While Version 2 offers improvements (smaller file sizes, faster write/read times), these gains are modest, and ecosystem support remains fragmented. If full control over the data pipeline is possible, using Version 2 can be worthwhile; otherwise, compatibility concerns with third-party integrations may outweigh the benefits. Parquet remains dominant, and its utility far surpasses these challenges.
paparicio•5mo ago
I always follow all of your posts.

Parquet is amazinggggggg