Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m

https://huggingface.co/datasets/open-index/hacker-news
87•tamnd•4d ago

Comments

Onavo•1h ago
Is it possible to download only a subset, e.g. Show HNs or HN Whoishiring? The Show HNs and HN Whoishiring are very useful for classroom data science, i.e. a very useful set of data for students to learn the basics of data cleaning and engineering.
nelsondev•1h ago
It’s date partitioned, so you could download just a date range. It’s also Parquet, so you can download just specific columns with the right client.
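
A minimal sketch of the date-range idea, assuming one Parquet file per month. The path template below is hypothetical and is not taken from the dataset; check the repository's actual layout before using it.

```python
from datetime import date

# Hypothetical layout: one file per month. The real dataset may differ.
HYPOTHETICAL_TEMPLATE = "data/{year}/{year}-{month}.parquet"


def months_in_range(start: date, end: date) -> list[str]:
    """Return 'YYYY-MM' keys for every month touched by [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


def partition_paths(start: date, end: date,
                    template: str = HYPOTHETICAL_TEMPLATE) -> list[str]:
    """Expand a date range into the monthly file paths that cover it."""
    paths = []
    for key in months_in_range(start, end):
        year, month = key.split("-")
        paths.append(template.format(year=year, month=month))
    return paths


if __name__ == "__main__":
    for path in partition_paths(date(2025, 11, 15), date(2026, 2, 3)):
        print(path)
```

Each resulting file could then be read with column projection, for example via PyArrow's `pyarrow.parquet.read_table(path, columns=[...])`, so that only the named columns are loaded.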
bstsb•1h ago
what’s the license? “do whatever the fuck you want with the data as long as you don’t get caught”? or does that only work for massive corporations
palmotea•1h ago
> At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory.

Wouldn't that lose deleted/moderated comments?
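
One way to check this, sketched under assumptions: compare the items captured in the day's 5-minute blocks against the refetched monthly snapshot. The field names `id`, `text`, and `deleted` are assumed here, and the sample records are made up for illustration.

```python
def lost_on_refetch(blocks: list[dict], snapshot: list[dict]) -> list[int]:
    """Return ids whose text was captured in the blocks but is gone in the snapshot."""
    snapshot_by_id = {item["id"]: item for item in snapshot}
    lost = set()
    for item in blocks:
        if not item.get("text"):
            continue  # nothing was captured, so nothing can be lost
        later = snapshot_by_id.get(item["id"])
        # Lost if the item vanished, was flagged deleted, or had its text blanked.
        if later is None or later.get("deleted") == 1 or not later.get("text"):
            lost.add(item["id"])
    return sorted(lost)


# Made-up sample data: item 2 is deleted before the refetch, item 3 disappears.
blocks = [
    {"id": 1, "text": "hello", "deleted": 0},
    {"id": 2, "text": "removed later", "deleted": 0},
    {"id": 3, "text": "fine", "deleted": 0},
]
snapshot = [
    {"id": 1, "text": "hello", "deleted": 0},
    {"id": 2, "text": "", "deleted": 1},
]

if __name__ == "__main__":
    print(lost_on_refetch(blocks, snapshot))
```

If the 5-minute blocks are discarded after the refetch, any id this function returns would no longer have its original text anywhere in the archive.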

GeoAtreides•1h ago
is the legal page a placeholder, do words have no meaning?

https://www.ycombinator.com/legal/

Mods, enforce your license terms, you're playing fast and loose with the law (GDPR/CPRA)

andrewmcwatters•1h ago
They already refuse to comply with CPRA, instead electing to replace your username with a random 6(?) character string, prefixed with `_`, if I remember correctly.

I know, because I've been here since maybe 2015 or so, but this account was created in 2019.

So any PII you have mentioned in your comments is permanent on Hacker News.

I would appreciate it if they gave users the ability to remove all of their personal data, but in correspondence and in writing here on Hacker News itself, Dan has suggested that they value the posterity of conversations over the law.

Retr0id•1h ago
Which terms are not being enforced? (not disagreeing I just don't feel like reading a large legal document)
ungruntled•52m ago
None that I could see:

Your submissions to, and comments you make on, the Hacker News site are not Personal Information and are not "HN Information" as defined in this Privacy Policy.

Other Users: certain actions you take may be visible to other users of the Services.

GeoAtreides•50m ago
I mean, just because they say the comments are not PI doesn't make it so.
ungruntled•44m ago
That’s a good point. I’m only referring to the terms they used in the privacy policy.
GeoAtreides•51m ago
> By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies

The user content is supposed to be licensed only to Y Combinator and (bleah) its affiliated companies (which are many, all the startups they fund, for example).

ryandvm•17m ago
That agreement is largely about "Personal Information", not the posts and comments.

That said, there are "no scraping" and "commercial use restricted" carve-outs for the content on HN. Which honestly is bullshit.

jmalicki•8m ago
Curious why it should be on HackerNews to enforce restrictions on content they only license from you?

If it's owned by you and only licensed by HN shouldn't you be the one enforcing it?

hsuduebc2•53m ago
How is he breaking GDPR here?
ryandvm•19m ago
Eh, fuck that agreement. I'm kind of old school in that I believe if you put it on the internet without an auth-wall, people should be allowed to do whatever they want with it. The AI companies seem to agree.

Then again, I'm not the guy that is going to get sued...

0cf8612b2e1e•1h ago
Under the Known Limitations section:

> deleted and dead are integers. They are stored as 0/1 rather than booleans.

Is there a technical reason to do this? You have the type right there.
gkbrk•1h ago
My Hacker News items table in ClickHouse has 47,428,860 items, and it's 5.82 GB compressed and 18.18 GB uncompressed. What makes Parquet compression worse here, when both formats are columnar?
xnx•55m ago
Parquet has a few compression options. Not sure which one they are using.
hirako2000•48m ago
Plus, Parquet isn't the least wasteful format; native DuckDB, for instance, compacts better. That's not just down to the compression algorithm, of which, as you say, Parquet has three main options.
0cf8612b2e1e•53m ago
Sorting, compression algorithm and level, and data types can all have an impact. I noted elsewhere that a Boolean is getting represented as an integer. That’s one bit vs 1-4 bytes.

There is also flexibility in what you define as the dataset. Skinnier but more focused tables could save space versus a wide table that covers everything, which will probably break compressible runs of data.
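
A rough illustration of the data-type and sorting points, using only the Python standard library. This is not how Parquet stores data: zlib stands in for a generic compressor, and the 5% flag rate is an arbitrary assumption.

```python
import random
import zlib


def pack_bits(flags: list[int]) -> bytes:
    """Pack 0/1 flags into a bitmap, eight flags per byte."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def as_int32(flags: list[int]) -> bytes:
    """Store each flag as a 4-byte little-endian integer."""
    return b"".join(int(flag).to_bytes(4, "little") for flag in flags)


rng = random.Random(0)  # fixed seed so runs are repeatable
flags = [1 if rng.random() < 0.05 else 0 for _ in range(100_000)]

raw_bits = pack_bits(flags)              # 1 bit per flag
raw_ints = as_int32(flags)               # 4 bytes per flag
sorted_ints = as_int32(sorted(flags))    # same values, long uniform runs

if __name__ == "__main__":
    print("raw bitmap bytes:      ", len(raw_bits))
    print("raw int32 bytes:       ", len(raw_ints))
    print("zlib bitmap bytes:     ", len(zlib.compress(raw_bits)))
    print("zlib int32 bytes:      ", len(zlib.compress(raw_ints)))
    print("zlib sorted int32 bytes:", len(zlib.compress(sorted_ints)))
```

Before compression the int32 form is 32 times larger than the bitmap, and sorting shrinks the compressed int32 output because equal values form long runs. Parquet's own dictionary and run-length encodings likely narrow the type gap in practice, so treat this only as a demonstration of the direction of the effect.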

xnx•56m ago
The best source for this data used to be Clickhouse (https://play.clickhouse.com/play?user=play#U0VMRUNUIG1heCh0a...), but it hasn't updated since 2025-12-26.
mlhpdx•37m ago
Static web content and dynamic data?

> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.

That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.
