Best way to annotate large parquet LLM logs without full rewrites?

2•platypii•2h ago
I asked this on the Apache mailing list but haven’t found a good solution yet. Wondering if anyone has some ideas for how to engineer this?

Here’s my problem: I have gigabytes of LLM conversation logs in parquet in S3. I want to add per-row annotations (llm-as-a-judge scores), ideally without touching the original text data.

So for a given dataset, I want to add a new column. This seemed like a perfect use case for Iceberg. Iceberg does let you evolve the table schema, including adding a column. But you can only add a column with a default value. If I want to fill in that column with annotations, Iceberg makes me rewrite every row. So despite being based on parquet, a column-oriented format, I need to rewrite the entire source text data (gigabytes of data) just to add ~1 MB of annotations. This feels wildly inefficient.

I considered just storing the column in its own table and then joining the two. This does work, but the joins are annoying to work with, and I suspect query engines do not optimize a "join on row_number" operation well.
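For illustration, here is a minimal sketch of that side-table approach, with the join hidden behind a view so queries don't have to spell it out. It uses Python's built-in sqlite3 purely as a stand-in for a real query engine over parquet; the names `logs`, `judge_scores`, `annotated_logs`, and `row_id` are hypothetical, and it assumes each row has a stable identifier to join on. It says nothing about how well a given engine would optimize the same join at scale.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a logs table, a separate
    score table, and a view that joins them on row_id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Stand-in for the large, immutable conversation data.
        CREATE TABLE logs (row_id INTEGER PRIMARY KEY, conversation TEXT);
        -- Small side table holding only the annotations.
        CREATE TABLE judge_scores (row_id INTEGER PRIMARY KEY, score REAL);
        -- The view hides the join; unannotated rows get a NULL score.
        CREATE VIEW annotated_logs AS
            SELECT l.row_id, l.conversation, s.score
            FROM logs AS l
            LEFT JOIN judge_scores AS s ON s.row_id = l.row_id;
        """
    )
    return conn


def add_scores(conn, scores):
    """Write annotations to the side table only; logs is never touched."""
    conn.executemany(
        "INSERT OR REPLACE INTO judge_scores (row_id, score) VALUES (?, ?)",
        list(scores.items()),
    )
    conn.commit()


conn = make_demo_db()
conn.executemany(
    "INSERT INTO logs (row_id, conversation) VALUES (?, ?)",
    [(0, "hello"), (1, "how do I sort a list?"), (2, "thanks")],
)
add_scores(conn, {0: 0.9, 2: 0.4})

rows = conn.execute(
    "SELECT row_id, conversation, score FROM annotated_logs ORDER BY row_id"
).fetchall()
for row in rows:
    print(row)
```

Readers query `annotated_logs` as if the score were a regular column, while writes only ever touch the small side table.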

I've been exploring using little-known features of parquet like the file_path field to store column data in external files. But literally zero parquet clients support this.

I'm running out of ideas for how to work with this data efficiently. It's bad enough that I am considering building my own table format if I can’t find a solution. Anyone have suggestions?
