Ask HN: Distributed SQL engine for ultra-wide tables

10•synsqlbythesea•8h ago
I ran into a practical limitation while working on ML feature engineering and multi-omics data.

At some point, the problem stops being “how many rows” and becomes “how many columns”. Thousands, then tens of thousands, sometimes more.

What I observed in practice:

- Standard SQL databases usually cap out around ~1,000–1,600 columns.
- Columnar formats like Parquet can handle width, but typically require Spark or Python pipelines.
- OLAP engines are fast, but tend to assume relatively narrow schemas.
- Feature stores often work around this by exploding data into joins or multiple tables.

At extreme width, metadata handling, query planning, and even SQL parsing become bottlenecks.

I experimented with a different approach:

- no joins
- no transactions
- columns distributed instead of rows
- SELECT as the primary operation
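
The post does not describe the actual implementation, so the following is only a toy sketch of what "columns distributed instead of rows" could look like. The class name `ColumnShardedTable`, the CRC32-based placement, and the in-memory "nodes" are all assumptions made for illustration, not the author's design.

```python
import zlib


class ColumnShardedTable:
    """Toy table whose columns, not rows, are spread across nodes (hypothetical)."""

    def __init__(self, num_nodes):
        # Each "node" is just a dict mapping column name -> list of values.
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, column):
        # Stable hash so a column always maps to the same node.
        return zlib.crc32(column.encode("utf-8")) % len(self.nodes)

    def insert_column(self, column, values):
        self.nodes[self._node_for(column)][column] = list(values)

    def select(self, columns, limit=None):
        # Only the nodes owning the requested columns are touched.
        fetched = [self.nodes[self._node_for(c)][c][:limit] for c in columns]
        # Rows are reassembled by position, so no join is involved.
        return list(zip(*fetched))


table = ColumnShardedTable(num_nodes=2)
for i in range(1000):
    table.insert_column(f"f{i}", [i * 10 + r for r in range(5)])

rows = table.select(["f3", "f42", "f999"], limit=2)
print(rows)
```

In this sketch the cost of a select depends on the number of columns requested rather than on the total width of the table, which is the property the post is after.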

With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns.

On a small cluster (2 servers, AMD EPYC, 128 GB RAM each), rough numbers look like:

- creating a 1M-column table: ~6 minutes
- inserting a single column with 1M values: ~2 seconds
- selecting ~60 columns over ~5,000 rows: ~1 second

I’m curious how others here approach ultra-wide datasets. Have you seen architectures that work cleanly at this width without resorting to heavy ETL or complex joins?

Comments

icsa•5h ago
> With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns.

What is the design?

remywang•1h ago
What are the columns and why are there so many of them? The standard approach is to explode into many tables and introduce joins as you said. Why don’t you want joins?
anotherpaul•1h ago
I am speculating here, but as it is genomics data I assume it's information such as gene counts and epigenetic information (methylation, histones, etc.). Once you multiply 20k by a few post-translational modifications, you can get to quite a few columns quickly.

Usually this would be stored in a sparse long form though. So I might be wrong.

hobs•24m ago
If you want to do that, why not just use an EAV (entity-attribute-value) pattern or something else that can translate rows to columns?
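
A minimal sketch of that idea, assuming a long table of (entity, attribute, value) triples; the sample and attribute names below are made up for illustration. Only the attributes a query asks for are pivoted into columns, and pairs that were never stored come back as `None`.

```python
from collections import defaultdict

# Long form: one row per (entity, attribute, value).
# Absent pairs are simply not stored, which keeps sparse data compact.
long_rows = [
    ("sample_1", "gene_a_count", 12),
    ("sample_1", "gene_b_methylation", 0.8),
    ("sample_2", "gene_a_count", 7),
]


def pivot(rows, attributes):
    """Turn long rows into wide rows limited to the requested attributes."""
    wanted = set(attributes)
    wide = defaultdict(dict)
    for entity, attribute, value in rows:
        if attribute in wanted:
            wide[entity][attribute] = value
    # Missing attributes become None, like NULL in a wide table.
    return {
        entity: [values.get(a) for a in attributes]
        for entity, values in wide.items()
    }


wide = pivot(long_rows, ["gene_a_count", "gene_b_methylation"])
print(wide)
```

Note that an entity with none of the requested attributes is omitted from the result in this sketch.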
minitoar•1h ago
ClickHouse and Scuba address this. The core idea is the data layout on disk only requires the scan to open files or otherwise access data for the columns referenced in that query.
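
As a rough illustration of that layout (not how ClickHouse or Scuba actually store data), the sketch below writes one file per column and has the scan open only the files for the referenced columns. The JSON encoding and the `write_table` and `scan` helpers are assumptions chosen to keep the example short.

```python
import json
import tempfile
from pathlib import Path


def write_table(directory, columns):
    """Store each column in its own file under `directory`."""
    for name, values in columns.items():
        (Path(directory) / f"{name}.json").write_text(json.dumps(values))


def scan(directory, referenced):
    """Read only the referenced columns; return the rows and the files opened."""
    opened = []
    data = []
    for name in referenced:
        path = Path(directory) / f"{name}.json"
        opened.append(path.name)
        data.append(json.loads(path.read_text()))
    return list(zip(*data)), opened


with tempfile.TemporaryDirectory() as tmp:
    # 500 columns on disk, but the query below references only two of them.
    write_table(tmp, {f"c{i}": [i, i + 1, i + 2] for i in range(500)})
    rows, opened = scan(tmp, ["c7", "c300"])

print(rows)
print(opened)
```

The other 498 column files are never opened, so the I/O for a query scales with the columns it names rather than with the table width.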
kentm•1h ago
What engine and data format were you using for your experiment?

You mention Parquet and Spark, but I’m wondering if you tried any of the “Lakehouse” formats that are basically Parquet + a metadata layer (e.g. Iceberg). I’d probably at least give Trino or Presto a shot, although I suspect that you’ll have similar metadata issues with those engines.

mamcx•1h ago
Yeah, this is a hard problem, especially because standard SQL databases only partially implement the relational model, have no good recourse for dealing with relations-in-relations, and lack ways to build your own storage in user space (all stuff that I dream of tackling).

I think one possible answer is to try to "compress" columns with custom data types. It could require touching part of the innards of the SQL engine (in PostgreSQL, for example, you need to do it in C), but it is a viable option in many cases: what you could express in JSON, for example, is often in fact a custom type that could be stored efficiently if there is a way to translate it to more primitive types, at which point indexes will work too.

The second option is to hide part of the join complexity with views.
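
A small sketch of the view approach using SQLite from the Python standard library. The `expression` and `methylation` tables and their columns are hypothetical; the point is only that queries go against the `features` view and never spell out the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE expression (sample_id INTEGER PRIMARY KEY, gene_a REAL, gene_b REAL);
    CREATE TABLE methylation (sample_id INTEGER PRIMARY KEY, site_x REAL, site_y REAL);
    INSERT INTO expression VALUES (1, 0.5, 1.5), (2, 0.7, 1.1);
    INSERT INTO methylation VALUES (1, 0.9, 0.2), (2, 0.4, 0.6);

    -- The view presents the two narrow tables as one wide relation.
    CREATE VIEW features AS
    SELECT e.sample_id, e.gene_a, e.gene_b, m.site_x, m.site_y
    FROM expression AS e
    JOIN methylation AS m ON m.sample_id = e.sample_id;
""")

# The caller selects a subset of "columns" without writing any join.
rows = conn.execute(
    "SELECT sample_id, gene_a, site_y FROM features ORDER BY sample_id"
).fetchall()
print(rows)
```

The join still runs underneath, so this hides the complexity from the person writing queries but does not by itself remove the planning cost.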

pedrini210•20m ago
Check out the Vortex file format (https://vortex.dev/). If you are interested in a distributed SQL engine, you can check SpiralDB (https://spiraldb.com/); I haven’t used this one personally, but they created Vortex.

If you can drop the “distributed” part, then plug in DuckDB (https://duckdb.org/) and query Parquet (out of the box) or Vortex (https://duckdb.org/docs/stable/core_extensions/vortex.html) with it.

didgetmaster•10m ago
Is there really a market for these kinds of relational tables?

I created a system to support my custom object store where the metadata tags are stored within key-value stores. I can use them to create relational tables and query them just like conventional row stores used by many popular database engines.

My 'columnar store database' can handle many thousands of columns within a single table. So far, I have only tested it out to 10,000 columns, but it should handle many more.

I can get sub-second query times against it running on a single desktop. I haven't promoted this feature, since no one I have talked to about it had a compelling use for it.
