I would assume it's because the shape of the data would then be too different? SoA (struct of arrays) is super effective when it suits the shape of the data. Here, the difference would be like the difference between an OLTP and an OLAP DB. And you wouldn't use an OLAP DB for an OLTP workload?
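To make the layout distinction concrete, here is a minimal sketch of the two shapes in Rust. The `Row` and `Columns` types and their fields are hypothetical, made up for illustration; they are not taken from the article's code.

```rust
/// Array-of-structs (AoS) element: each record is stored contiguously,
/// which suits row-at-a-time (OLTP-style) access.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: u64,
    pub price: f64,
    pub quantity: u32,
}

/// Struct-of-arrays (SoA): each field lives in its own contiguous column,
/// which suits scans over a single field (OLAP-style).
#[derive(Debug, Default)]
pub struct Columns {
    pub id: Vec<u64>,
    pub price: Vec<f64>,
    pub quantity: Vec<u32>,
}

impl Columns {
    /// Transpose a slice of rows into per-field columns.
    pub fn from_rows(rows: &[Row]) -> Self {
        let mut cols = Columns::default();
        for r in rows {
            cols.id.push(r.id);
            cols.price.push(r.price);
            cols.quantity.push(r.quantity);
        }
        cols
    }

    /// Column scan: reads only the `price` column.
    pub fn total_price(&self) -> f64 {
        self.price.iter().sum()
    }

    /// Row lookup: has to gather one element from every column.
    pub fn row(&self, i: usize) -> Option<Row> {
        Some(Row {
            id: *self.id.get(i)?,
            price: *self.price.get(i)?,
            quantity: *self.quantity.get(i)?,
        })
    }
}
```

The scan in `total_price` only walks one vector, while `row` has to visit every column, which is roughly the OLAP-versus-OLTP tradeoff described above.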
https://www.postgresql.org/docs/current/storage-page-layout....
I wasn't sure about writing the article in the first place because of that, but I figured it might be interesting anyway, because I was kind of happy with how simple this optimization turned out to be once it was all done. (When I started on the task, I wasn't sure whether it would be hard because of how our code is structured, the libraries we use, etc.) I originally posted this in the Rust community, and it seems people enjoyed the post.
This is basically Rob Pike's Rule 5: If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. (https://users.ece.utexas.edu/~adnan/pike.html)
If anything, it's the other way round, unless you're talking about business domain modeling (where a data-structures-first approach is valid).
I'm saying that if you care about performance, data structures should be designed with approach-specific tradeoffs in mind. And like I've said above, in typical business apps, it's OK to start with data structures because (a) performance is usually not a problem, and (b) staying close to the domain is cleaner.
But the whole discussion involves knowing how you will use it; the advocacy is for careful consideration of data structures (based on how you will use them) resulting in less pain when designing/choosing algorithms.
> If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.
This is what I was responding to.
"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."
I am quite strongly of the opinion that one should essentially never use these for anything that needs to work well at any scale. If you need an industrial strength on-disk format, start with a tool for defining on-disk formats, and map back to your language. This gives you far better safety, portability across languages, and often performance as well.
Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto or even JSON or XML or ASN.1. Note that there are zero programming languages in that list. The right choice is probably not C structs or pickles or some other language’s idea of pickles or even a really cool library that makes Rust do this.
(OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust? Hint: Java is also memory-safe, and that has not saved users of Java deserialization from all the extremely high severity security holes that have shown up over the years. You can shoot yourself in the foot just fine when you point a cannon at your foot, even if the cannon has no undefined behavior.)
I think parquet and arrow are great formats, but ultimately they have to solve a similar problem that rkyv solves: for any given type that they support, what does the bit pattern look like in serialized form and in deserialized form (and how do I convert between the two).
However, it is useful to point out that, on top of that, parquet/arrow solve many more of the problems involved in storing data 'at scale' than rkyv does (which is just a serialization framework, after all): well-defined data and file formats, backward compatibility, bloom filters, run-length encoding, compression, indexes, interoperability between languages, etc.
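As a rough illustration of what "the bit pattern in serialized form and in deserialized form" means, here is a hand-rolled sketch using only the Rust standard library. It is not rkyv's, Parquet's, or Arrow's actual format or API; the `Point` type and the fixed little-endian layout are assumptions made up for the example.

```rust
/// Hypothetical record type, used only for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Point {
    pub id: u32,
    pub value: f64,
}

/// Size of the serialized form: 4 bytes for `id` plus 8 bytes for `value`.
pub const POINT_SIZE: usize = 12;

impl Point {
    /// Deserialized form -> serialized form: fixed field order, little-endian.
    pub fn to_bytes(&self) -> [u8; POINT_SIZE] {
        let mut buf = [0u8; POINT_SIZE];
        buf[0..4].copy_from_slice(&self.id.to_le_bytes());
        buf[4..12].copy_from_slice(&self.value.to_le_bytes());
        buf
    }

    /// Serialized form -> deserialized form; returns `None` if the length is wrong.
    pub fn from_bytes(bytes: &[u8]) -> Option<Point> {
        if bytes.len() != POINT_SIZE {
            return None;
        }
        let mut id = [0u8; 4];
        id.copy_from_slice(&bytes[0..4]);
        let mut value = [0u8; 8];
        value.copy_from_slice(&bytes[4..12]);
        Some(Point {
            id: u32::from_le_bytes(id),
            value: f64::from_le_bytes(value),
        })
    }
}
```

Everything else listed above (schema evolution, compression, indexes, cross-language support) is what a full file format adds on top of a mapping like this one.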
SoftTalker•2h ago
Hard disagree. That database table was a waving red flag. I don't know enough (or any) Rust, so I don't really understand the rest of the article, but I have never in my life worked with a database table that had 700 columns. Or even 100.
gz09•2h ago
As to your hard disagree, I guess it depends... While this particular user is on the higher end (in terms of columns), it's not our only user where column counts are huge. We see tables with 100+ columns on a fairly regular basis especially when dealing with larger enterprises.
sublinear•56m ago
If it's not obvious, I agree with the hard disagree. Every time I see a table with that many columns, I have a hard time believing there isn't some normalization possible.
Schemas that stubbornly stick to high-level concepts and refuse to dig into the subfeatures of the data are often seen from inexperienced devs or dysfunctional/disorganized places too inflexible to care much. This isn't really negotiable. There will be issues with such a schema if it's meant to scale up or be migrated or maintained long term.
fiddlerwoaroof•23m ago
Also, normalization solves a problem that’s present in OLTP applications; OLAP/Big Data applications generally have problems that are solved by denormalization.
Mikhail_Edoshin•2h ago
I remember a phrase from one of C. J. Date's books: every record is a logical statement. It really stood out for me and I keep returning to it. Such an understanding implies a rather small number of fields or the logical complexity will go through the roof.
Spivak•25m ago
I've worked on multiple products with a concept of "custom fields" that did it this way too.
woah•1h ago
So it sounds like helping customers with databases full of red flags is their bread and butter
gz09•1h ago
Yes, that captures it well. Feldera is an incremental query engine. Loosely speaking: it computes answers to any of your SQL queries by doing work proportional to the incoming changes to your data (rather than to the entire state of your database tables).
If you have queries that take hours to compute in a traditional database like Spark/PostgreSQL/Snowflake (because of their complexity or the data size) and you want to always have the most up-to-date answer for your queries, Feldera will give you that answer 'instantly' whenever your data changes (after you've back-filled your existing dataset into it).
There is some more information about how it works under the hood here: https://docs.feldera.com/literature/papers
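To give a feel for "work proportional to the incoming changes", here is a toy sketch of an incrementally maintained `SUM(amount) ... GROUP BY key`. This is not Feldera's implementation or API; the `Delta` and `SumByKey` types are hypothetical, and real engines handle far more general queries than this single aggregate.

```rust
use std::collections::HashMap;

/// One change to the input table. `weight` is +1 for an inserted row and
/// -1 for a deleted row (hypothetical encoding, for illustration only).
pub struct Delta {
    pub key: String,
    pub amount: i64,
    pub weight: i64,
}

/// Incrementally maintained result of `SELECT key, SUM(amount) GROUP BY key`.
#[derive(Default)]
pub struct SumByKey {
    /// key -> (running sum, number of live rows in the group)
    groups: HashMap<String, (i64, i64)>,
}

impl SumByKey {
    /// Apply a batch of changes. The cost depends on the number of deltas,
    /// not on how many rows have been ingested so far.
    pub fn apply(&mut self, deltas: &[Delta]) {
        for d in deltas {
            let entry = self.groups.entry(d.key.clone()).or_insert((0, 0));
            entry.0 += d.weight * d.amount;
            entry.1 += d.weight;
            let group_is_empty = entry.1 == 0;
            if group_is_empty {
                // No rows left for this key, so it drops out of the result.
                self.groups.remove(&d.key);
            }
        }
    }

    /// Current sum for a key, or `None` if the group has no rows.
    pub fn sum(&self, key: &str) -> Option<i64> {
        self.groups.get(key).map(|g| g.0)
    }
}
```

After the initial back-fill, each call to `apply` only touches the groups named in the new deltas, so the stored answer stays current without rescanning the table.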
vharuck•1h ago
771 columns (and I've read the definitions for them all, plus about 50 more that have been retired). In the database, these are split across at least 3 tables (registry, patient, tumor). But when working with the records, it's common to use one joined table. Luckily, even that usually fits in RAM.
orthoxerox•43m ago