The overwritten space is never reclaimed, so the buffer grows indefinitely.
Is the garbage at least zeroed? Otherwise it seems like it could "leak" overwritten values when sending whole buffers via memcpy. Don't get me wrong, I find this type of data structure interesting and useful, but it's misleading to call it "serialization", unless my understanding is wrong.
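The leak concern can be illustrated with a generic append-only sketch (this illustrates the failure mode only; it is not Lite³'s actual layout): if an update appends the new value and merely re-points an offset, the old bytes stay in the buffer, and memcpy'ing the whole buffer over the wire ships the stale data too.

```python
# Hypothetical append-only buffer: updating a key appends a new value
# and re-points the index, abandoning (not zeroing) the old bytes.
buf = bytearray()
index = {}  # key -> (offset, length) of the *current* value

def put(key: str, value: bytes) -> None:
    index[key] = (len(buf), len(value))  # old region, if any, is abandoned
    buf.extend(value)

put("token", b"SECRET-OLD")
put("token", b"public")  # logically replaces, physically appends

off, n = index["token"]
assert buf[off:off + n] == b"public"  # a reader sees the new value
assert b"SECRET-OLD" in bytes(buf)    # but stale bytes still travel with the buffer
```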
Apache Arrow is trying to do something similar, using Flatbuffers to serialize with zero-copy and zero-parse semantics, and an index structure built on top of that.
Would love to see comparisons with Arrow
cryptonector•6d ago
Perhaps I should have posted this URI instead: https://lite3.io/design_and_limitations.html
Lite³ deserves to be noticed by HN. u/eliasdejong (the author) posted it 23 days ago but it didn't get very far. I'm hoping this time it gets noticed.
eric-p7•3h ago
"outperforms the fastest JSON libraries (that make use of SIMD) by up to 120x depending on the benchmark. It also outperforms schema-only formats, such as Google Flatbuffers (242x). Lite³ is possibly the fastest schemaless data format in the world."
^ This should be a bar graph at the top of the page, showing both serialized sizes and speeds.
It would also be nice to see a JSON representation on the left and a color-coded string of bytes on the right showing how the data is packed.
Then the explanation follows.
Someone•50m ago
FTA#2: “Object keys (think JSON) are hashed to a 4-byte digest and stored inside B-tree nodes”
It will still likely be faster because of better cache locality, but doesn't that mean this also does not (efficiently) support range queries?
That page also says
“tree traversal inside the critical path can be satisfied entirely using fixed 4-byte word comparisons, never actually requiring string comparisons except for detection of hash collisions. This design choice alone contributes to much of the runtime performance of Lite³.”
How can that be true, given that this beats libraries that use hash maps, which also rarely require string comparisons, by a large margin?
Finally, https://lite3.io/design_and_limitations.html#autotoc_md37 says:
“Inserting a colliding key will not corrupt your data or have side effects. It will simply fail to insert.”
I also notice this uses the DJB2 hash function, which has hash collisions between short strings (http://dmytry.blogspot.com/2009/11/horrible-hashes.html), and those are more likely to be present in JSON documents. You get about 8 + 3 × 5 = 23 bits of hash for four-character strings, for example, increasing the collision risk to, ballpark, about one in three thousand.
=> I think that needs fixing before this can be widely used.
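The short-string weakness is easy to demonstrate with the classic additive djb2 (h = h*33 + c): appending a character c computes 33*h + c, so bumping one character up by 1 and the next one down by 33 leaves the hash unchanged, e.g. "az" vs "bY" (97*33 + 122 == 98*33 + 89). This is a property of additive djb2 itself; whether Lite³'s exact variant behaves identically is an assumption here.

```python
def djb2(s: str) -> int:
    # Classic djb2: h = h*33 + c (some variants XOR instead of add).
    h = 5381
    for ch in s:
        h = h * 33 + ord(ch)
    return h & 0xFFFFFFFF  # truncate to a 4-byte digest

# +1 on one character and -33 on the next cancels: 33*(c+1) + (d-33) == 33*c + d
assert djb2("az") == djb2("bY")
# The construction survives any shared prefix, so realistic keys collide too:
assert djb2("user_az") == djb2("user_bY")
```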
nneonneo•9m ago
It's a bit unfortunate that the wire format is tied to a specific hash function. It also means that the spec will ossify around a specific hash function, which may not end up being the optimal choice. Neither JSON nor Protobuf have this limitation. One way around this would be to ditch the hashing and use the keys for the B-tree directly. It might be worth benchmarking - I don't think it's necessarily any slower, and an inline cache of key prefixes (basically a cheapo hash using the first N chars) should help preserve performance for common cases.
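The prefix-cache idea can be sketched as follows (a hypothetical illustration, not anything from the Lite³ format): store the first 4 bytes of each key inline as an integer and compare that first, falling back to a full key comparison only when the prefixes tie.

```python
def prefix4(key: bytes) -> int:
    # First 4 bytes, zero-padded, as a big-endian int so integer order
    # matches lexicographic order of the raw bytes.
    return int.from_bytes(key[:4].ljust(4, b"\x00"), "big")

def compare(a: bytes, b: bytes) -> int:
    pa, pb = prefix4(a), prefix4(b)
    if pa != pb:                      # common case: one word comparison
        return -1 if pa < pb else 1
    return (a > b) - (a < b)          # rare: fall back to full compare

assert compare(b"apple", b"banana") == -1       # decided by the prefix alone
assert compare(b"prefix_a", b"prefix_b") == -1  # shared prefix: full compare
assert compare(b"same", b"same") == 0
```

Unlike a hash digest, the prefix preserves lexicographic key order, so this would also keep range queries possible, which addresses the concern raised above about hashed keys.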