The overwritten space is never reclaimed, so the buffer grows indefinitely.
Is the garbage at least zeroed? Otherwise it seems like it could "leak" overwritten values when sending whole buffers via memcpy. Don't get me wrong, I find this type of data structure interesting and useful, but it's misleading to call it "serialization", unless my understanding is wrong.
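The leak concern can be illustrated with a generic append-only sketch (this illustrates the failure mode only; it is not Lite³'s actual layout): if an update appends the new value and merely re-points an offset, the old bytes stay in the buffer, and memcpy'ing the whole buffer over the wire ships the stale data too.

```python
# Hypothetical append-only buffer: updating a key appends a new value
# and re-points the index, abandoning (not zeroing) the old bytes.
buf = bytearray()
index = {}  # key -> (offset, length) of the *current* value

def put(key: str, value: bytes) -> None:
    index[key] = (len(buf), len(value))  # old region, if any, is abandoned
    buf.extend(value)

put("token", b"SECRET-OLD")
put("token", b"public")  # logically replaces, physically appends

off, n = index["token"]
assert buf[off:off + n] == b"public"  # a reader sees the new value
assert b"SECRET-OLD" in bytes(buf)    # but stale bytes still travel with the buffer
```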
Apache Arrow is trying to do something similar, using Flatbuffers to serialize with zero-copy and zero-parse semantics, and an index structure built on top of that.
Would love to see comparisons with Arrow
cryptonector•6d ago
Perhaps I should have posted this URI instead: https://lite3.io/design_and_limitations.html
Lite³ deserves to be noticed by HN. u/eliasdejong (the author) posted it 23 days ago but it didn't get very far. I'm hoping this time it gets noticed.
eric-p7•3h ago
"outperforms the fastest JSON libraries (that make use of SIMD) by up to 120x depending on the benchmark. It also outperforms schema-only formats, such as Google Flatbuffers (242x). Lite³ is possibly the fastest schemaless data format in the world."
^ This should be a bar graph at the top of the page, showing both serialized sizes and speeds.
It would also be nice to see a JSON representation on the left and a color-coded string of bytes on the right showing how the data is packed.
Then the explanation follows.
Someone•50m ago
FTA#2: “Object keys (think JSON) are hashed to a 4-byte digest and stored inside B-tree nodes”
It will still likely be faster because of better cache locality, but doesn't that mean this also does not (efficiently) support range queries?
That page also says
“tree traversal inside the critical path can be satisfied entirely using fixed 4-byte word comparisons, never actually requiring string comparisons except for detection of hash collisions. This design choice alone contributes to much of the runtime performance of Lite³.”
How can that be true, given that this beats libraries that use hash maps, which also rarely require string comparisons, by a large margin?
Finally, https://lite3.io/design_and_limitations.html#autotoc_md37 says:
“Inserting a colliding key will not corrupt your data or have side effects. It will simply fail to insert.”
I also notice this uses the DJB2 hash function, which has hash collisions between short strings (http://dmytry.blogspot.com/2009/11/horrible-hashes.html), and those are more likely to be present in JSON documents. You get about 8 + 3 × 5 = 23 bits of hash for four-character strings, for example, increasing the collision risk to, ballpark, about one in three thousand.
=> I think that needs fixing before this can be widely used.
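The short-string weakness is easy to demonstrate with the classic additive djb2 (h = h*33 + c): appending a character c computes 33*h + c, so bumping one character up by 1 and the next one down by 33 leaves the hash unchanged, e.g. "az" vs "bY" (97*33 + 122 == 98*33 + 89). This is a property of additive djb2 itself; whether Lite³'s exact variant behaves identically is an assumption here.

```python
def djb2(s: str) -> int:
    # Classic djb2: h = h*33 + c (some variants XOR instead of add).
    h = 5381
    for ch in s:
        h = h * 33 + ord(ch)
    return h & 0xFFFFFFFF  # truncate to a 4-byte digest

# +1 on one character and -33 on the next cancels: 33*(c+1) + (d-33) == 33*c + d
assert djb2("az") == djb2("bY")
# The construction survives any shared prefix, so realistic keys collide too:
assert djb2("user_az") == djb2("user_bY")
```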
nneonneo•9m ago
It's a bit unfortunate that the wire format is tied to a specific hash function. It also means that the spec will ossify around a specific hash function, which may not end up being the optimal choice. Neither JSON nor Protobuf have this limitation. One way around this would be to ditch the hashing and use the keys for the B-tree directly. It might be worth benchmarking - I don't think it's necessarily any slower, and an inline cache of key prefixes (basically a cheapo hash using the first N chars) should help preserve performance for common cases.
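The prefix-cache idea can be sketched as follows (a hypothetical illustration, not anything from the Lite³ format): store the first 4 bytes of each key inline as an integer and compare that first, falling back to a full key comparison only when the prefixes tie.

```python
def prefix4(key: bytes) -> int:
    # First 4 bytes, zero-padded, as a big-endian int so integer order
    # matches lexicographic order of the raw bytes.
    return int.from_bytes(key[:4].ljust(4, b"\x00"), "big")

def compare(a: bytes, b: bytes) -> int:
    pa, pb = prefix4(a), prefix4(b)
    if pa != pb:                      # common case: one word comparison
        return -1 if pa < pb else 1
    return (a > b) - (a < b)          # rare: fall back to full compare

assert compare(b"apple", b"banana") == -1       # decided by the prefix alone
assert compare(b"prefix_a", b"prefix_b") == -1  # shared prefix: full compare
assert compare(b"same", b"same") == 0
```

Unlike a hash digest, the prefix preserves lexicographic key order, so this would also keep range queries possible, which addresses the concern raised above about hashed keys.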