Loading Pydantic models from JSON without running out of memory

https://pythonspeed.com/articles/pydantic-json-memory/
134•itamarst•4mo ago

Comments

thisguy47•4mo ago
I'd like to see a comparison of ijson vs just `json.load(f)`. `ujson` would also be interesting to see.
itamarst•4mo ago
For my PyCon 2025 talk I did this. Video isn't up yet, but slides are here: https://pythonspeed.com/pycon2025/slides/

The linked-from-original-article ijson article was the inspiration for the talk: https://pythonspeed.com/articles/json-memory-streaming/

tomrod•4mo ago
I have a side question -- what did you use for slides?
itamarst•4mo ago
https://remarkjs.com/
fjasdfas•4mo ago
So are there downsides to just always setting slots=True on all of my python data types?
itamarst•4mo ago
You can't add extra attributes that weren't part of the original dataclass definition:

  >>> from dataclasses import dataclass
  >>> @dataclass
  ... class C: pass
  ... 
  >>> C().x = 1
  >>> @dataclass(slots=True)
  ... class D: pass
  ... 
  >>> D().x = 1
  Traceback (most recent call last):
    File "<python-input-4>", line 1, in <module>
      D().x = 1
      ^^^^^
  AttributeError: 'D' object has no attribute 'x' and no __dict__ for setting new attributes
Most of the time this is not a thing you actually need to do.
masklinn•4mo ago
Also some of the introspection stops working e.g. vars().

If you're using dataclasses it's less of an issue, because dataclasses.asdict still works.

monomial•4mo ago
I rarely need to dynamically add attributes myself on dataclasses like this but unfortunately this also means things like `@cached_property` won't work because it can't internally cache the method result anywhere.
franga2000•4mo ago
IIRC you can just include a __dict__ slot and @cached_property should start working again.
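
A rough sketch of the `__dict__` slot idea. It uses a plain class with a hand-written `__slots__` rather than `@dataclass(slots=True)`, since that is the case I'm sure of; `Circle` and `StrictCircle` are made-up names:

```python
from functools import cached_property


class Circle:  # hypothetical example class
    # Listing "__dict__" as a slot gives instances a dict again,
    # which is where cached_property stores its result.
    __slots__ = ("radius", "__dict__")

    def __init__(self, radius: float) -> None:
        self.radius = radius

    @cached_property
    def area(self) -> float:
        return 3.14159 * self.radius ** 2


class StrictCircle:  # same class without the "__dict__" slot
    __slots__ = ("radius",)

    def __init__(self, radius: float) -> None:
        self.radius = radius

    @cached_property
    def area(self) -> float:
        return 3.14159 * self.radius ** 2


c = Circle(2.0)
first = c.area
cached = "area" in c.__dict__  # the computed value now lives in __dict__

# Without a __dict__, cached_property has nowhere to store the value.
try:
    StrictCircle(2.0).area
    strict_worked = True
except TypeError:
    strict_worked = False
```

The trade-off is that a `__dict__` slot also brings back arbitrary attribute assignment, and probably some of the per-instance memory that slots were meant to save.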
jmugan•4mo ago
My problem isn't running out of memory; it's loading in a complex model where the fields are BaseModels and unions of BaseModels multiple levels deep. It doesn't load it all the way and leaves some of the deeper parts as dictionaries. I need like almost a parser to search the space of different loads. Anyone have any ideas for software that does that?
causasui•4mo ago
You probably want to use Discriminated Unions https://docs.pydantic.dev/latest/concepts/unions/#discrimina...
jmugan•4mo ago
Yeah, I'm doing that
enragedcacti•4mo ago
The only reason I can think of for the behavior you are describing is if one of the unioned types at some level of the hierarchy is equivalent to Dict[str, Any]. My understanding is that Pydantic will explore every option provided recursively and raise a ValidationError if none match but will never just give up and hand you a partially validated object.

Are you able to share a snippet that reproduces what you're seeing?

jmugan•4mo ago
That's an interesting idea. It's possible there's a Dict[str,Any] in there. And yeah, my assumption was that it tried everything recursively, but I just wasn't seeing that, and my LLM council said that it did not. But I'll check for a Dict[str,Any]. Unfortunately, I don't have a minimal example, but making one should be my next step.
enragedcacti•4mo ago
One thing to watch out for while you debug is that the default 'smart' mode for union discrimination can be very unintuitive. As you can see in this example, an int vs a string can cause a different model to be chosen two layers up even though both are valid. You may have perfectly valid uses of Dict within your model that are being chosen in error because they result in less type coercion. left_to_right mode (or ideally discriminated unions if your data has easy discriminators) will be much more consistent.

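    >>> from typing import Any, Dict
    >>> from pydantic import BaseModel, Field

    >>> class A(BaseModel):
    ...     a: int
    >>> class B(BaseModel):
    ...     b: A
    >>> class C(BaseModel):
    ...     c: B | Dict[str, Any]

    >>> # int input: B matches without coercion, so smart mode picks it
    >>> C.model_validate({'c':{'b':{'a':1}}})
    C(c=B(b=A(a=1)))

    >>> # str input: B would need coercion, so smart mode prefers the Dict
    >>> C.model_validate({'c':{'b':{'a':"1"}}})
    C(c={'b': {'a': '1'}})

    >>> class C(BaseModel):
    ...     c: B | Dict[str, Any] = Field(union_mode='left_to_right')

    >>> # left_to_right tries B first and accepts it after coercing "1" to 1
    >>> C.model_validate({'c':{'b':{'a':"1"}}})
    C(c=B(b=A(a=1)))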
cbcoutinho•4mo ago
At some point, we have to admit we're asking too much from our tools.

I know nothing about your context, but in what context would a single model need to support so many permutations of a data structure? Just because software can, doesn't mean it should.

shakna•4mo ago
Anything multi-tenant? There's a reason Salesforce is used by so many large organisations. The multi-nesting lets you account for all the discrepancies that come with scale.

Just tracking payments through multiple tax regions will explode the places where things need to be tweaked.

not_skynet•4mo ago
going to shamelessly plug my own library here: https://github.com/mivanit/ZANJ

You can have nested dataclasses, as well as specify custom serializers/loaders for things which aren't natively supported by json.

jmugan•4mo ago
Ah, but I need something JSON-based.
not_skynet•4mo ago
It does allow dumping to/recovering from json, apologies if that isn't well documented.

Calling `x: str = json.dumps(MyClass(...).serialize())` will get you json you can recover to the original object, nested classes and custom types and all, with `MyClass.load(json.loads(x))`

m_ke•4mo ago
Or just dump pydantic and use msgspec instead: https://jcristharif.com/msgspec/
itamarst•4mo ago
msgspec is much more memory efficient out of the box, yes. Also quite fast.
mbb70•4mo ago
A great feature of pydantic are the validation hooks that let you intercept serialization/deserialization of specific fields and augment behavior.

For example, if you are querying a DB that returns a column as a JSON string, it's trivial with Pydantic to JSON-parse the column as part of deserialization with an annotation.

Pydantic is definitely slower and not a 'zero cost abstraction', but you do get a lot for it.

jtmcivor•4mo ago
One approach to do that in msgspec is described here https://github.com/jcrist/msgspec/issues/375#issuecomment-15...
aitchnyu•4mo ago
Can it do incremental parsing? Can't tell from a brief look.
jtmcivor•4mo ago
IIUC:

* You still need to load all the bytes into memory before passing to msgspec decoding

* You can decode a subset of fields, which is really helpful

* Reusing msgspec decoders saves some cpu cycles https://jcristharif.com/msgspec/perf-tips.html#reuse-encoder...

Slides 17, 18, 19 have an example of the first two points https://pythonspeed.com/pycon2025/slides/#17

zxilly•4mo ago
Maybe using mmap would also save some memory, I'm not quite sure if this can be implemented in Python.
itamarst•4mo ago
Once you switch to ijson it will not save any memory, no, because ijson essentially uses zero memory for the parsing. You're just left with the in-memory representation.
dgan•4mo ago
I gave up on Python dataclasses & JSON. I use protobuf objects within the application itself. I also have a "...Mixin" class for almost every wire model, with extra methods.

Automatic, statically typed deserialization is worth the trouble in my opinion

fidotron•4mo ago
Having only recently encountered this, does anyone have any insight as to why it takes 2GB to handle a 100MB file?

This looks highly reminiscent (though not exactly the same, pedants) of why people used to get excited about using SAX instead of DOM for xml parsing.

itamarst•4mo ago
I talk about this more explicitly in the PyCon talk (https://pythonspeed.com/pycon2025/slides/ - video soon) though that's not specifically about Pydantic, but basically:

1. Inefficient parser implementation. It's just... very easy to allocate way too much memory if you don't think about large-scale documents, and very difficult to measure. Common problem with many (but not all) JSON parsers.

2. CPython in-memory representation is large compared to compiled languages. So e.g. 4-digit integer is 5-6 bytes in JSON, 8 in Rust if you do i64, 25ish in CPython. An empty dictionary is 64 bytes.
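
A quick way to check the figures in point 2 on your own interpreter. This is only a sketch; exact sizes vary by CPython version and platform, so nothing here is asserted as the canonical number:

```python
import json
import sys

n = 1234
# Characters the number takes in JSON text, not counting the comma or
# whitespace around it.
json_bytes = len(json.dumps(n))
# Size of the boxed CPython int object holding the same value.
int_bytes = sys.getsizeof(n)
# Size of an empty dict object, before any keys are added.
dict_bytes = sys.getsizeof({})

print(f"JSON text:   {json_bytes} bytes")
print(f"CPython int: {int_bytes} bytes")
print(f"empty dict:  {dict_bytes} bytes")
```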

cozzyd•4mo ago
Funny to see awkward array in this context! (And... do people really store giant datasets in json?!?).
jfb•4mo ago
My sweet summer child
chao-•4mo ago
Often the legacy of an engineer (or team) who "did what they had to do" to meet a deadline, and if they wanted to migrate to something better post-launch, weren't allowed to allocate time to go back and do so.

At least JSON or CSV is better than the ad hoc homegrown formats you found at medium-sized companies that came out of the 90's and 00's.

ljm•4mo ago
Some people even use AI-generated JSON as a semantic layer over their SQL.
CJefferson•4mo ago
For a 100MB file to take 2GB once parsed, the in-memory representation has to be 20x the size of the file.

Let's imagine the file is mostly full of single digit numbers with no spaces (so lists like 2,4,1,0,9,3...).

Each number takes 2 bytes in the file (a digit plus a comma), so a 20x increase means we need to spend 40 bytes storing a number.

Make a minimal sized class to store an integer:

    class JsonInt:
        x = 1
That object's size is already 48 bytes.

Usually we store floats from JSON; the size of 1 as a float in Python is 24 bytes.

Now, you can get smaller, but as soon as you introduce any kind of class structure or not parsing numbers until they are used (in case you want people to be able to interpret them as ints or floats), you blow through 20x memory size increase.
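
For anyone who wants to measure the blowup rather than estimate it, here is a small sketch using the standard library's `tracemalloc`. The synthetic document (a list of small records) is my own assumption and not the file from the article, so the ratio it prints will differ from 20x:

```python
import json
import tracemalloc

# Hypothetical synthetic document: a list of small records.
records = [{"id": i, "score": i / 7} for i in range(20_000)]
text = json.dumps(records)
del records  # keep only the JSON text around

# Trace only the allocations made while parsing.
tracemalloc.start()
parsed = json.loads(text)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Memory still held by the parsed objects, relative to the text size.
ratio = current / len(text)
print(f"JSON text:      {len(text):,} bytes")
print(f"parsed objects: {current:,} bytes ({ratio:.1f}x the text)")
print(f"peak while parsing: {peak:,} bytes")
```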

fidotron•4mo ago
> We need to spend 40 bytes storing a number.

But . . . why? Assuming they aren't BigInts or similar these are maximum 8 bytes of actual data. This overhead is ridiculous.

Using classes should enable you to be much smaller than the JSON representation, not larger. For example, V8 does it like https://v8.dev/docs/hidden-classes

> not parsing numbers until they are used

Doesn't this defeat the point of pydantic? It's supposed to be checking the model is valid as it's loaded using jiter. If the data is valid it can be loaded into an efficient representation, and if it's not the errors can be emitted during iterating over it.

jerf•4mo ago
"But . . . why?"

This is CPython. This is how it works. It's not particularly related to JSON. That sort of overhead is put on everything. It just hurts the most when the thing you're putting the overhead on is a single integer. It hurts less when you're doing it to, say, a multi-kilobyte string.

Even in your v8 example, that's a JIT optimization, not "how the language works". You break that optimization, which you can do at any moment with any change in your code base, you're back to similar sizes.

Boxing everything lets you easily implement the dynamic scripting language's way of treating everything as an Object of some sort, but it comes at a price. There's a reason dynamic scripting languages, even after the JIT has come through, are generally substantially slower languages. This isn't the only reason, but it's a significant part of it.

fidotron•4mo ago
> Even in your v8 example, that's a JIT optimization, not "how the language works". You break that optimization, which you can do at any moment with any change in your code base, you're back to similar sizes.

The whole point of the v8 optimization is it works in the face of prototype chains that merge etc. as you add new fields dynamically so if you change your code base it adapts.

deepsquirrelnet•4mo ago
Alternatively, if you had to go with json, you could consider using jsonl. I think I’d start by evaluating whether this is a good application for json. I tend to only want to use it for small files. Binary formats are usually much better in this scenario.
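
A minimal sketch of why jsonl helps with memory: each line is a complete JSON document, so the standard `json` module can parse one record at a time. The `iter_jsonl` helper and the sample data are made up for illustration:

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for an open file; in practice this would be a real file handle.
sample = io.StringIO(
    '{"id": 1, "name": "a"}\n'
    '{"id": 2, "name": "b"}\n'
    '\n'
    '{"id": 3, "name": "c"}\n'
)

total = 0
for record in iter_jsonl(sample):
    total += record["id"]  # only one record is held in memory at a time

print(total)
```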
kayson•4mo ago
How does the speed of the dataclass version compare?
scolvin•4mo ago
Pydantic author here. We have plans for an improvement to pydantic where JSON is parsed iteratively, which will make way for reading a file as we parse it. Details in https://github.com/pydantic/pydantic/issues/10032.

Our JSON parser, jiter (https://github.com/pydantic/jiter) already supports iterative parsing, so it's "just" a matter of solving the lifetimes in pydantic-core to validate as we parse.

This should make pydantic around 3x faster at parsing JSON and significantly reduce the memory overhead.

Lucasoato•4mo ago
Pydantic is a life changing library, thanks so much for your work!
adeeshaek•4mo ago
Seconded. Please keep up the awesome work!
itamarst•4mo ago
That's great! Would also be cool (separately from Pydantic use case) to add jiter backend to ijson.
