Parse, Don't Validate (2019)

https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/

101•shirian•2h ago

Comments

seanwilson•46m ago

Maybe I'm missing something and I'm glad this idea resonates, but it feels like sometime after Java got popular and dynamic languages got a lot of mindshare, a large chunk of the collective programming community forgot why strong static type checking was invented and are now having to rediscover this.

In most strong statically typed languages, you wouldn't often pass strings and generic dictionaries around. You'd naturally gravitate towards parsing/transforming raw data into typed data structures that have guaranteed properties instead e.g. a Date object that would throw an exception in the constructor if the string given didn't validate as a date (Edit: Changed this from email because email validation is a can of worms as an example). So there, "parse, don't validate" is the norm and not a tip/idea that would need to gain traction.
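For illustration, a minimal Java sketch of the Date example, assuming the input is ISO-8601 text (`yyyy-MM-dd`). The `Dates` class and its method names are hypothetical; `LocalDate.parse` is the standard `java.time` call and rejects impossible dates such as February 30.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoUnit;

// Hypothetical helper class, for illustration only.
final class Dates {
    // Edge of the system: raw text comes in, a typed date (or an exception) comes out.
    static LocalDate parseIsoDate(String raw) {
        try {
            return LocalDate.parse(raw); // strict parsing: "2023-02-30" is rejected
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException("Not a valid ISO date: " + raw, e);
        }
    }

    // Interior code takes LocalDate, so it never has to re-check the input.
    static long daysBetween(LocalDate start, LocalDate end) {
        return ChronoUnit.DAYS.between(start, end);
    }
}
```

Everything past `parseIsoDate` works with `LocalDate` rather than `String`, so the "is this really a date?" question is answered exactly once.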

bcrosby95•41m ago

In my experience that's pretty rare. Most people pass around string phone numbers instead of a phonenumber class.

Java makes it a pain though, so most code ends up primitive obsessed. Other languages make it easier, but unless the language and company have a strong culture around this, they still usually end up primitive obsessed.

vips7L•40m ago

    record PhoneNumber(String value) {}

Huge pain.

kleiba•31m ago

What have you gained?

jalk•24m ago

An explicit type

dylan604•12m ago

Obviously the pseudocode leaves a lot to the imagination, but what benefits does this give you? Are you checking that it is 10 digits? Are you allowing for + symbols for the international codes?

munk-a•9m ago

That's going to be up to the business building the logic. Ideally those assumptions are clearly encoded in an easily readable manner, but at the very least they should be captured somewhere code-adjacent (even if it's just a comment and the block of logic to enforce those constraints).

bjghknggkk•2m ago

And parentheses. And spaces (that may, or may not, be trimmed). And all kinds of Unicode-equivalent characters that might have to be canonicalized. Why not treat it as a byte buffer anyway?

munk-a•14m ago

Without any other context? Nothing - it's just a type alias...

But the context this type of an alias should exist in is one where a string isn't turned into a PhoneNumber until you've validated it. All the functions taking a string that might end up being a PhoneNumber need to be highly defensive - but all the functions taking a PhoneNumber can lean on the assumptions that go into that type.

It's nice to have tight control over the string -> PhoneNumber parsing that guarantees all those assumptions are checked. Ideally that'd be done through domain based type restrictions, but it might just be code - either way, if you're diligent, you can stop being defensive in downstream functions.
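A rough Java sketch of that string -> PhoneNumber boundary. The acceptance rule used here (optional `+`, then 7 to 15 ASCII digits after stripping spaces, parentheses, dots and hyphens) is purely an assumption for illustration; as noted elsewhere in the thread, the real rules are up to the business. The `parse` method name is also hypothetical.

```java
import java.util.Optional;

// The rule below (optional "+", then 7 to 15 digits) is an assumption for
// illustration, not a complete phone number standard.
record PhoneNumber(String value) {
    // Compact constructor: no PhoneNumber can exist without passing this check.
    PhoneNumber {
        if (value == null || !value.matches("\\+?[0-9]{7,15}")) {
            throw new IllegalArgumentException("Not a phone number: " + value);
        }
    }

    // The single place where untrusted text becomes a PhoneNumber.
    static Optional<PhoneNumber> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        // Drop common formatting characters before checking the digits.
        String normalized = raw.strip().replaceAll("[\\s().-]", "");
        try {
            return Optional.of(new PhoneNumber(normalized));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```

A downstream method declared as, say, `void sendSms(PhoneNumber to)` can then skip its own checks, because the compiler will not let a raw `String` through.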

seanwilson•4m ago

> All the functions taking a string that might end up being a PhoneNumber need to be highly defensive

Yeah, I can't relate at all with not using a type for this after having to write gross defensive code a couple of times e.g. if it's not a phone number, return -1 or throw an exception? The typed approach is shorter, cleaner, self-documenting, reduces bugs and makes refactoring easier.

yakshaving_jgt•41m ago

It's a design choice more than anything. Haskell's type safety is opt-in — the programmer has to actually choose to properly leverage the type system and design their program this way.

pjerem•40m ago

> In most strong statically typed languages, you wouldn't often pass strings and generic dictionaries around.

In 99% of the projects I have worked on in my professional life, anything that comes from human input is manipulated as a string, and most of the time it stays like this in all of the application layers (with more or fewer checks along the way).

On your precise example, I can even say that I never saw something like an "Email object".

Boxxed•22m ago

Well that's terrifying

tracker1•20m ago

What's funny is that this is exactly one of the reasons I happen to like JavaScript... at its core, the type coercion and falsy boolean rules work really well (imo) for ETL-type work, where you're dealing with potentially untrusted data. How many times have you had to import a CSV with a bad record/row? It seems to happen all the time. Why? Because people use and manually manipulate data in spreadsheets.

In the end, it's a big part of why I tend to reach for JS/TS first (Deno) for most scripts that are even a little complex to attempt in bash.

jghn•15m ago

I've seen a mix between stringly typed apps and strongly typed apps. The strongly typed apps had an upfront cost but were much better to work with in the long run. Define types for things like names, email address, age, and the like. Convert the strings to the appropriate type on ingest, and then inside your system only use the correct types.

Tomte•2m ago

> email address

Unfortunately, developers usually don't look up in RFCs what the syntactic rules are, but use their "common sense". And so my perfectly valid (main) mail address is rejected in several apps and web sites. Bonus points for allowing me to sign up and then reject at login time.

So, yes, you should create your own type, but if you can't be bothered to do it right, please use a string.

rileymichael•2m ago

this is likely an ecosystem sort of thing. if your language gives you the tools to do so at no cost (memory/performance) then folks will naturally utilize those features and it will eventually become idiomatic code. kotlin value classes are exactly this and they are everywhere: https://kotlinlang.org/docs/inline-classes.html

wat10000•28m ago

I'm not sure, maybe a little bit. My own journey started with BASIC and then C-like languages in the 80s, dabbling in other languages along the way, doing some Python, and then transitioning to more statically typed modern languages in the past 10 years or so.

C-like languages have this a little bit, in that you'll probably make a struct/class from whatever you're looking at and pass it around rather than a dictionary. But dates are probably just stored as untyped numbers with an implicit meaning, and optionals are a foreign concept (although implicit in pointers).

Now, I know that this stuff has been around for decades, but it wasn't something I'd actually use until relatively recently. I suspect that's true of a lot of other people too. It's not that we forgot why strong static type checking was invented, it's that we never really knew, or just didn't have a language we could work in that had it.

conartist6•23m ago

I think you're quite right that the idea of "parse don't validate" is (or can be) quite closely tied to OO-style programming.

Essentially the article says that each data type should have a single location in code where it is constructed, which is a very class-based way of thinking. If your Java class only has a constructor and getters, then you're already home free.

Also for the method to be efficient you need to be able to know where an object was constructed. Fortunately class instances already track this information.

Archelaos•11m ago

Strong static type checking is helpful when implementing the methodology described in this article, but it is beside its focus. You still need to use the most restrictive type. For example, uint instead of int when you want to exclude negative values; a non-empty list type if your list should not be empty; etc.

When the type is more complex, specific constraints should be used. A real-life example: I designed a type for the occupancy of a room in a hotel booking application. The number of occupants of a room must be positive and a child must be accompanied by at least one adult. My type Occupants has a constructor Occupants(int adults, int children) that verifies that condition on construction (and also some maximum values).
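A sketch of what such a type could look like as a Java record. The maximum values are placeholders, since the comment does not state them, and the `total` accessor is added only for illustration.

```java
// Sketch of the Occupants type described above; the limits are assumed values.
record Occupants(int adults, int children) {
    static final int MAX_ADULTS = 4;   // assumed limit, not from the original comment
    static final int MAX_CHILDREN = 4; // assumed limit, not from the original comment

    // Compact constructor: every Occupants instance satisfies all rules below.
    Occupants {
        if (adults < 0 || children < 0) {
            throw new IllegalArgumentException("Counts must not be negative");
        }
        if (adults + children < 1) {
            throw new IllegalArgumentException("A room needs at least one occupant");
        }
        if (children > 0 && adults < 1) {
            throw new IllegalArgumentException("Children must be accompanied by an adult");
        }
        if (adults > MAX_ADULTS || children > MAX_CHILDREN) {
            throw new IllegalArgumentException("Too many occupants for one room");
        }
    }

    int total() {
        return adults + children;
    }
}
```

Code that receives an `Occupants` value, such as pricing or room assignment, can rely on these rules without repeating them.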

css_apologist•9m ago

This is an idea that is not ON or OFF

You can get ever so gradually stricter with your types, which means that the operations you perform on a narrow type are even more solid

It is also 100% possible to do in dynamic languages, it's a cultural thing

jackpirate•1m ago

> Edit: Changed this from email because email validation is a can of worms as an example

Email honestly seems much more straightforward than dates... Sweden had a Feb 30 in 1712, and there's all sorts of date ranges that never existed in most countries (e.g. the American colonies skipped September 3-13 in 1752).

macintux•45m ago

A frequent visitor to HN. Tip: if you click on the "past" link under the title (but not the "past" link at the top of the page), you'll trigger a search for previous posts.

https://hn.algolia.com/?query=Parse%2C%20Don%27t%20Validate&...

However, it's more effective to throw quotes into the mix, reduces false positives.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

pcwelder•42m ago

Each repost is worth it.

This, along with John Ousterhout's talk [1] on deep interfaces was transformational for me. And this is coming from a guy who codes in python, so lots of transferable learnings.

[1] https://www.youtube.com/watch?v=bmSAYlu0NcY

curiousgal•37m ago

Semi tangent but I am curious. for those with more experience in python, do you just pass around generic Pandas Dataframes or do you parse each row into an object and write logic that manipulates those instead?

lmeyerov•27m ago

Pass as immutable values, and try to enforce schema (eg, arrow) to keep typed & predictable. This is generally easy by ensuring initial data loads get validated, and then basic testing of subsequent operations goes far.

If python had dependent types, that's how i'd think about them, and keeping them typed would be even easier, eg, nulls sneaking in unexpectedly and breaking numeric columns

When using something like dask, which forces stronger adherence to typings, this can get more painful

yakshaving_jgt•36m ago

I did a lightning talk on this topic last year, with a concrete example in Yesod.

https://www.youtube.com/watch?v=MkPtfPwu3DM

zdw•34m ago

This is a great article, but people often trip over the title and draw unusual conclusions.

The point of the article is about locality of validation logic in a system. Parsing in this context can be thought of as consolidating the logic that makes all structure and validity determinations about incoming data into one place in the program.

This lets you then rely on the fact that you have valid data in a known structure in all other parts of the program, which don't have to be crufted up with validation logic when used.

Related, it's worth looking at tools that further improve structure/validity locality like protovalidate for protobuf, or Schematron for XML, which allow you to outsource the entire validity checking to library code for existing serialization formats.

jmholla•22m ago

When I came to this idea on my own, I called it "translation at the edge." But for me it was more than just centralizing data validation; it was also about giving you access to all the tools your programming language has for manipulating data.

My main example was working with a co-worker whose application used a number of timestamps. They were passing them around as strings and parsing and doing math with them at the point of usage. But, by parsing the inputs into the language's timestamp representation, their internal interfaces were much cleaner and their purpose was much more obvious since that math could be exposed at the invocation and not the function logic, and thus necessarily, through complex function names.
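A small Java sketch of that timestamp case, assuming the inputs are ISO-8601 UTC strings such as `2024-01-01T10:00:00Z`. The class and method names are hypothetical; `Instant.parse` and `Duration.between` are the standard `java.time` calls.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical example class, for illustration only.
final class JobTiming {
    // Parse once at the edge; throws DateTimeParseException on malformed input.
    static Instant parseTimestamp(String raw) {
        return Instant.parse(raw);
    }

    // Interior code works on Instant and Duration and never sees the raw strings.
    static boolean exceeded(Instant startedAt, Instant finishedAt, Duration limit) {
        return Duration.between(startedAt, finishedAt).compareTo(limit) > 0;
    }
}
```

The signature of `exceeded` already says what it needs, and the date math lives in the library types instead of in string handling scattered across call sites.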

solomonb•14m ago

I disagree. I think the key insight is to carry the proof with you in the structure of the type you 'parse' into.
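To make "carry the proof with you" concrete, here is a hedged Java sketch of a non-empty list. `NonEmptyList` is a hypothetical class written for this example, not a standard library type. Note that `List.copyOf` rejects null elements, which this sketch simply accepts as a limitation.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical type: holding a NonEmptyList is itself the evidence that the
// emptiness check already happened.
final class NonEmptyList<T> {
    private final T head;
    private final List<T> tail;

    private NonEmptyList(T head, List<T> tail) {
        this.head = head;
        this.tail = tail;
    }

    // The check happens once, here; an empty input yields Optional.empty().
    static <T> Optional<NonEmptyList<T>> parse(List<T> items) {
        if (items.isEmpty()) {
            return Optional.empty();
        }
        List<T> rest = List.copyOf(items.subList(1, items.size()));
        return Optional.of(new NonEmptyList<>(items.get(0), rest));
    }

    // Total: no exception and no Optional, because the type rules out "empty".
    T head() {
        return head;
    }

    List<T> tail() {
        return tail;
    }
}
```

A plain validator would return a boolean and leave callers holding the same `List`, so every later `get(0)` would have to trust that someone checked. Here the result type records the outcome of the check.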

munk-a•12m ago

I think that's an excellent way to build a defensive parsing system but... I still want to build that and then put a validator in front of it to run a lot of the common checks and make sure we can populate easy to understand (and voluminous) errors to the user/service/whatever. There is very little as miserable as loading a 20k CSV file into a system and receiving "Invalid value for name on line 3" knowing that there are likely a plethora of other issues that you'll need to discover one by one.
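One possible way to get both, sketched in Java under heavy assumptions: a parser that keeps going after the first bad row and reports every problem it finds. The `Person`, `ImportResult` and `CsvImport` names are hypothetical, and the input is assumed to be simple `name,age` lines with no quoting or embedded commas.

```java
import java.util.ArrayList;
import java.util.List;

record Person(String name, int age) {}

// Either all parsed rows with no errors, or all errors with no rows.
record ImportResult(List<Person> people, List<String> errors) {
    boolean ok() {
        return errors.isEmpty();
    }
}

final class CsvImport {
    // Assumes simple "name,age" lines; real CSV needs a proper CSV library.
    static ImportResult parse(List<String> lines) {
        List<Person> people = new ArrayList<>();
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            int lineNo = i + 1;
            int before = errors.size();
            String[] cols = lines.get(i).split(",", -1);
            if (cols.length != 2) {
                errors.add("line " + lineNo + ": expected 2 columns, got " + cols.length);
                continue;
            }
            String name = cols[0].strip();
            if (name.isEmpty()) {
                errors.add("line " + lineNo + ": name is empty");
            }
            int age = 0;
            try {
                age = Integer.parseInt(cols[1].strip());
                if (age < 0) {
                    errors.add("line " + lineNo + ": age is negative");
                }
            } catch (NumberFormatException e) {
                errors.add("line " + lineNo + ": age is not a number");
            }
            // Only rows that produced no new errors become typed values.
            if (errors.size() == before) {
                people.add(new Person(name, age));
            }
        }
        return errors.isEmpty()
                ? new ImportResult(List.copyOf(people), List.of())
                : new ImportResult(List.of(), List.copyOf(errors));
    }
}
```

The caller still only ever receives typed `Person` values, but a failed import lists all offending lines at once instead of one per attempt.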

danieltanfh95•21m ago

Hot take: Static typing is often touted as the end all be all, and all you need to do is "parse, don't validate" at the edge of your program and everything is fine and dandy.

In practice, I find that staunch static typing proponents are often middle or junior engineers that want to work with an idealised version of programming in their heads. In reality what you are looking for is "openness" and "consistency", because no amount of static typing will save you from poorly defined or optimised-too-early types that encode business logic constraints into programmatic types.

This is also why in practice a lot of customer input ends up being passed as "strings" or has a raw copy + parsed copy, because business logic will move faster than whatever code you can write and fix, and exposing it as just "types" breaks the process for future programmers to extend your program.

solomonb•11m ago

This is such a tired take. The burden of using static types is incredibly minimal and makes it drastically simpler to redesign your program around changing business requirements while maintaining confidence around program behavior.

jghn•11m ago

> I find that staunch static typing proponents are often middle or junior engineers

I wouldn't go this far as it depends on when the individual is at that phase of their career. The software world bounces between hype cycles for rigorous static typing and full on dynamic typing. Both options are painful.

I think what's more often the case is that engineers start off by experiencing one of these poles and then after getting burned by it they run to the other pole and become zealous. But at some point most engineers will come to realize that both options have their flaws and find their way to some middle ground between the two, and start to tune out the hype cycles.

kayo_20211030•13m ago

A great piece.

Unfortunately, it's somewhat of a religious argument about the one true way. I've worked on both sides of the fence, and each field is equally green in its own way. I've used OCaml, with static typing, and Clojure, with maybe-opt-in schema checking. They both work fine for real purposes.

The big problem arrives when you mix metaphors. With typing, you're either in, or you're out - or should be. You ought not to fall between stools. Each point of view works fine, approached in the right way, but don't pretend one thing is the other.

r4victor•2m ago

It seems modern statically-typed and even dynamically-typed languages all adopted this idea, except Go, where they decided zero values represent valid states always (or mostly).

A sincere question to Go programmers – what's your take on "Parse, Don't Validate"?
