With relatively minimal effort, I was able to spin up a little standalone container that wrapped around the service and exposed a basic API to parse a raw address string and return it as structured data.
Address parsing is definitely an extremely complex problem space with practically infinite edge cases, but libpostal does just about as well as I could expect it to.
They've managed to create a great working implementation of a very, very small model of a very specific subset of language.
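For anyone curious what such a wrapper might look like, here is a minimal sketch, not the setup described above. It assumes the Python bindings (pypostal) are installed, where `postal.parser.parse_address` returns a list of `(value, label)` pairs. The `/parse` route, the port, and the helper names `to_structured`, `make_handler` and `serve` are made up for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def to_structured(pairs):
    """Turn [(value, label), ...] into {label: "value ..."}; repeated labels are joined."""
    fields = {}
    for value, label in pairs:
        fields.setdefault(label, []).append(value)
    return {label: " ".join(values) for label, values in fields.items()}


def make_handler(parse):
    """Build a request handler around any callable that maps a raw string to (value, label) pairs."""

    class ParseHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Hypothetical route: POST the raw address as the request body to /parse.
            if self.path != "/parse":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            raw = self.rfile.read(length).decode("utf-8")
            body = json.dumps(to_structured(parse(raw))).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return ParseHandler


def serve(port=8080):
    # Imported here so the rest of the module works without libpostal installed.
    from postal.parser import parse_address

    HTTPServer(("0.0.0.0", port), make_handler(parse_address)).serve_forever()
```

Calling `serve()` inside the container starts the service. Passing the parser in as an argument keeps the HTTP part testable with a stub.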
<https://news.ycombinator.com/item?id=18775099> Libpostal: A C library for parsing/normalizing street addresses around the world - 117 points by polm23 on Dec 29, 2018 (25 comments)
<https://news.ycombinator.com/item?id=11173920> Libpostal: international street address parsing in C trained on OpenStreetMap (mapzen.com) 74 points by riordan on Feb 25, 2016 (7 comments)
The problem is that the hardest-to-parse addresses are often also the hardest to match, which makes the problem somewhat circular. I wrote about this more in a recent blog post on address matching: https://www.robinlinacre.com/address_matching/
Discussed on HN here: https://news.ycombinator.com/item?id=8907301
Something I had no idea about until I worked on a project that dealt with customer data: many companies also use commercial services for address and phone number validation and normalization.
Addresses are fundamentally unstructured data. You can't validate them structurally. It's trivial to create nonexistent addresses which any parsing library will parse just fine. On the flipside, there's enough variety in real addresses that your parser has to be extremely tolerant in what it accepts--so tolerant that it basically tolerates everything. The entire purpose of a parser for addresses is to reject invalid addresses, so if your parser tolerates everything it's pointless.
The only validation that makes any sense is "does this address exist in the real world?". And the way to do that is not parsing, it's by comparing to a dataset of all the addresses in the world.
I haven't evaluated this project enough to understand confidently what they're doing, but I hope they're approaching this as a search engine for address datasets, and not as a parsing/normalizing library.
A simple example of how messy this gets when people try to constrain it: it was nearly random whether a given carrier would insist that I give an incorrect address for my previous place, seemingly because until 1965 the address was in Surrey, England.
The "postcode area name" for my old house is Croydon, and Croydon has legally been in London since 1965, and was allocated its own postcode area in 1966. "Surrey" hasn't been correct for addresses in Croydon since then.
But at least one delivery company insisted my old address was invalid unless I changed the town/postcode area to "Surrey", and refused to even attempt a delivery. Never mind they had my house number and postcode, which was sufficient to uniquely identify my house.
But notably, to validate a parser/normalizer, you need this dataset anyway, so creating a parser/normalizer isn't even saving you that work. It's just giving you a worse result for more work.
You are equating two things that are not equatable.
It sounds like people at those businesses equated a dataset to the real world, not me. You're an adult, direct your frustrations appropriately.
> Data is not the same as reality.
That glosses over a lot of nuance.
Obviously, no dataset perfectly represents reality. But, this fact is often used to dismiss data entirely, resulting in people making decisions with absolutely no evidence whatsoever.
An appropriate use of an address database might be: when the user enters an address not in the database, do a fuzzy search and suggest the best match you can find, asking "Did you mean X?" At that point, if the user says, "No, I really meant what I put in," then you accept the data they gave you. This catches most mistakes while allowing users to put in addresses that aren't in your dataset.
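A rough sketch of that flow, using only the Python standard library. This is one possible shape, not a recommendation of a particular matcher: `difflib` stands in for whatever fuzzy search you'd really use, the 0.8 cutoff is arbitrary, and `check_address`, `resolve` and the sample addresses are made up for illustration.

```python
import difflib


def normalize(address):
    """Lowercase, drop commas and collapse whitespace so trivial differences don't count."""
    return " ".join(address.lower().replace(",", " ").split())


def check_address(entered, known_addresses, cutoff=0.8):
    """Return ("ok", match), ("suggest", best_guess) or ("unknown", None)."""
    by_norm = {normalize(a): a for a in known_addresses}
    key = normalize(entered)
    if key in by_norm:
        return ("ok", by_norm[key])
    close = difflib.get_close_matches(key, list(by_norm), n=1, cutoff=cutoff)
    if close:
        return ("suggest", by_norm[close[0]])
    return ("unknown", None)


def resolve(entered, known_addresses, accept_suggestion):
    """Ask "Did you mean X?" via accept_suggestion; the user's own input always wins."""
    status, candidate = check_address(entered, known_addresses)
    if status == "suggest" and accept_suggestion(candidate):
        return candidate
    # Not in the dataset, or the user insisted: accept what they typed.
    return entered
```

The important part is the last line of `resolve`: the dataset only ever produces a suggestion, never a rejection.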
> The entire purpose of a parser for addresses is to reject invalid addresses, so if your parser tolerates everything it's pointless.
It's bizarre to me that you're telling me I said things I didn't say, and then quoting things that don't say what you're claiming they say.
I'm saying that they should not use the parser, because the only ways it can influence the site's behavior are too buggy to be useful.
Third on right of main,
Tiwi College,
Melville Island, 0822, AU.
You can try to normalize that... But "Main Road" is in another city. Because I wasn't living in a city. There were no road names. And the 3rd position was an empty plot, not the third house. We had a bunch of houses around a strip of land, a few minutes from the airstrip - the only egress.
Streetname 5, behind the glazier business.
It might say <some other name> on the door
That's very specific, but also not really an address.
(For today’s 10000, that’s Terry Pratchett. The autocrat of the city of Ankh-Morpork amuses himself, at times, by figuring out where unreadably-addressed mail should go - in this case, a baker (“duzbuns” == does buns) across the street (“hopsit” == opposite) from a pharmacy, which in his extremely detailed knowledge of the city means only one place.)
After years of undeliverable mail it was found that the building permit for the dorm was registered incorrectly by the city and as a result the rooms were never registered as residential addresses in the postal DB.
What are some others?
IIRC it takes gigs of storage space and has significant runtime requirements.
Also, while it's implemented in C, there are language bindings for most major languages [1].
It's one of those things where it's most likely best deployed as an independent service on a dedicated machine.
[1] https://github.com/openvenues/libpostal?tab=readme-ov-file#b...
jandrese•7mo ago
monero-xmr•7mo ago
derdi•7mo ago
Why would one try to "verify" addresses that one knows nothing about?
> because the mailman "just knows"
The mailman does "just know", and the mailman is who the address is for. Web forms I have seen that have tried to "verify" my address have never done so in a way that made the address better for the mailman.
EDIT: I've long thought that web forms should not have separate "street", "street line 2", "number", "apartment", "whatever" fields. Instead they should offer a multi-line input field labeled "this will go straight on the address label, write whatever you like but it's your problem if it doesn't arrive". You'd probably still need separate fields for town/postcode for calculating postage. And of course it wouldn't work because the downstream delivery company would also insist on something it can "verify".
kevin_thibedeau•7mo ago
devilbunny•7mo ago
I would be more suspicious of this story if I hadn’t seen that the registers were, actually, in the back. And they didn’t have a pickup window back there or anything.
jandrese•7mo ago
So you aren't shipping your product to some place that doesn't exist. Also, some KYC requires that you verify the address of the person.
derdi•7mo ago
But businesses can't usually verify whether a place exists. The best they can usually do is to verify whether a place has an entry in their database of supposedly all places that supposedly existed at a point in time that is necessarily in the past.
That's not the same thing. Trust me, I would know: I live in a new-ish building, and for at least two years after it was completed and people were living here, some businesses still refused to take my money because they claimed that my address didn't exist. That was neither in their interest nor in mine.
> Also, some KYC requires that you verify the address of the person.
Define "verify". Verify that they provided some address that exists somewhere, possibly unconnected to the person? Worthless. Verify that they can receive mail at said address? OK, but doesn't require you to parse the address, just to print it onto a label and let the post office worry about it.
mpeg•7mo ago
grapesodaaaaa•7mo ago
The FAA even legally accepts “third house down from the barn” in some instances.
The KYC scenario is different, and a PITA for people like me, because I spent half my life without a physical mailing address (we picked up our mail at the post office).
The real world is messy, and I feel like SV and finance have done a lot of hand-waving to ignore this.