(For my purposes, I went with running it locally, generating walking-distance isochrones across pretty much the entire UK)
How would you give directions to something in the middle of a park?
Rarely if ever do people use road names to direct pedestrians, or even car drivers. I guess people just don't know them. I wouldn't.
Info and schema are here: https://download.geonames.org/export/dump/readme.txt
Could be a good source. Not sure how good it is worldwide, but for the countries I've used it for, it's been useful and pretty good.
Try the search too, https://www.geonames.org/search.html?q=R%C3%ADo+grande&count...
Not just roads; there are rivers and other things too.
Thanks!
You can use the admin fields; it's a recursive query to find them.
I have a recursive CTE for it (thanks to ChatGPT).
Could also be done on save, since they shouldn’t change for locations.
The recursion does give you a benefit, though: if you extract the type and save the intermediate steps, it lets you start grouping things together at different levels, which is one of the use cases you mentioned.
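As a rough illustration (a toy `places` table with a `parent_id`, not the actual Geonames schema), the walk-up-and-save-each-level idea looks something like this in SQLite:

```python
# Minimal sketch: walk the admin hierarchy upwards with a recursive CTE
# and keep each intermediate level for later grouping. The table layout
# here is invented for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER, level TEXT);
INSERT INTO places VALUES
  (1, 'United Kingdom', NULL, 'country'),
  (2, 'England',        1,    'admin1'),
  (3, 'Greater London', 2,    'admin2'),
  (4, 'Hyde Park',      3,    'poi');
""")

rows = conn.execute("""
WITH RECURSIVE ancestry(id, name, parent_id, level, depth) AS (
    SELECT id, name, parent_id, level, 0 FROM places WHERE id = ?
    UNION ALL
    SELECT p.id, p.name, p.parent_id, p.level, a.depth + 1
    FROM places p JOIN ancestry a ON p.id = a.parent_id
)
SELECT level, name FROM ancestry ORDER BY depth
""", (4,)).fetchall()

# Saving each (level, name) pair lets you group results later by
# country, admin1, admin2, and so on.
print(rows)  # [('poi', 'Hyde Park'), ('admin2', 'Greater London'), ...]
```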
It has fairly comprehensive coverage of countries, cities, and major landmarks. It also has stable, simple identifiers that are somewhat of a lingua-franca in the geospatial data world (i.e. Geonames ID 5139572 points to the Statue of Liberty and if you have other data that you need to unambiguously associate with the one Statue of Liberty in New York Harbor, putting a `geonames_id` column in your database with that integer will pretty much solve it, and will allow anyone else you work with to understand the connection clearly too).
However, to be honest, it hasn't really kept pace with modern times. The velocity of changes and updates is pretty low, it doesn't actively grow the community anymore. The data format is simple and rigid and built on old tech that's increasingly hard to work with. You can trust Geonames to have the Statue of Liberty, but not the latest restaurants in NYC.
For a problem like the post author has of finding ways everyday people can easily navigate to something like a park bench that might not have a single address associated with it, or even if it does, needs more granularity to find _that_ specific bench in a park with 100 benches, Geonames probably won't help.
Source: I'm co-founder of Geocode Earth, one of the geocoding companies linked in the blog post. We use Geonames as one source of POI data amongst many others.
This problem has been the subject of intense interest by the defense research community for decades. It has been conjectured to be an AI-complete type problem for at least ten years, i.e. solving it is equivalent to solving AGI. The current crop of LLM type AI persistently fails at this class of problems, which is one of the arguments for why LLM tech can’t lead to true AGI.
I had long conversations with Esri's projection engine lead. Really remarkable guy - he's got graduate degrees in geography and math (including a PhD) and he's an excellent C/C++ developer. That kind of expertise trifecta is rare. I'd walk by his office and sometimes see him working out an integral of a massive equation on his whiteboard (not that he didn't also use a CAS). "Oh yeah, I'm adding support for a new projection this week."
You write a query for all the different kinds of addresses you'd like to display. The result is a list of valid candidate addresses for the point, each matching at least one format, which you can then rank based on whatever criteria you like.
[0] https://developers.google.com/maps/documentation/geocoding/r...
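Something like this sketch against the reverse geocoding endpoint in [0]; the preference order is just an example, not anything official:

```python
# Hedged sketch of "query several address kinds, then rank the candidates"
# against Google's reverse geocoding API. The ranking order below is an
# illustrative preference list, not a recommendation.
import requests

PREFERRED_TYPES = ["street_address", "premise", "neighborhood", "locality"]

def candidate_addresses(lat, lng, api_key):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={
            "latlng": f"{lat},{lng}",
            "result_type": "|".join(PREFERRED_TYPES),
            "key": api_key,
        },
        timeout=10,
    )
    results = resp.json().get("results", [])

    # Rank each candidate by the best-preferred type it carries.
    def rank(r):
        return min((PREFERRED_TYPES.index(t) for t in r["types"]
                    if t in PREFERRED_TYPES), default=len(PREFERRED_TYPES))

    return sorted(results, key=rank)
```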
If you have only a name field that holds a single value, that is going to be a crazy workaround. If your names reference a person with a date, that is much easier. But you need to make that decision pretty early.
Of course, you don't get that historical data, but you do get it going forward from there.
Basically you keep a history of all changes so you can always roll back / get that data if needed?
But this seems like it's probably a better solution.
I think the real magic is that it leverages the WAL (write-ahead log) from the Postgres engine itself, which you could certainly hook into too, but I'm not a DB expert here.
https://mariadb.com/kb/en/temporal-tables/
and if your schema doesn't change much, it's practically free to implement: much easier and simpler than copy-pasting audit tables or relying on codegen to do the same.
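A rough sketch of what the linked MariaDB feature looks like in use (connection details and the `locations` schema are made up):

```python
# Sketch of MariaDB system-versioned ("temporal") tables; requires a
# running MariaDB server. Names and credentials are placeholders.
import mariadb  # MariaDB Connector/Python

conn = mariadb.connect(user="app", password="secret", database="geo")
cur = conn.cursor()

# Adding WITH SYSTEM VERSIONING is all it takes to keep history.
cur.execute("""
    CREATE TABLE IF NOT EXISTS locations (
        id INT PRIMARY KEY,
        address VARCHAR(255)
    ) WITH SYSTEM VERSIONING
""")

# Normal UPDATEs keep working; superseded rows are retained automatically.
cur.execute("UPDATE locations SET address = '1 New Name St' WHERE id = 1")

# Ask what the row looked like at an earlier point in time.
cur.execute("""
    SELECT address FROM locations
    FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-01 00:00:00'
    WHERE id = 1
""")
```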
> Double-dated data—we tag each bit of business data with 2 dates:
> * The date on which the data changed out in the real world, the effective date.
> * The date on which the system found out about the change, the posting date.
> Using effective & posting dates together we can record all the strange twists & turns of feeding data into a system.
[0] https://tidyfirst.substack.com/p/eventual-business-consisten...
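A minimal sketch of the double-dated idea in SQLite (table and column names are my own invention):

```python
# Each row carries an effective date (when it changed in the real world)
# and a posting date (when the system found out). Queries then ask:
# "what did we believe on date X the world looked like on date Y?"
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE address_history (
        place_id INTEGER,
        address  TEXT,
        effective_date TEXT,   -- when the change happened in the world
        posting_date   TEXT    -- when our system learned about it
    )
""")
db.executemany(
    "INSERT INTO address_history VALUES (?, ?, ?, ?)",
    [
        (1, "1 Old Road", "2020-01-01", "2020-01-05"),
        # A late correction: we learn in 2024 that the address actually
        # changed back in 2022.
        (1, "1 New Road", "2022-06-01", "2024-03-10"),
    ],
)

def address_as_of(place_id, effective, known_by):
    """Latest address effective by `effective`, using only rows posted by `known_by`."""
    row = db.execute("""
        SELECT address FROM address_history
        WHERE place_id = ? AND effective_date <= ? AND posting_date <= ?
        ORDER BY effective_date DESC, posting_date DESC LIMIT 1
    """, (place_id, effective, known_by)).fetchone()
    return row[0] if row else None

print(address_as_of(1, "2023-01-01", "2023-01-01"))  # '1 Old Road' (correction not yet posted)
print(address_as_of(1, "2023-01-01", "2024-06-01"))  # '1 New Road'
```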
A less obvious issue is that to make this work well, you need to do time interval intersection searches/joins at scale. There is a dearth of scalable data structures and algorithms for this in databases.
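For what it's worth, a minimal sketch of what an interval-overlap join can look like with PostgreSQL range types and a GiST index (table and column names are made up):

```python
# The range overlap operator (&&) plus a GiST index on the range column
# is one of the few off-the-shelf options for interval intersection at scale.
import psycopg2

conn = psycopg2.connect("dbname=geo")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS boundary_versions (
        boundary_id INT,
        valid       daterange
    );
    CREATE INDEX IF NOT EXISTS boundary_valid_idx
        ON boundary_versions USING gist (valid);
""")
conn.commit()

# Which boundary versions were in force at any point during 2021?
cur.execute("""
    SELECT boundary_id, valid
    FROM boundary_versions
    WHERE valid && daterange('2021-01-01', '2022-01-01')
""")
```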
There's a great visualizer of the coordinate velocity from the Earthscope team:
https://www.unavco.org/software/visualization/GPS-Velocity-V...
What?
The motion of tectonic plates can be calculated relative to this spatial reference system, but the plates are not part of the spatial reference system, and it would kind of defeat the purpose if they were.
To correct for these cases you need to be able to separately attribute drift vectors due to the spatial reference system, plate tectonics, and other geophysical phenomena. Without a timestamp that allows you to precisely subtract out the spatial reference system drift vector, the magnitude of the uncertainty is quite large.
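A back-of-the-envelope illustration of how quickly that drift adds up (the velocity figure is ballpark, not authoritative):

```python
# Rough illustration only: a plate moving ~7 cm/year (roughly the
# Australian plate) accumulates metre-scale offsets over a few decades,
# which is why epoch-less coordinates carry that much ambiguity.
PLATE_VELOCITY_M_PER_YEAR = 0.07
years_since_epoch = 2024 - 1994   # e.g. coordinates captured against a 1994 datum

drift_m = PLATE_VELOCITY_M_PER_YEAR * years_since_epoch
print(f"Unmodelled drift after {years_since_epoch} years: ~{drift_m:.1f} m")
# ~2.1 m -- without a timestamp you can't subtract this back out.
```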
I’d imagine older coordinates would work with the earlier CRS?
But I can understand not all coordinates specify their CRS. This hasn't really been an issue for me personally, but I've mostly worked with NSW Spatial and the Australian Bureau of Statistics geodata.
For this reason, many people use the reproducibility rather than instrument precision as the noise floor. It doesn’t matter how precise an instrument you use if the “fixed point” you are measuring doesn’t sit still relative to any spatial reference system you care to use.
In most scientific and engineering domains, a high-precision, high-accuracy measurement is assumed to be reproducible.
But that's at least better than when it's some local place name which it's never heard of, and thinks sounds most similar to a place in Afghanistan (this happens all the time).
And to add to it, there are administrative regions, and ecclesiastical regions. Do you put them in the parish, or in the municipality? The birth in the parish and the baptism in the municipality, maybe? How about the burial then...
Uses an OpenStreetMap file, Python, and SQLite3.
First it finds all addresses within a +/- square around the lat/lon, then calculates the distance for that smaller list (Pythagoras) and picks the closest. It expands the square up to a set maximum if no address is found in the first search.
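Roughly what that looks like as a sketch, assuming an `addresses` table with lat/lon columns already built from the OSM extract:

```python
# Expanding bounding-box nearest-address search, as described above.
import sqlite3

def nearest_address(db: sqlite3.Connection, lat, lon, start=0.001, max_box=0.05):
    box = start
    while box <= max_box:
        rows = db.execute(
            """SELECT name, lat, lon FROM addresses
               WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?""",
            (lat - box, lat + box, lon - box, lon + box),
        ).fetchall()
        if rows:
            # Pythagoras is fine at this scale; for better accuracy you'd
            # scale the longitude difference by cos(latitude).
            return min(rows, key=lambda r: (r[1] - lat) ** 2 + (r[2] - lon) ** 2)
        box *= 2  # widen the square and try again
    return None
```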
Except for emergency dispatch and a few high-profile use cases, you can have a good enough address to let the user find its neighbourhood. But they still have the GPS or other form of address coding, so they can find the exact location easily. I'd say 99.9% of the cases are like that. The rest can be solved quickly by looking at the map!
Ultimately, I just want something which is a nice balance between being useful for a human and not so long that it is overwhelming.
The final step in the process “Wait for complaints” seems like a smart acceptance of the “perfect is the enemy of good” challenge
Publish and be damned, or as we say now: Move fast and break things
80% of the problem is just transforming floating point coordinates into API calls.
Getting to something useful with it is the hard 20%, and it will be a diminishing returns problem after that.
While I'm not anyone's idea of an LLM proponent, that last mile might be a good AI application.
Unfortunately (in my view), group #1 is making all the products and is responsible for the majority of applications of technology that get deployed. Obviously this is the case because they will take on projects that group #2 cannot, and have no compunction against shipping them. And we can see the results with our eyes. Terrible software that constantly underestimates the number and frequency of these "edge cases" and defects. Terrible software that still requires the user to do legwork in many cases because the developers made an incorrect assumption or had bad input data.
AI is making this problem even worse, because now we don't even know what the systems can and cannot do. LLMs nondeterministically fail in ways that sometimes can't even be directly corrected with code, and all engineering can do is stochastically fix defects by "training with better models."
I don't know how we get out of this: Every company is understandably biased towards "doing now" rather than "waiting" to research more and make a better product, and the doers outcompete the researchers.
This is an interesting take, and I think I see where you're coming from..
My first thought on "why" is that so many products today are free to the user, meaning the money is made elsewhere, and so the experience presented to the user can be a lot more imperfect or non-exhaustive than it would otherwise have to be if someone was paying to use that experience.
So edge cases can be ignored because really you're looking for a critical mass of eyeballs to sell to advertisers or to harvest usage data from, etc.. If a small portion of your users has a bad time or experiences errors, well, you get what you pay for as they say..
And does that kind of pervasiveness now mean that many engineers think this is just the way to go no matter what?
It is a deeply complex data model that changes millions of times a day in unpredictable ways. Unfortunately, many applications are very sensitive to the local accuracy of the model, which is much higher variance than average accuracy. Only trying to be “good enough” in an 80/20 rule sense is the same as “broken”. The updates are also noisy and often contain errors, so the process has to be resilient to those errors.
The resistance of the problem to automation and the high rate of change have made it extremely expensive to asymptotically converge on a model with consistently acceptable accuracy for the vast majority of applications.
That depends on your definition of "clear and simple" and "address" :) While a lot boils down to use case - are you trying to navigate somewhere, or link a string to an address? - even figuring out what is an address can be hard work. Is an address the entrance to a building? Or a building that accepts postal deliveries? Is the "shell" of a building that contains a bunch of flats/apartments but doesn't itself have a postal delivery point or bills registered directly to it an address? How about the address a location was known by 1 year ago? 2 years ago? 10 years ago?
Park and other public spaces can be fun; they may have many local names that are completely different to the "official" name - and it's a big "if" whether an official name exists at all. Heck, most _roads_ have a bunch of official names that are anything but the names people refer to them as. I have a screaming obsession with the road directly in front of Buckingham Palace that, despite what you see on Google Maps, is registered as "unnamed road" in all of the official sources.
> Addresses don't change often
At the individual level, perhaps. In aggregate? Addresses change all the time, sometimes unrecognisably so. City and town boundaries are forever expanding and contracting, and the borders between countries are hardly static either (and if you're ever near the Netherlands / Belgium border, make a quick trip to Baarle-Hertog and enjoy the full madness). Thanks to intercontinental relative movement, the coordinates we log against locations have a limited shelf life too. All of the things I used to think were certain...
If someone hasn't done "falsehoods programmers believe about addresses," I think its time might be now!
Edit: answering myself with https://www.mjt.me.uk/posts/falsehoods-programmers-believe-a...
- I avoid relying on any generic location name/description provided by these APIs. Always prefer structured data whenever possible, and build the locality name from those components (bonus points if you let the user specify a custom format).
- Identifying those components itself is tricky. As the author mentioned, there are countries that have states, others that have regions, others that have counties, or districts, or any combination of those. And there are cities that have suburbs, neighbourhoods, municipalities, or any combination. Oh, and let's not even get started with address names - house numbers? extensions? localization variants - e.g. even the same API may sometimes return "Marrakesh" and sometimes "Marrakech"? and how about places like India where nearby amenities are commonly used instead of house numbers? I'm not aware of any public APIs out there that provide these "expected" taxonomies, preferably from lat/long input, but I'd love to be proven wrong. In the absence of that, I would suggest it is better to avoid second-guessing - unless your software is only intended to run in a specific country, or in a limited number of countries and you can afford to hardcode those rules. It's probably a good option to provide a sensible default, and then let the user override it (see the sketch after this list). Oh, and good catch about abbreviations - I'd say to avoid them unless the user explicitly enables them, if you want to avoid the "does everybody know that IL is Illinois?" problem. Just use "Illinois" instead, at least by default.
- Localization of addresses is a tricky problem only on the surface. My proposed approach is that, again, the user is king. Provide English by default (unless you want to launch your software in a specific country), and let the user override the localization. I feel like Nominatim's API approach is probably the cleanest: honor the `Accept-Language` HTTP header if available, and if not available, fall back to English. And then just expose that as a setting to the user.
- Bounding boxes/polygons can help a lot with solving the proximity/perimeter issue. But they aren't always present/sufficiently accurate in OSM data. And their proper usage usually requires the client's code to run some non-trivial lat/long geometry processing, even to answer trivial questions such as "is this point inside of this enclosed amenity?" Oh, and let's not even get started with the "what's the exact lat/long of this address?" problem. Is it the entrance of the park? The middle of it? I remember that when I worked with the Bing API in the past, it provided more granular information at the level of rooftop location, entrance location etc.
- Providing localization information for public benches isn't what I'd call an orthodox use-case for geo software, so I'm not entirely sure of how to solve the "why doesn't everything have an address?" problem :)
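A rough sketch of the "build the display name from structured components, with a user-overridable format" idea mentioned above; the component keys are my own assumptions, loosely modelled on what reverse-geocoding APIs return:

```python
# Build a locality string from structured components, dropping whatever
# is missing, and let the user supply their own format string.
DEFAULT_FORMAT = "{road}, {neighbourhood}, {city}, {state}, {country}"
KNOWN_KEYS = ["road", "neighbourhood", "city", "state", "country"]

def format_place(components: dict, fmt: str = DEFAULT_FORMAT) -> str:
    # Fill in whatever components exist, drop the rest, and tidy commas.
    filled = fmt.format(**{k: components.get(k, "") for k in KNOWN_KEYS})
    return ", ".join(p.strip() for p in filled.split(",") if p.strip())

print(format_place(
    {"road": "Jemaa el-Fnaa", "city": "Marrakesh", "country": "Morocco"},
    fmt="{road}, {city}, {country}",   # user-supplied override
))  # Jemaa el-Fnaa, Marrakesh, Morocco
```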
Another problem is choosing which authority for the "correct" address. I've seen many cases where the official postal address city/town name is different than the 911 database. For example Canada Post will say some street addresses are in Dartmouth, while the official civic address is really Cole Harbour. https://www.canadapost-postescanada.ca/ac/ https://nsgi.novascotia.ca/civic-address-finder/
Even streets can have multiple official names/aliases. People who live on "East Bay Hwy", also live on "Highway 4", which is an alias.
I was working with a team that was wrapping up a period of many different projects (including a reverse geocoding service) and adopting one major system to design and maintain. The handover was set to be after the new year holidays and the receiving teams had their own exciting rewrites planned. I was on call the last week of the year and got an alert that sales were halted in Taiwan due to some country code issue and our system seemed at fault. The customer facing application used an address to determine all sorts of personalization stuff: what products they're shown, regulatory links, etc. Our system was essentially a wrapper around Google Maps' reverse geocoding API, building in some business logic on top of the results.
That morning, at 3am, the API stopped serving the country code for queries of Kinmen County. It would keep the rest of the address the same, but just omit the country code, totally botching assumptions downstream. Google Maps seemingly realized all of a sudden what strait the island was in, and silently removed what some people dispute.
Everyone else on the team was on holiday and I couldn't feasibly get a review for any major mitigations (e.g. switching to OSM or some other provider). So I drew a simple polygon around the island, wrote a small function to check if the given coordinates were in the polygon, and shipped the hotfix. Happily, the whole reverse geocoding system was scrapped with a replacement by February.
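For the curious, a sketch of what that kind of polygon hotfix can look like; the polygon coordinates below are rough placeholders for illustration, not anything that actually shipped:

```python
# Hand-drawn polygon plus a standard ray-casting point-in-polygon test,
# used to patch the missing country code for coordinates on the island.
KINMEN_POLYGON = [  # (lat, lon) pairs; placeholder rectangle, not real data
    (24.55, 118.15), (24.55, 118.50),
    (24.35, 118.50), (24.35, 118.15),
]

def point_in_polygon(lat, lon, polygon):
    """Ray casting: toggle `inside` each time a horizontal ray crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        (y1, x1), (y2, x2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

if point_in_polygon(24.44, 118.32, KINMEN_POLYGON):
    country_code = "TW"  # restore the field downstream code expected
```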
Also interesting that there's a Japanese island only 60 miles from Taiwan on the other side. I guess claims to small Pacific islands have been weird for a long time.
What the author is looking for is administrative divisions and boundaries[1], in particular probably down to level 3 which is the depth my game goes to. These differ in size greatly by country. With admin boundaries you need to accept there is no one-size-fits-all solution and embrace the quirks of the different countries.
For my game I downloaded a complete database of global admin boundaries[2] and imported them into PostgreSQL for lightning fast querying using PostGIS.
[1] https://en.wikipedia.org/wiki/List_of_administrative_divisio...
Reverse geocoding then becomes a problem of figuring out which polygons contain the point with a simple query and which POIs/streets/etc. are closest based on perpendicular distance. For that, I simply did a radius search and some post processing on any street segments. Probably not perfect for everything. But it worked well enough. My goal was actually being able to group things by neighborhood and microneighborhoods (e.g. squares, nightlife areas, etc.).
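Roughly what those queries can look like against PostGIS (table and column names here are assumptions, not from the original setup):

```python
# Point-in-polygon lookup for admin boundaries plus a nearby-streets
# radius search, ordered by distance in metres.
import psycopg2

conn = psycopg2.connect("dbname=geo")
cur = conn.cursor()
lat, lon = 51.5014, -0.1419  # sample point

# All admin polygons containing the point, most specific first.
cur.execute("""
    SELECT name, admin_level FROM admin_boundaries
    WHERE ST_Contains(geom, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
    ORDER BY admin_level DESC
""", (lon, lat))

# Streets within ~100 m, ordered by distance to the point.
cur.execute("""
    SELECT name, ST_Distance(geom::geography,
           ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography) AS metres
    FROM streets
    WHERE ST_DWithin(geom::geography,
          ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography, 100)
    ORDER BY metres
""", (lon, lat, lon, lat))
```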
This should work well enough with anything that allows for geospatial queries. In a pinch you can use geohashes (I actually did this because geospatial search was still a bit experimental in ES).
Hasn't What3Words already solved this?
I hit a dozen random benches. If these are the low requirements, W3W would and does do a great job for openbenches: https://openbenches.org/bench/random/