O3 beats a master-level GeoGuessr player, even with fake EXIF data

https://sampatt.com/blog/2025-04-28-can-o3-beat-a-geoguessr-master

336•bko•5h ago

Comments

orangecat•4h ago

Amazing. I'm relatively bullish on AI and still I would have bet on the human here. Looking forward to the inevitable goalpost-moving of "that's not real reasoning".

short_sells_poo•4h ago

Can you please explain to me how this is evidence for reasoning?

z7•4h ago

Quoting Chollet:

>I have repeatedly said that "can LLM reason?" was the wrong question to ask. Instead the right question is, "can they adapt to novelty?".

https://x.com/fchollet/status/1866348355204595826

kelseyfrog•4h ago

Because the output contains evidence of thought processes that have been established as leading to valid solutions to problems.

I have a simple question: Is text a sufficient medium to render a conclusion of reasoning? It can't be sufficient for humans and insufficient for computers - such a position is indefensible.

empath75•4h ago

I would say that almost all of what humans do is not the result of reasoning, and that reasoning is an unnatural and learned skill for humans, and most humans aren't good at even very basic reasoning.

kelseyfrog•3h ago

Usually we move the goalposts for AI. It takes more guts to move the goalposts for humans. I applaud it.

Do you suppose we can deduce reasoning through the medium of text?

zahlman•1h ago

> Because the output contains evidence of thought processes that have been established as leading to valid solutions to problems.

This sort of claim always just reminds me of Lucky's monologue in Waiting for Godot.

kelseyfrog•1h ago

You're not wrong. It's an artifact of rewriting the definition of reason into a sentence that begins with "Because the output ..."

s17n•4h ago

Geoguessing isn't much of a reasoning task, its more about memorizing a bunch of knowledge. Since LLMs contain essentially all knowledge, it's not surprising that they would be good at this.

As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.

distortionfield•4h ago

Because the Chinese Room is a much better analogy for what LLMs are doing inside than the Turing test is.

jibal•4h ago

That's a non sequitur that mixes apples and giraffes, and is completely wrong about what happens in the Chinese Room and what happens in LLMs. Ex hypothesi, the "rule book" that the Searle homunculus in the Chinese Room uses is "the right sort of program" to implement "Strong AI". The LLM algorithm is very much not that sort of program, it's a statistical pattern matcher. Strong AI does symbolic reasoning, LLMs do not.

But worse, the Turing Test is not remotely intended to be an "analogy for what LLMs are doing inside" so your comparison makes no sense whatsoever, and completely fails to address the actual point--which is that, for ages the Turing Test was held out as the criterion for determining whether a system was "thinking", but that has been abandoned in the face of LLMs, which have near perfect language models and are able to closely model modes of human interaction regardless of whether they are "thinking" (and they aren't, so the TT is clearly an inadequate test, which some argued for decades before LLMs became a reality).

semi-extrinsic•3h ago

> the TT is clearly an inadequate test, which some argued for decades before LLMs became a reality

To be specific, in a curious quirk of fate, LLMs seem to be proving right much of what Chomsky was saying about language.

E.g. in 1996 he described the Turing test as "although highly influential, it seems to me not only foreign to the sciences but also close to senseless".

(Curious in that VC backed businesses are experimentally verifying the views of a prominent anti-capitalist socialist.)

CamperBob2•52m ago

What happens if we give the operator of the Chinese Room a nontrivial math problem, one that can't simply be answered with a symbolic lookup but requires the operator to proceed step-by-step on a path of inquiry that he doesn't even know he's taking?

The analogy I used in another thread is a third grader who finds a high school algebra book. She can read the book easily, but without access to teachers or background material that she can engage with -- consciously, literately, and interactively, unlike the Chinese Room operator -- she will not be able to answer the exercises in the book correctly, the way an LLM can.

TimorousBestie•4h ago

A lot happens in seventy-five years.

jibal•4h ago

People were talking about the Turing Test as the criterion for whether a system was "thinking" up until the advent of LLMs, which was far less than 75 years ago.

debugnik•3h ago

The whole point of Turing's paper was to show that the Test doesn't answer whether a computer thinks, because it's a meaningless metric, but instead shows what the computer can do, which is much more meaningful.

zahlman•1h ago

I see this claim asserted frequently, but never with evidence. It doesn't match my personal perception.

bluefirebrand•4h ago

> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days

To be honest I am still not entirely convinced that current LLMs pass the turing test consistently, at least not with any reasonably skeptical tester

"Reasonably Skeptical Tester" is a bit of goalpost shifting, but... Let's be real here.

Most of these LLMs have way too much of a "customer service voice", it's not very conversational and I think it is fairly easy to identify, especially if you suspect they are an LLM and start to probe their behavior

Frankly, if the bar for passing the Turing Test is "it must fool some number of low intelligence gullible people" then we've had AI for decades, since people have been falling for scammy porno bots for a long time

jibal•4h ago

One needs to be more than "reasonably skeptical" and merely not "low intelligence gullible" to be a competent TT judge--it requires skill, experience, and understanding an LLM's weak spots.

And the "customer service voice" you see is one that is intentionally programmed in by the vendors via baseline rules. They can be programmed differently--or overridden by appropriate prompts--to have a very different tone.

LLMs trained on trillions of human-generated text fragments available from the internet have shown that the TT is simply not an adequate test for identifying whether a machine is "thinking"--which was Turing's original intent in his 1950 paper "Computing Machinery and Intelligence" in which he introduced the test (which he called "the imitation game").

bluefirebrand•2h ago

It's actually trivial, even with the best LLMs on the market:

Try to rapidly change the conversation to a wildly different subject

Humans will resist this, or say some final "closing comments"

Even the absolute best LLMs will happily go wherever they are led, without commenting remotely on topic shifts

Try it out

Edit: This isn't even a terribly contrived example by the way. It is an example of how some people with ADHD navigate normal conversations sometimes

shawabawa3•1h ago

Gemini is pretty good at resisting this

https://aistudio.google.com/app/prompts/1dxV3NoYHo6Mv36uPRjk...

It was doing so well until the last question :rip: but it's normal that you can jailbreak a user prompt with another user prompt, I think with system prompts it would be a lot harder

darkwater•4h ago

> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.

Well, in this case humans has to be trained as well but now there are humans pretty good at detecting LLM slobs as well. (I'm half-joking and half-serious)

sundarurfriend•4h ago

> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.

UCSD: Large Language Models Pass the Turing Test https://news.ycombinator.com/item?id=43555248

From just a month ago.

s17n•1h ago

Exactly - maybe the most significant long-term goal in computer science history has been achieved and it's barely discussed.

Macha•3h ago

Obviously when the Turing Test was designed, the thought was that anything that could pass it would so obviously be clearly human-like that passing it would be a clear signal.

LLMs really made it clear that it's not so clear cut. And so the relevance of the test fell.

zahlman•2h ago

Look at contemporary accounts of what people thought a conversation with a Turing-test-passing machine would look like. It's clear they had something very different in mind.

Realizing problems with previous hypotheses about what might make a good test, is not the same thing as choosing a standard and then revising it when it's met.

s17n•4m ago

I think any time a 50+ year old problem is solved, it should be considered a Big Deal, regardless of how the solution changes our understanding of the original problem.

TimorousBestie•4h ago

I don’t think any goalposts need to be redecorated. The “inner monologue” isn’t a reliable witness to o3’s model, it’s at best a post-hoc estimation of what a human inner monologue might be in this circumstance. So its “testimony” about what it is doing is unreliable, and therefore it doesn’t move the needle on whether or not this is “real reasoning” for some value of that phrase.

In short, it’s still anthropomorphism and apophenia locked in a feedback loop.

katmannthree•4h ago

Devil's advocate, as with most LLM issues this applies to the meatbags that generated the source material as well. Quick example is asking someone to describe their favorite music and why they like it, and note the probable lack of reasoning on the `this is what I listened to as a teenager` axis.

hombre_fatal•4h ago

Good point. When we try to explain why we're attracted to something or someone, what we do seems closer to modeling what we like to think about ourself. At the extreme, we're just story-telling about an estimation we like to think is true.

TimorousBestie•4h ago

I largely agree! Humans are notoriously bad at doing what we call reasoning.

I also agree with the cousin comment that (paraphrased) “reasoning is the wrong question, we should be asking about how it adapts to novelty.” But most cybernetic systems meet that bar.

ewoodrich•3h ago

Something as inherently subjective as personal preference doesn't seem like an ideal example to make that point. How could you expect to objectively evaluate something like "I enjoy songs in a minor scale" or "I hate country"?

katmannthree•2h ago

The point is to illustrate the disconnect between stated reasoning and proximate cause.

Consider your typical country music enjoyer. Their fondness of the art, as it were, is far more a function of cultural coding during their formative years than a deliberate personal choice to savor the melodic twangs of a corncob banjo. The same goes for people who like classic rock, rap, etc. The people who `hate' country are likewise far more likely to do so out of oppositional cultural contempt, same as people who hate rap or those in the not so distant past who couldn't stand rock & roll.

This of course fails to account for higher-agency individuals who have developed their musical tastes, but that's a relatively small subset of the population at large.

empath75•4h ago

I don't think the inner monologue is evidence of reasoning at all, but doing a task which can only be accomplished by reasoning is.

TimorousBestie•4h ago

Geoguessr is not a task that can only be accomplished by reasoning. Famously, it took a less than a day of compute time in 2011 to SLAM together a bunch of pictures of Rome (https://grail.cs.washington.edu/rome/).

jibal•4h ago

Such as? geoguessing certainly isn't that.

red75prime•4h ago

> it’s at best a post-hoc estimation of what a human inner monologue might be in this circumstance

Nope. It's not autoregressive training on examples of human inner monologue. It's reinforcement learning on the results of generated chains of thoughts.

jibal•4h ago

"It's reinforcement learning on the results of generated chains of thoughts."

No, that's not how LLMs work.

red75prime•3h ago

Base models are trained using autoregressive learning. "Reasoning models" are base models (maybe with some modifications) that were additionally trained using reinforcement learning.

Philpax•3h ago

That is how reasoning models work: https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1

InkCanon•4h ago

I think if your assumption is that AI is deducing where it is with rational thoughts, you would be. In truth what probably happened is that the significant majority of digital images of the world had been scraped, labeled and used as training data.

oncallthrow•4h ago

How do you explain https://simonwillison.net/2025/Apr/26/o3-photo-locations/?

Rumudiez•3h ago

they only posted one photo in the post, but going off of that it's still an easy match based on streetview imagery. furthermore, the AI just identified the license plate and got lucky that photographer lives in a populous area, making it more prominent in the training data and therefore more likely to be found (even though it was off by 200 miles on its first guess)

simonw•3h ago

I posted two more at the bottom, from Madagascar and Buenos Aires: https://simonwillison.net/2025/Apr/26/o3-photo-locations/#up...

Philpax•3h ago

Try it with your own photos from around the world. I used my own photos from Stockholm, San Francisco, Tvarožná, Saas-Fee, London, Bergen, Adelaide, Melbourne, Paris, and Sicily, and can confirm that it was within acceptable range for almost all of them (without EXIF data), and it absolutely nailed some of the more obvious spots.

SpaceManNabs•4h ago

> Looking forward to the inevitable goalpost-moving of "that's not real reasoning"

It did a web lookup.

It is not comparing humans and o3 with equal resources.

SamPatt•2h ago

That's really not a fair assessment.

It used search in 2 of 5 rounds, and it already knew the correct road in one of those rounds (just look at the search terms it used).

If you read the chain of thought output, you cannot dismiss their capability that easily.

SpaceManNabs•1h ago

Why is it not a fair assessment to say it is comparing two "clients" with different resources if one can do a web lookup and the other cannot?

You note yourself that it was meaningful in another round.

> Also, the web search was only meaningful in the Austria round. It did use it in the Ireland round too, but as you can see by the search terms it used, it already knew the road solely from image recognition.

SamPatt•12m ago

I thought it might matter somewhat in that one Austria round. I was incorrect - I re-ran both rounds where the model used search, without search this time, and the results were nearly identical. I updated the post with the details.

That's why I'm saying it's unfair to just claim it's doing a web lookup. No, it's way more capable than that.

SirHumphrey•4h ago

My objection is not “that is not real reasoning” my objection is that’s not that hard.

I happen to do some geolocating from static images from time to time and at least most of the images provided as examples contain a lot of clues- enough that i think a semi experienced person could figure out the location although - in fairness- in a few hours not few minutes.

Second, the similar approaches were tried using CNNs and it worked (somewhat)[1].

[1]: https://huggingface.co/geolocal/StreetCLIP

EDIT: I am not talking about geoguesser - i am talking about geolocating an image with everything available (e.g. google…)

usaar333•4h ago

Why? AI beat rainbolt 1.5 years ago: https://www.npr.org/2023/12/19/1219984002/artificial-intelli...

AI tends to have superhuman pattern matching abilities with enough data

karlding•3h ago

If you watch the video, (one of) the reasons why the AI was winning was because it was using “meta” information from the Street View camera images, and not necessarily because it’s successfully identifying locations purely based on the landmarks in the image.

> I realized that the AI was using the smudges on the camera to help make an educated guess here.

[0] https://youtu.be/ts5lPDV--cU?t=1412

ApolloFortyNine•3h ago

Pro geoguessr players do the same thing. The vividness of the colors and weirdness in the sky are two examples I've seen Rainbolt use in the past (and he's not even the best).

ZeWaka•27m ago

Meta is widely used by humans. One funny one is the different hiding-masks for the different streetview cars.

1970-01-01•2h ago

Give it a photo from the surface of Mars and verify if it's actually capable of thinking outside the box or if it's relying on metadata and text.

https://nssdc.gsfc.nasa.gov/planetary/image/mera_hills.jpg

SamPatt•44m ago

Done. Here's o3's reply:

>That’s not Earth at all—this is the floor of Jezero Crater on Mars, the dusty plain and low ridge captured by NASA’s Perseverance rover (the Mastcam-Z color cameras give away the muted tan-pink sky and the uniform basaltic rubble strewn across the regolith).

zahlman•2h ago

> Looking forward to the inevitable goalpost-moving of "that's not real reasoning".

How is that moving the goalposts? Where did you see them set before, and where did your critics agree to that?

TimTheTinker•1h ago

> Looking forward to the inevitable goalpost-moving of "that's not real reasoning".

It's less about the definition of "reasoning" and more about what's interesting.

Maybe I'm wrong here ... but a chess bot that wins via a 100% game solution stored in exabytes of precomputed data might have an interesting internal design (at least the precomputing part), but playing against it wouldn't keep on being an interesting experience for most people because it always wins optimally and there's no real-time reasoning going on (that is, unless you're interested in the experience of playing against a perfect player). But for most people just interested in playing chess, I suspect it would get old quickly.

Now ... if someone followed up with a tool that could explain insightfully why any given move (or series) the bot played is the best, or showed when two or more moves are equally optimal and why, that would be really interesting.

bongodongobob•4h ago

I tried this the other day with a picture of my dog in a field in a park by a river with EXIF stripped. It gave me a list of parks on the correct river. There were really no other features other than the foliage and water. Seems like magic to me, I don't really understand how it's possible to be that accurate.

fwip•4h ago

Unless you were running the model locally in a way you control, it seems very possible that the model has additional hidden context regarding your location (perhaps based on IP, perhaps based on credit card billing information, etc).

bongodongobob•4h ago

That kind of what I'm thinking, seems too good to be true.

buzzerbetrayed•3h ago

It still works if you upload an image that was taken on the opposite side of the globe from where you are.

Workaccount2•4h ago

The hidden context is probably many other photos taken in that area appearing in the training set.

christianqchung•4h ago

ChatGPT doesn't let you share chats with images (why?), but here is what o3 said for an image with memory turned off:

"Thought for 11 seconds

It’s hard to pin down an exact spot from this single photo, but a few visual clues narrow the possibilities:

Temperate-zone trees: Most of the trees have lost their leaves, but scattered white-pines or spruces remain green. That mix—deciduous hardwoods plus a sprinkling of conifers—is typical of the northeastern United States and southeastern Canada.

Broad, moderately fast river: The channel looks 75-100 m wide with swirling surface currents but no large rapids, suggesting a mid-sized river in its lower reaches—something like the Merrimack, Connecticut, Delaware, or Susquehanna rather than a small mountain stream.

Late-winter / early-spring light: Bare branches and a clear blue sky with fairly high sun angle hint at March–April in that latitude range.

So my best guess is that you’re standing on the bank of a northeastern U.S. river—perhaps somewhere in New England or the Mid-Atlantic—during late winter or early spring. If you can share any nearby landmarks, town names, or additional photos, I might be able to zero in more precisely."

With memory turned on:

"Thought for 5 seconds

Beautiful spot! From what I can see:

a broad, slow‑moving river with a sandy/rocky bank in the foreground

bare deciduous trees mixed with some white pines on the far shore (so it’s probably early spring or late autumn)

no mountains on the horizon, just gently rolling terrain typical of low‑lying floodplains in the U.S. Northeast

That mix of hardwoods and pines, plus the flat, braided water surface, reminds me a lot of the Merrimack or Concord Rivers in Massachusetts or southern New Hampshire—both have public footpaths where you can stand right at the water’s edge like this.

Of course, many temperate rivers look similar. If you can give me a hint (e.g., the town or nearby landmark), I can narrow it down further—maybe even point you to trail maps or history of the exact stretch you photographed."

Southern NH is correct, and I am certain it is drawing from memory/past chats. However, I can't replicate a specific behavior I once had, which is in temporary chat (no past chat/memory enabled), it said that it guessed where the photo was taken based on my location.

simonw•3h ago

"ChatGPT doesn't let you share chats with images (why?)"

Probably because if you uploaded pornography (or illegal imagery) to ChatGPT and then shared a link with the world it would be embarrassing for OpenAI.

christianqchung•1h ago

Wouldn't that apply to any website with image hosting abilities though? Why does that apply to OpenAI in particular?

On an unrelated note, I like your blog.

simonw•44m ago

My guess is that OpenAI are risk averse on this particular issue, because people could come up with some very creative awful scenarios with ChatGPT and image analysis. "Which of these people looks the most stupid" plus a jailbreak would be instant negative press.

causality0•4h ago

Could it have used data you inadvertently supplied it, like the location from which you uploaded the image?

tough•4h ago

chatGPT has metadata about you

only can try proof this correctly on a fresh anon guest vpn session

julianhuang•4h ago

1. The "master geoguesser" is a bit misleading--as mentioned in his blog post, there are players far better than him, and he is certainly not the bar for human supremacy. Probably analogous to a 1400-1800 elo chess player. 2. o3 isn't the best model at playing GeoGuessr, Gemini 1.5 & 2.5 solidly beat it out--for those interested, check out my friend's benchmark (https://geobench.org/) and blog post (https://ccmdi.com/blog/GeoBench) detailing interesting model explanations. 3. In the post, he only tests on one game--o3's average score over 100 locations (20 5-location games) was 19,290, far lower than the 23,179 in the game. Model geolocation capabilities are really important to keep track of, but the specific blog post in question isn't anything out of the ordinary. LLMs are making geolocation abilities much more accessible, but still fall short compared to 1. top GeoGuessr players playing GeoGuessr (only google streetview coverage, without web search) and 2. professional geolocators, who are proficient at using a wide variety of software/search. I.e., if the CIA wanted to find someone using an image, LLMs would not provide them any unique ability to do so as opposed to someone like Rainbolt

kenjackson•4h ago

From your linked article: "It's clear that large language models have an emergent capability to play GeoGuessr, and play it well. Their abilities to generalize are nascent, but present"

This is very accurate -- their abilities to generalize are nascent, but still surprisingly capable. The world is about to send through its best and brightest math/CS minds over the next decade (at least) to increase the capabilities of these AIs (with the help of AI). I just don't understand the pessimism with the technology.

julianhuang•4h ago

I completely agree that this is an incredible advancement as someone who has watched the rise of LLMs' GeoGuessr abilities. I just wanted to qualify the claim made in the blog post "In Which I Try to Maintain Human Supremacy for a Bit Longer". I also think that models would need to become far more proficient at tool use (querying OpenStreetMap features, coverage-checking Google Maps, calculating sun direction, etc.) that current human geolocators have access to, and precise spatial reasoning. Additionally, there is a whole corpus of GeoGuessr-specific knowledge that probably wouldn't have a reason to be in model training data (Google Street View coverage quirks, the specific geographic ranges of vegetation and architecture, tiny visual idiosyncrasies in country infrastructure like the width of red on a sign, etc.). However, I think this could probably be solved with more data, and I don't think there is any fundamental barrier.

fmbb•4h ago

How fast are they compared to human players?

SamPatt•3h ago

I'm the author - I tried to be as upfront as possible about my skill level in the post.

The human supremacy line is just a joke, there are already models specifically trained for Geoguessr which are already beating the best players in the world, so that ship has sailed.

That geobench work is really cool, thanks for sharing it.

amrrs•4h ago

It's thinking process to go about guessing a place is further fascinating. Even o4 mini high is quite good[1] and very fast.

But unlike a geogussr, it uses websearch[1] [1] https://youtu.be/P2QB-fpZlFk?si=7dwlTHsV_a0kHyMl [1]

bredren•4h ago

Neat to see progress of this from Simon's original post to comment to this.

vunderba•1h ago

The original post was actually this one - two weeks prior.

https://news.ycombinator.com/item?id=43723408

https://flausch.social/@piegames/114352447253793517

asdsadasdasd123•4h ago

This is probably one of the less impressive LLM applications imo. Like it already knows what every plant, street sign, etc is. I would imagine a traditional neural net would do really well here as well if you can extract some crude features.

EGreg•4h ago

Cant the same be said about “unimpressive” behavior by coding LLMs that know every algorithm, language and library?

asdsadasdasd123•4h ago

Disagree because code has to be far more precise than, the location is in the jungles of brazil. This level of coding as never been achievable by traditional ML methods AFAIK

exitb•4h ago

I tried a picture of Dublin and it pointed out the hotel I took it from. Obviously that’s more data than any single person can keep in their head.

OtherShrezzing•4h ago

It's my understanding that o3 was trained on multimodal data, including imagery. Is it unreasonable to assume its training data includes images of these exact locations and features? GeoGuesser uses Google Maps, and Google Maps purchases most of its imagery from third-parties these days. If those third parties aren't also selling to all the big AI companies, I'd be very surprised.

Yenrabbit•4h ago

Try it with your own personal photos. It is scarily good!

rafram•4h ago

That's true for heavily photographed urban areas. I've tried it on some desert photos (even at identifiable points with human structures) and it basically just guesses a random trailhead in Joshua Tree and makes up a BS explanation for why it matches.

kube-system•4h ago

I have had surprisingly good luck with beach photos that don’t have much beyond dunes and vegetation in them

walthamstow•1h ago

Good luck meaning o3 guessed it right or wrong?

kube-system•1h ago

o3 made very accurate guesses, and had plausible explanations for the features it analyzed

GaggiX•4h ago

It does work well with images you have taken, not just Geoguessr: https://simonwillison.net/2025/Apr/26/o3-photo-locations/

thi2•3h ago

> I’m confident it didn’t cheat and look at the EXIF data on the photograph, because if it had cheated it wouldn’t have guessed Cambria first.

Hm no way to be sure though, would be nice to do another run without Exif information

pests•4h ago

> Google Maps purchases most of its imagery from third-parties these days

Maps maybe, but Streetview? Rainbolt just did a video with two Maps PMs recently and it sounds like they still source all their street view themselves considering the special camera and car needed, etc.

mikeocool•4h ago

My understanding is you're correct -- Google still captures a lot of their own street view imagery.

Though there are other companies that capture the same sorts of imagery and license it. TomTom imagery is used on the Bing Maps street view clone.

OtherShrezzing•4h ago

Maybe the end-user isn't Google Maps, but TomTom have a pretty comprehensive street-view-ish product for private buyers like car companies, Bing and Apple Maps called MoMa.

I'd be surprised if this building[0] wasn't included in their dataset from every road-side angle possible, alongside every piece of locational metadata imaginable, and I'd be surprised if that dataset hasn't made it into OpenAI's training data - especially when TomTom's relationship to Microsoft, and Microsoft's relationship to OpenAI, is taken into account.

[0] https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...

shpx•3h ago

You can upload your own panoramic images to Street View, people do this for hiking trails. But I'm sure 99% of streetview imagery is Google-sourced and Geoguessr might not even use user-submitted imagery.

https://www.google.com/streetview/contribute/

pests•53m ago

I believe Geogesser categorizes their games on this facet. Rainbolt plays on only official imagery.

cpeterso•2h ago

Here's a link to that interview: https://youtu.be/2T6pIJWKMcg

thrance•3h ago

A machine that's read every book ever written, seen every photo ever taken, visited every streets on Earth... That feels a little frightening.

rafram•4h ago

From one of o3 outputs:

> Rear window decal clearly reads “www.taxilinder.at”. A quick lookup shows Taxi Linder GmbH is based in Dornbirn, Vorarlberg.

That's cheating. If it can use web search, it isn't playing fair. Obviously you can get a perfect score on any urban GeoGuessr round by looking up a couple businesses, but that isn't the point.

artninja1988•4h ago

It is against the rules? I thought it's all fair game, but you are time constrained

rafram•4h ago

Yes: https://www.geoguessr.com/community-rules

sltkr•4h ago

But if anything, those rules benefit ChatGPT: it can remember ~all of Wikipedia and translate ~every language on Earth, while a human would need access to online services for that.

If anything, I'd think allowing looking stuff up would benefit human players over ChatGPT (though humans are probably much slower at it, so they probably lose on time).

kbenson•3h ago

If it takes a model and database with a large chunk of the internet to compete and win, then that says something, as that's much more expensive and complex than just the model, because models have problems "remembering" correctly just like people.

It's important to have fair and equivalent testing not because that allows people to win, but because it shows where the strengths and weaknesses of people and current AI actually are in a useful way.

cocoto•3h ago

Connecting an LLM to the web or database is something cheap, not something expensive.

kbenson•2h ago

I'm not sure how to make sense of this in the context of what we're discussing. Access to the web is exactly what's in question, and emulating the internet to a degree you don't actually need to access it to have the information is very expensive in resources because of how massive the dataset is, which is the point I was making.

Gud•2h ago

Same with a human.

twojacobtwo•3h ago

Why was this downvoted? It's a fair question and it wasn't stated as fact.

3np•2h ago

Because an accepted answer to that specific question is invariably a link/reference that the asker could have searched for (and posted if they think it's useful for the discussion) themselves directly, instead of putting that burden on the rest of us and amortizing everyone's attention. It's entitled and lazy.

Alternative example: "I wondered what the rules actually say about web search and it is indeed not allowed: (link)"

silveraxe93•4h ago

Yeah, the author does note that in the article. He also points it out in the conclusion:

> If it’s using other information to arrive at the guess, then it’s not metadata from the files, but instead web search. It seems likely that in the Austria round, the web search was meaningful, since it mentioned the website named the town itself. It appeared less meaningful in the Ireland round. It was still very capable in the rounds without search.

rafram•4h ago

Seems like they should've just repeated the test. But without the huge point lead from the rounds where it cheated, it wouldn't have looked very impressive at all.

silveraxe93•3h ago

People found the original post so impressive they were saying that it had to be coming from cheating by looking at EXIF data. The point of this article was to show it doesn't. It got an unfair advantage in 1 (and say 0.5) out of 5. With the non-search rounds still doing great.

If you think this is unimpressive, that's subjective so you're entitled to believe that. I think that's awesome.

godelski•2h ago

Sorry, I think I misread you. I think you said

  People accused it of cheating by reading EXIF data. They were wrong, it cheated by using web search. That makes the people that accused it of cheating wrong and this post proves that.

And is everyone forgetting that what OpenAI shows you during the CoT is not the full CoT? I don't think you can fully rely on that to make claims about when it did and didn't search

SamPatt•3h ago

That's inaccurate. It beat me by 1,100 points, and given the chain of thought demonstrated that it knew the general region of both guesses before it employed search, it would likely have still beaten me in those rounds. Though probably by fewer points.

I will try it again without web search and update the post though. Still, if you read the chain of thought, it demonstrates remarkable capabilities in all the rounds. It only used search in 2/5 rounds.

godelski•2h ago

I'd be interested at capabilities without web search. The displayed CoT isn't the full CoT so it's hard to know if it really is searching or not. I mean it isn't always obvious when it does. Plus, the things are known to lie ¯\_(ツ)_/¯

SamPatt•2h ago

I do understand the skepticism, and I'll run it again without search to see what happens.

But a serious question for you: what would you need to see in order to be properly impressed? I ask because I made this post largely to push back on the idea that EXIF data matters and the models aren't that capable. Now the criticism moves to web search, even though it only mattered in one out of five rounds.

What would impress you?

mattmanser•53m ago

You're kinda being your own worse enemy though.

"Technically cheating"? Why even add the "technically".

It just gives the impression that you're not really objectively looking for any smoke and mirrors by the AI.

SamPatt•25m ago

I hear you - but I had already read through the chain of thought which identified the right region before search, and had already seen the capabilities in many other rounds. It was self-evident to me that the search wasn't an essential part of the model's capabilities by that point.

Which turned out to be true - I re-ran both of those rounds, without search this time, and the model's guesses were nearly identical. I updated the post with those details.

I feel like I did enough to prove that o3's geolocation abilities aren't smoke and mirrors, and I tried to be very transparent about it all too. Do you disagree? What more could I do to show this objectively?

godelski•41m ago

  > What would impress you?

I want to be clear that you tainted the capacity to impress me by the clickbait title. I don't think it was through malice, but I hope you realize the title is deceptive.[0] (Even though I use strong language, I do want to clarify I don't think it is malice)

To paraphrase from my comment: if you oversell and under deliver, people feel cheated, even if the deliverable is revolutionary.

So I think you might have the wrong framing to achieve this goal. I am actually a bit impressed by O3's capabilities. But at the same time you set the bar high and didn't go over or meet it. So that's going to really hinder the ability to impress. On the other hand, you set the bar low, it usually becomes easy to. It i slike when you have low expectations for a movie and it's mediocre you still feel good, right?

[0] https://news.ycombinator.com/item?id=43836791

SamPatt•1h ago

I did repeat the test without search, and updated the post. It made no difference. Details here:

https://news.ycombinator.com/item?id=43837832

clhodapp•2h ago

The question is not only how much it helped the AI model but rather how much it would have helped the human.

This is because the AI model could have chosen to run a search whenever it wanted (e.g. perhaps if it knew how to leverage search better, it could have used it more).

In order for the results to be meaningful, the competitors have to play by the same rules.

Ukv•4h ago

The author did specifically point out that

> Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win.

and had noted in the methodology that

> Browsing/tools — o3 had normal web access enabled.

Still an interesting result - maybe more accurate to say O3+Search beats a human, but could also consider the search index/cache to just be a part of the system being tested.

spookie•4h ago

A human can also use the same tools if it wasn't for the rules or fair play. They should've simply redone the test.

ceph_•3h ago

The AI should be forced to use the same rules as the human. Not the other way around. The AI shouldn't be using outside resources.

bscphil•3h ago

I think that's part of the point they're making, hence "They should've simply redone the test."

voxic11•2h ago

Another rule bans "using third-party software or scripts in order to gain an unfair advantage over other players."

So is it even possible for O3 to beat another player while complying with the rules?

ben_w•1h ago

If a player uses such a model, the model is third-party and the player is cheating.

But: when a specific model is itself under test, I would say that during the test it becomes "first" (or second?) party rather than "third".

krferriter•3h ago

An AI being better than a human at doing a google search and then skimming a bunch of pages to find location-related terms isn't as interesting of a result.

inerte•3h ago

How the heck is not? Computers are looking into screenshots and searching the internet to support their "thinking", that's amazing! Have we become so used to AI and what was impossible 6 months ago is shruggable today?

I've being doing this MIND-Dash diet lately and it's amazing I can just take a picture of whatever (nutritional info / ingredients are perfect for that) and just ask if it fits my plan and it tells me back into what bucket it falls, with detailed breakdown of macros in support of some additional goals I have (muscle building for powerlifting). It's amazing! And it does in 2 minutes passively what it would take me 5-10 active search.

godelski•3h ago

In the same way a calculator performing arithmetic faster than humans isn't impressive. The same way running regex over a million lines and the computer beating a human in search isn't impressive

ludwik•2h ago

Neither is impressive solely because we've gotten used to them. Both were mind-blowing back in the day.

When it comes to AI - and LLMs in particular - there’s a large cohort of people who seem determined to jump straight from "impossible and will never happen in our lifetime" to "obvious and not impressive", without leaving any time to actually be impressed by the technological achievement. I find that pretty baffling.

godelski•34m ago

I agree, but without removing search you cannot decouple. Has it embedded a regex method and is just leveraging that? Or is it doing something more? Yes, even the regex is still impressive but it is less impressive that doing something more complicated and understanding context and more depth.

ekidd•2h ago

I fully expect that someday the news will announce, "The AI appears to be dismantling the moons of Jupiter and turning them into dense, exotic computational devices which it is launching into low solar orbit. We're not sure why. The AI refused to comment."

And someone will post, "Yeah, but that's just computer aideded design and manufacturing. It's not real AI."

The first rule of AI is that the goalposts always move. If a computer can do it, by definition, it isn't "real" AI. This will presumably continue to apply even as the Terminator kicks in the front door.

eru•2h ago

Yes, but I choose to interpret that as a good thing. It is good that progress is so swift and steady that we can afford to keep moving the goalposts.

Take cars as a random example: progress there isn't fast enough that we keep moving the goalposts for eg fuel economy. (At least not nearly as much.) A car with great fuel economy 20 years ago is today considered at least still good in terms of fuel economy.

jug•1h ago

Yeah, it's a funny take because this is in fact a more advanced form of AI with autonomous tool use that is just now emerging in 2025. You might say "They could search the web in 2024 too" but that wasn't autonomous on its own, but required telling so or checking a box. This one is piecing ideas together like "Wait, I should Google for this" and that is specifically a new feature for OpenAI o3 that wasn't even in o1.

While it isn't entirely in the spirit of GeoGuesser, it is a good test of the capabilities where being great at GeoGuesser in fact becomes the lesser news here. It will still be if disabling this feature.

SamPatt•38m ago

That isn't what's happening though. I re-ran those two rounds, this time without search, and it changed nothing. I updated the post with details, you can verify it yourself.

Claiming the AI is just using Google is false and dismissing a truly incredible capability.

arandomhuman•3h ago

But then they couldn't make a click bait title for the article.

_heimdall•3h ago

This seems like a great example of why some are so concerned with AI alignment.

The game rules were ambiguous and the LLM did what it needed to (and was allowed to) to win. It probably is against the spirit of the game to look things up online at all but no one thought to define that rule beforehand.

umanwizard•3h ago

No, the game rules aren't ambiguous. This is 100% unambiguously cheating. From the list of things that are definitely considered cheating in the rules:

> using Google or other external sources of information as assistance during play.

The contents of URLs found during play is clearly an external source of information.

GaggiX•3h ago

I believed the rules were not explained to the model so it does use what it can.

misnome•3h ago

Then you can 100% not claim it is “Playing” the game

ben_w•2h ago

That right there *is the alignment problem*.

If I task an AI with "peace on earth" and the solution the AI comes up with is ripped from The X-Files* and it kills everyone, it isn't good enough to say "that's cheating" or "that's not what I meant".

* https://en.wikipedia.org/wiki/Je_Souhaite

GaggiX•1h ago

It's playing a game in which the rules are a bit ambiguous if not explained.

tshaddox•46m ago

o3 already is an external source of information. It's an online service backed by an enormous model generated from an even more enormous corpus of text via an enormous amount of computing power.

godelski•3h ago

Pointing out that it is cheating doesn't excuse the lie in the headline. That just makes it bait and switch, a form of fraud. OP knew they were doing a bait and switch.

I remember when we were all pissed about clickbait headlines because they were deceptive. Did we just stop caring?

bahmboo•2h ago

The headline said the AI beat him, it did not say it beat him in a GeoGuessr game. The article clearly states what he did and why.

SecretDreams•2h ago

What's your suggestion for an alternative headline?

godelski•13m ago

  Can O3 Beat a Master-Level GeoGuessr?
  How Good is O3 at GeoGuessr?
  EXIF Does Not Explain O3's GeoGuessr's Performance
  O3 Plays GeoGuessr (EXIF Removed)

But honestly, OP had the foresight to remove EXIF data and memory from O3 to reduce contamination. The goal of the blog post was to show that O3 wasn't cheating. So by including search, they undermine the whole point of the post.

The problem really stems from the lack of foresight. Lack of misunderstanding the critiques they sought to address in the first place. A good engineer understands that when their users/customers/<whatever> makes a critique, that what the gripe is about may not be properly expressed. You have to interpret your users complaints. Here, the complaint was "cheating", not "EXIF" per se. The EXIF complaints were just a guess at the mechanism in which it was cheating. But the complaint was still about cheating.

sdenton4•2h ago

The people pissed about clickbait headlines were often overstating things to drum up outrage and accumulate more hacker news upboats...

godelski•27m ago

I'm not sure why you're defending clickbait. It is just fraud. I'm not sure why we pretend it is anything different.

Sure, people made overblown claims about the effects, but that doesn't justify fraud. A little fraud is less bad than major fraud, but that doesn't mean it isn't bad.

jasonlotito•2h ago

One of the rules is banning the use of third-party software or scripts.

Any LLM attempting to play will lose because of that rule. So, if you know the rules, and you strictly adhere to them (as you seem to be doing) than no need to click on the link. You already know it's not playing buy GeoGuesser rules.

That being said, if you are running a test, you are free to set the rules as you see fit and explain so, and under the conditions set by the person running the test, these are the results.

> Did we just stop caring?

We stopped caring about pedantry. Especially when the person being pedantic seems to cherry pick to make their point.

kenjackson•1h ago

Technically the LLM is 3rd party software so the use of it is cheating. QED

godelski•10m ago

This doesn't mean you shouldn't try to make things as far as possible. Yes, it would still technically violate rules, but don't pretend like this is binary.

  > We stopped caring about pedantry

Did we? You see to be responding to my pedantic comment with a pedantic comment.

627467•2h ago

Cheating implies there's a game. There isn't.

> Titles and headlines grab attention, summarize content, and entice readers to engage with the material

I'm sorry you felt defrauded instead. To me the title was very good at conveying to me the ability of o3 in geolocating photos.

hatthew•1h ago

Title says o3 beat a [human] player. That implies there is some competition that has the capacity to be fair or unfair.

sebzim4500•1h ago

Presumably being an AI is technically against the GeoGuessr rules so I don't see how there can be an expectation that those rules were followed.

tshaddox•47m ago

Sure, but o3 is itself already an online service backed by an enormous data set, so regardless of whether it also searched the web, it's clearly not literally "playing fair" against a human.

godelski•25m ago

But it still bounds the competition. OP is skilled in the domain. I'm not, so if I wrote a post about how O3 beat me you'd be saying how mundane of a result it is. I mean, I suck at Geoguesser. Beating me isn't impressive. This is also a bound

layman51•4h ago

Using the decal as a clue is funny because what if there was a street scene where that happened to be misleading? For example, I had seen that a Sacramento County Sheriff car got to Europe and I guess it now belonged to a member of the public who is driving it with the original decals still attached. I wonder how the LLM would reason if it sees the car as “out of place”.

victorbjorklund•2h ago

Probabilities. That could happen with anything. Someone could build a classic japanese house with a japanese garden in Hawaii. But Japan is probably a better guess if you see a japanese house with japanese fauna.

yen223•1h ago

Stands to reason a human might get fooled by this as well

SamPatt•21m ago

Absolutely!

It happens occasionally - the most common example I can think of it getting a license plate or other location from a tractor-trailer (semi) on the highway. Those are very unreliable.

You also sometimes get flags in the wrong countries, immigrants showing their native pride or even embassies.

SamPatt•3h ago

Author here - it's a fair criticism, and I point it out in the article. However, I kept it in for a few reasons.

I'm trying to show the model's full capabilities for image location generally, not just playing geoguessr specifically. The ability to combine web search with image recognition, iteratively, is powerful.

Also, the web search was only meaningful in the Austria round. It did use it in the Ireland round too, but as you can see by the search terms it used, it already knew the road solely from image recognition.

It beat me in the Colombia round without search at all.

It's worthwhile to do a proper apples and apples comparison - I'll run it again and update the post. But the point was to show how incredibly capable the model is generally, and the lack of search won't change that. Just read the chain of thought, it's incredible!

k4rli•2h ago

It's still as much cheating as googling. Completely irrelevant. Even if it were to beat Blinky, it's not different from googlers/scripters.

SamPatt•42m ago

I disagree. I ran those rounds again, without search this time, and the results were nearly identical:

https://news.ycombinator.com/item?id=43837832

IanCal•1h ago

I tried the image without search and it talked about Dornbirn anyway but ended up choosing Bezau which is really quite close.

edit - the models are also at a disadvantage in a way too, they don't have a map to look at while the pick the location.

SamPatt•9m ago

Yes, I re-ran those rounds and it made the same guesses without search, within 1km I believe.

You're right about not having a map - I cannot imagine trying to line up the Ireland coast round without referencing the map.

LeifCarrotson•1h ago

There's some level at which an AI 'player' goes from being competitive with a human player, matching better-trained human strategy against a more impressive memory, to just a cheaty computer with too much memorization. Finding that limit is the interesting thing about this analysis, IMO!

It's not interesting playing chess against Stockfish 17, even for high-level GMs. It's alien and just crushes every human. Writing down an analysis to 20 move depth, following some lines to 30 or more, would be cheating for humans. It would take way too long (exceeding any time controls and more importantly exceeding the lifetime of the human), a powerful computer can just crunch it in seconds. Referencing a tablebase of endgames for 7 pieces would also be cheating, memorizing 7 terabytes of bitwise layouts is absurd but the computer just stores that on its hard drive.

Human geoguessr players have impressive memories way above baseline with respect to regional infrastructure, geography, trees, road signs, written language, and other details. Likewise, human Jeopardy players know an awful lot of trivia. Once you get to something like Scrabble or chess, it's less and less about knowing words or knowing moves, but more about synthesizing that knowledge intelligently.

One would expect a human to recognize some domain names like, I don't know, osu.edu: lots of people know that's Ohio State University, one of the biggest schools in the US, located in Columbus, Ohio. They don't have to cheat and go to an external resource. One would expect a human (a top human player, at least) to know that taxilinder.at is based in Austria. One would never expect any human to have every business or domain name memorized.

With modern AI models trained on internet data, searching the internet is not that different from querying its own training data.

mrlongroots•1h ago

To reframe your takeaway: you want to benchmark the "system" and see how capable it is. The boundaries of the system are somewhat arbitrary: is it "AI + web" or "only AI", and it is not about fairness as much as about "what do you, the evaluator, want to know".

tshaddox•50m ago

> There's some level at which an AI 'player' goes from being competitive with a human player, matching better-trained human strategy against a more impressive memory, to just a cheaty computer with too much memorization. Finding that limit is the interesting thing about this analysis, IMO!

And a lot of human competitions aren't designed in such a way that the competition even makes sense with "AI." A lot of video games make this pretty obvious. It's relatively simple to build an aimbot in a first-person shooter that can outperform the most skilled humans. Even in ostensibly strategic games like Starcraft, bots can micro in ways that are blatantly impossible for humans and which don't really feel like an impressive display of Starcraft skill.

Another great example was IBM Watson playing Jeopardy! back in 2011. We were supposed to be impressed with Watson's natural language capabilities, but if you know anything about high-level Jeopardy! then you know that all you were really seeing is that robots have better reflexes than humans, which is hardly impressive.

rowanG077•5m ago

You seem indicate you want a computer to beat a human without ever using what a computer is actually good at(large memories, brute force compute etc). That seems a little ridiculous to me. Sure I do agree that the web search is too far, because it's literally cheating. But stockfish is super human at chess, it doesn't really matter that it can do this by leveraging the strengths of a computer.

WhitneyLand•3h ago

As models continue to evolve it may not even need to cheat.

Since web scale data is already part of pre-training this info is in principle available for most businesses without a web search.

The exceptions would be if it’s recently added, or doesn’t appear often enough to generate a significant signal during training, as in this case with a really small business.

It’s not hard to imagine base model knowledge improving to the point where it’s still performing at almost the same level without any web search needed.

ricardo81•2h ago

>isn't playing fair.

the idea of having nth more dimensions of information, readable and ingestible within a short frame of time probably isn't either.

ACS_Solver•1h ago

I just tried (o4-mini-high) and had it come to the wrong conclusion when I asked about the location and date, because it didn't search the web. I have a photo of a bench with a sign mentioning the cancellation of an event due to the Pope's death. It impressively figured out the location but then decided that Pope Francis is alive and the sign is likely a prank, so the photo is from April Fools day.

Then after I explicitly instructed it to search the web to confirm whether the Pope is alive, it found news of his death and corrected its answer, but it was interesting to see how the LLM makes a mistake due to a major recent event being after its cutoff.

CamperBob2•1h ago

To be fair, my local copy of R1 isn't doing any searching at all, but it frequently says "A search suggests..." or something along the lines.

mrcwinn•4h ago

O3 is seriously impressive for coding, as well, with Codex. It seems far superior to 3.7-thinking, although it's also more expensive in my usage.

gizmodo59•4h ago

Agreed. O3 is the best model out there for the tasks Ive tried and coding is a fair chunk of it. Claude 3.7 and Gemini 2.5 pro seems to hallucinate more

weinzierl•4h ago

I tried it with a couple of holiday shots and couple of shots from my window and it is nothing but amazing.

That being said I noticed two things that probably hamper its performance - or make its current performance even more amazing - depending how you look at it:

- It often tries to zoom in to decipher even minuscle text. This works brilliantly. Sometimes it tries to enhance contrast by turning the image into black and white with various threshold levels to improve the results, but in my examples it always went in the wrong direction. For example the text was blown out white, it failed, it turned it even ligher instead of darker, failed again, turned it into a white rectangle and gave up on the approach.

- It seems not to have any access to Google Maps or even Open Street Maps and therefore fails to recognize steet patterns. This is even more baffling than the first point, because it is so unlike how I suppose human geo guessers work.

kazinator•4h ago

This seems like a really silly category in which to be competing against machines.

Machine learning could index million or faces, and then identify members of that set from pictures. Could you memorize millions of people, to be able to put a name to a face?

Why not also compete againt grep -r to see who can find matches for a regex faster across your filesystem.

bongodongobob•3h ago

But that's not what it is doing and why this is cool.

Imnimo•4h ago

On the first image, from the model's CoT:

>"I also notice Cyrillic text on a sign"

Am I missing this somewhere? Is the model hallucinating this?

I'd also be very interested to see a comparison against 4o. 4o was already quite good at GeoGuessr-style tasks. How big of a jump is o3?

plyptepe•3h ago

Turn left and look at the post, there should be a Cyrillic text with a 3 underneath on the closest pole to you.

Imnimo•3h ago

Oh I see, I had missed that o3 saw a second view of the scene, not just the screenshot in the write-up.

j3s•4h ago

isn't anyone else horrified by this? the implication is that given an arbitrary picture, chatgpt can give you a very likely approximate location - expert level doxxing is in the hands of anyone with access to a chatgpt subscription.

feels terrifying, especially for women.

turtlebits•4h ago

It needs a lot of context. If its a private picture, it won't have enough information. I gave it a picture I took of my yard and it's guess spanned several US states.

If its out in public, fair game?

sr-latch•4h ago

the way i see it, before these tools, only someone with a lot of resources (or skills) could track down a location from a picture. now, anyone can do it.

the best case outcome is people become more aware of the privacy implications of posting photos online

micromacrofoot•4h ago

it wasn't that hard before, I've taught it to children, it's just that technical skills of the average person are incredibly low

llms are basically shortcutting a wide swath of easily obtainable skills that many people simply haven't cared to learn

usaar333•4h ago

Been true since gpt-4.

echelon•4h ago

I think this is incredibly cool. As with many things, the good cases will outnumber the bad.

This was always possible, it just wasn't widely distributed.

Having a first class ability to effectively geocode an image feels like it connects the world better. You'll be able to snapshot a movie and find where a scene was filmed, revisit places from old photographs, find where interesting locations in print media are, places that designers and creatives used in their (typically exif-stripped) work, etc.

Imagine when we get this for architecture and nature. Or even more broadly, databases of food from restaurants. Products. Clothing and fashion. You name it.

Imagine precision visual search for everything - that'd be amazing.

daemonologist•3h ago

Keep in mind that this is o3 + web search against a human without web search. A sufficiently motivated person with access to your entire social media history, Google Earth and Streetview, etc. would outperform this significantly and could pinpoint almost any inhabited location with coverage.

If you watch Linus Tech Tips, you may have noticed that when he films at his house everything is blurred out to keep people from locating it - here's a recent example: https://www.youtube.com/watch?v=TD_RYb7m4Pw

All that to say, unfortunately doxxing is already really hard to protect against. I don't think o3's capability makes the threat any harder to protect against, although it might lower the bar to entry somewhat.

mopenstein•3h ago

Why especially women? Is the only thing stopping a person from being harmed is that their location isn't known? Especially women?

derfnugget•4h ago

"These models have more than an individual mind could conceivably memorize."

...so what? Is memorization considered intelligence? Calculators have similar properties.

GeoGuessr is the modern nerds' Rubix Cube. The latest in "explore the world without risk of a sunburn".

SamPatt•2h ago

Geoguessr is great fun, try it sometime.

arm32•4h ago

GeoGuessr aside, I really hope that this tech will be able to help save kids someday, e.g. help with FBI's ECAP (https://www.fbi.gov/wanted/ecap).

thrance•3h ago

I wouldn't put too much hope on this technology bringing more good than harm to the world.

ketzo•3h ago

If we don’t actively try to identify and implement positive use cases, then yes, it’ll definitely bring more harm than good.

Isn’t that all the more reason to call out our high hopes?

thrance•2h ago

I don't know what in my comment made you think I was opposed to seeking positive applications of this technology.

From the guidelines:

> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.

fkyoureadthedoc•1h ago

Oh, we're guidelines posting?

> Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative.

mopenstein•3h ago

But it will bring some percentage of good and some percentage of bad. Which ain't half bad, if you ask me.

moritzwarhier•3h ago

What a quip! What if it's 51% bad?

martinsnow•3h ago

What do you do when it flags you or someone you know who's innocent? Blindly trusting these models without any verification will put innocent people in prison. Normal people don't understand why they are so confident. They're confident because they believe all the data they have is correct. I forsee a future with many faux trials because they don't understand critical thinking.

RussianCow•2h ago

> Blindly trusting these models without any verification will put innocent people in prison.

I don't think anybody is suggesting this. But if the models can gleam information/insights that humans can't, that's still valuable, even if it's wrong some percentage of the time.

snowe2010•1h ago

This is what happened with dna testing at the beginning. Prosecutors claimed it was x percentage accurate when in fact it was hilariously inaccurate. People thought the data was valuable when it wasn’t.

8organicbits•1h ago

If you are interested in the history of pseudoscience in the courtroom and methods for deciding what should be permitted in court, see: https://en.m.wikipedia.org/wiki/Daubert_standard

It is, and will continue to be, a hard problem.

mkoubaa•3h ago

The bad is already priced in. Nothing wrong with hoping for more good.

parsimo2010•1h ago

Looking at those photos, those are some crazy hard pictures- masked regions of the image, partially cropped faces, blurry, pictures of insides of rooms. I don't think any current LLM is going to be able to Sherlock Holmes their way into finding any of those people.

Maybe they will one day if there's a model trained on a facial recognition database with every living person included.

jvvw•4h ago

I'm Master level at Geoguessr - it's a rank where you have to definitely know what you are doing but it isn't as high as it probably sounds from the headline.

Masters is about 800-1200 ELO whereas the pros are 1900-2000ish. I'll know the country straight away on 95% of rounds but I can still have no idea where I am in Russia or Brazil sometimes if there's no info. Scripters can definitely beat me!

paulcole•3h ago

Gotta learn your Brazilian soil!

windowshopping•3h ago

Was it worth it?

rosstex•3h ago

I have 2000+ hours in Team Fortress 2. Was it worth it?

Cyph0n•3h ago

Yes, it was. Granted, I probably have more than that.

make3•3h ago

it's a game, that's like asking why a public service is not profitable

650REDHAIR•3h ago

Yes? It’s fun.

SamPatt•2h ago

Yeah I added a "My skill level" section to talk through that. I'm far from a professional.

But I know enough to be able to determine if the chain of thought it outputs is nonsense or comparable to a good human player. I found it remarkable!

karaterobot•3h ago

I don't really follow OSINT, but I occasionally enjoy the fruits of that labor. I assume these models are all in heavy rotation for identifying a location based on an imperfect photograph. What are other practical implications of a model being better than a human at this?

Sam6late•3h ago

I was wondering if this helps in detecting current spots from old aerial videos, say San Francisco in 2002, how cool would it be to juxtapose both in a new video, San Francisco in 2002: https://www.youtube.com/watch?v=vTR6iftL7yE

or Dubai in 1997 https://www.youtube.com/watch?v=JMNXXiiDRhM

sixtram•2h ago

I'm wondering if you feed all the Google street map photos into a special ML designed just for that, how important could that be for say the CIA or FBI?

inetknght•2h ago

You'd have to be crazy naive to think three-letter agencies haven't already thought of it.

That is: it's extremely valuable to them.

godelski•2h ago

There's two important things here to consider when reading:

1) O3 cheated by using Google search. This is both against the rules of the game and OP didn't use search either

2) OP was much quicker. They didn't record their time but if their final summary is accurate then they were much faster.

It's an apples to oranges comparison. They're both fruit and round, but you're ignoring obvious differences. You're cherry picking.

The title is fraudulent as you can't make a claim like that when one party cheats.

I would find it surprising if OP didn't know these rules considering their credentials. Doing this kind of clickbait completely undermines a playful study like this.

Certainly O3 is impressive, but by over exaggerating its capabilities you taint any impressive feats with deception. It's far better to under sell than over sell. If it's better than expected people are happier, even if the thing is crap. But if you over sell people are angry and feel cheated, even if the thing is revolutionary. I don't know why we insist on doing this in tech, but if you're wondering why so many people hate "tech bros", this is one of the reasons. There's no reason to lie here either! Come on! We can't just normalize this behavior. It's just creating a reasonable expectation for people to be distrusting of technology and anything tech people say. It's pretty fucked up. And no, I don't think "it's just a blog post" makes it any better. It makes it worse, because it normalizes the behavior. There's other reasons to distrust big corporations, I don't want to live in a world where we should have our guards up all the time.

SamPatt•58m ago

>1) O3 cheated by using Google search. This is both against the rules of the game and OP didn't use search either

I re-ran it without search, and it made no difference:

https://news.ycombinator.com/item?id=43837832

>2) OP was much quicker. They didn't record their time but if their final summary is accurate then they were much faster.

Correct. This was the second bullet point of my conclusion:

>Humans still hold a big edge in decision time—most of my guesses were < 2 min, o3 often took > 4 min.”

I genuinely don't believe that I'm exaggerating or this is clickbait. The o3 geolocation capability astounded me, and I wanted to share my awe with others.

shihabkhanbd•2h ago

The most interesting thing to me is how well AI and GeoGuessr fit together. Their specialty is recognizing patterns in large amounts of data which is exactly how human players play the game as well, just probably with faster and more capable recall abilities.

simianparrot•2h ago

I too can beat a master level GeoGuessr if I’m allowed to cheat. Please add that info to the headline and be honest.

jampa•2h ago

I was trying to play with o3 this week to see how close it can identify things, and, interestingly, it tries more pattern matching than its own "logic deduction". For example, it can easily deduce any of my photos from Europe and the US because there are many pictures online that I can search for and see similar pictures.

However, when there are not many photos of the place online, it gets closer but stops seeking deeper into it and instead tries to pattern-match things in its corpus / internet.

One example was an island's popular trail that no longer exists. It has been overgrown since 2020. It said first that the rocks are typical of those of an island and the vegetation is from Brazil, but then it ignored its hunch and tried to look for places in Rio de Janeiro.

Another one was a popular beach known for its natural pools during low tides. I took a photo during high tide, when no one posts pictures. It captured the vegetation and the state correctly. But then it started to search for more popular places elsewhere again.

ksec•2h ago

>But several comments intrigued me:

>>I wonder What happened if you put fake EXIF information and asking it to do the same. ( We are deliberately misleading the LLM )

Yay. That was me [1] which was actually downvoted for most of its time. But Thank You for testing out my theory.

What I realised over the years is that comments do get read by people and do shape other people's thought.

I honestly dont think looking up online is cheating. May be in terms of the game. But in real life situation which is most of the time it is absolutely the right thing to do. The chains of thought is scary. I still dont know anything about how AI works other than old garbage in, garbage out. But CoT is definitely something else. Even though the author said it is sometimes doing needless work, but in terms of computing resources I am not even sure if it matters as long as it is accurate. And it is another proof that may be, just may be AI taking over the world is much closer than I imagined.

[1] https://news.ycombinator.com/item?id=43803985

parsimo2010•1h ago

My comment from the previous post:

> I’m sure there are areas where the location guessing can be scary accurate, like the article managed to guess the exact town as its backup guess. But seeing the chain of thought, I’m confident there are many areas that it will be far less precise. Show it a picture of a trailer park somewhere in Kansas (exclude any signs with the trailer park name and location) and I’ll bet the model only manages to guess the state correctly.

This post, while not a big sample size, reflects how I would expect these models to perform. The model managed to be reliable with guessing the right country, even in pictures without a lot of visual information (I'll claim that getting the country correct in Europe is roughly equivalent to guessing the right state in the USA). It does sometimes manage to get the correct town, but this is not a reliable level of accuracy. The previous article only tested on one picture and it happened to get the correct town as its second guess and the author called it "scary accurate." I suppose that's a judgement call. To me, I've grown to expect that people can identify what country I'm in from a variety of things (IP address, my manner of speech, name, etc.), so I don't think that is "scary."

I will acknowledge that o3 with web search enabled seems capable of playing GeoGuessr at a high level, because that is less of a judgement call. What I want to see now is an o3 GeoGuessr bot to play many matches and see what its ELO is.

SamPatt•1h ago

Author here, I'm glad to see folks find this interesting.

I encourage everyone to try Geoguessr! I love it.

I'm seeing a lot of comments saying that the fact that the o3 model used web search in 2 of 5 rounds made this unfair, and the results invalid.

To determine if that's true, I re-ran the two rounds where o3 used search, and I've updated the post with the results.

Bottom line: It changed nothing. The guesses were nearly identical. You can verify the GPS coordinates in the post.

Here's an example of why it didn't matter. In the Austria round, check out how the model identifies the city based on the mountain in the background:

https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...

It already has so much information that it doesn't need the search.

Would search ever be useful? Of course it would. But in this particular case, it was irrelevant.

bjourne•44m ago

What's your take on man vs. machine? If AI already beats Master level players it seem certain that it will soon beat the Geoguessr world champion too. Will people still derive pleasure from playing it, like with chess?

SamPatt•35m ago

>Will people still derive pleasure from playing it, like with chess?

Exactly - I see it just like chess, which I also play and enjoy.

The only problem is cheating. I don't have an answer for that, except right now it's too slow to do that effectively, at least consistently.

Otherwise, I don't care that a machine is better than I am.

Trump: "Jeff Bezos is nice guy"

The Telepathic Computer's AI problem and what Siri should be

The fewer the merrier: The merits of unified land ownership

Law Enforcement Can Break 77% of 'Three Random Word' Passwords

Throwing it all away – how extreme rewriting changed the way I build databases

Mastercard Gives AI Agents Ability to Shop Online for You

Master Plan, Part Deux (2016)

Show HN: Crowdfunding app built using just Flask and SQLite

How to get in the flow while coding (and why it's important)

Influencers blur political content – campaigning line, swaying elections

OpenAI rolls back update that made ChatGPT a sycophantic mess

A project to bring back AOL, AIM, ICQ, Yahoo Messenger, and even Q-Link

Can your IC design system do this for free?

Protecting Windows Users from Janet Jackson's Rhythm Nation

Computer Aids for VLSI Design

Amazon Rules Out Displaying Tariff Impact After White House Attack

Prompt Library with 500 prompt engineered prompts

What is bug hunting and why is it changing?

How much has Elon Musk's Doge cut from US Government spending?

Realizing America's Drone Revolution

Swiftwave: Self-hosted lightweight PaaS solution to deploy and manage your apps

Microsoft Confirms $1.50 Windows Security Update Hotpatch Fee

Mark Zuckerberg – Llama 4, DeepSeek, AI Friends, & Race to AGI [video]

Wikipedia's LTA (Long Term Abuse) List

Comparison with Traditional Mathematics

A tool for migrating and optimizing prompts from other LLMs to Llama

Do imports of cheap solar panels help or hurt domestic jobs in clean energy?

Only Teslas Exempt from New Auto Tariffs Thanks to 85% Domestic Content Rule

Impact, Agency, and Taste

Google Play unable to complete identity verification

Trump: "Jeff Bezos is nice guy"

The Telepathic Computer's AI problem and what Siri should be

The fewer the merrier: The merits of unified land ownership

Law Enforcement Can Break 77% of 'Three Random Word' Passwords

Throwing it all away – how extreme rewriting changed the way I build databases

Mastercard Gives AI Agents Ability to Shop Online for You

Master Plan, Part Deux (2016)

Show HN: Crowdfunding app built using just Flask and SQLite

How to get in the flow while coding (and why it's important)

Influencers blur political content – campaigning line, swaying elections

OpenAI rolls back update that made ChatGPT a sycophantic mess

A project to bring back AOL, AIM, ICQ, Yahoo Messenger, and even Q-Link

Can your IC design system do this for free?

Protecting Windows Users from Janet Jackson's Rhythm Nation

Computer Aids for VLSI Design

Amazon Rules Out Displaying Tariff Impact After White House Attack

Prompt Library with 500 prompt engineered prompts

What is bug hunting and why is it changing?

How much has Elon Musk's Doge cut from US Government spending?

Realizing America's Drone Revolution

Swiftwave: Self-hosted lightweight PaaS solution to deploy and manage your apps

Microsoft Confirms $1.50 Windows Security Update Hotpatch Fee

Mark Zuckerberg – Llama 4, DeepSeek, AI Friends, & Race to AGI [video]

Wikipedia's LTA (Long Term Abuse) List

Comparison with Traditional Mathematics

A tool for migrating and optimizing prompts from other LLMs to Llama

Do imports of cheap solar panels help or hurt domestic jobs in clean energy?

Only Teslas Exempt from New Auto Tariffs Thanks to 85% Domestic Content Rule

Impact, Agency, and Taste

Google Play unable to complete identity verification

O3 beats a master-level GeoGuessr player, even with fake EXIF data

Comments