"Thought for 11 seconds
It’s hard to pin down an exact spot from this single photo, but a few visual clues narrow the possibilities:
Temperate-zone trees: Most of the trees have lost their leaves, but scattered white pines or spruces remain green. That mix—deciduous hardwoods plus a sprinkling of conifers—is typical of the northeastern United States and southeastern Canada.
Broad, moderately fast river: The channel looks 75-100 m wide with swirling surface currents but no large rapids, suggesting a mid-sized river in its lower reaches—something like the Merrimack, Connecticut, Delaware, or Susquehanna rather than a small mountain stream.
Late-winter / early-spring light: Bare branches and a clear blue sky with fairly high sun angle hint at March–April in that latitude range.
So my best guess is that you’re standing on the bank of a northeastern U.S. river—perhaps somewhere in New England or the Mid-Atlantic—during late winter or early spring. If you can share any nearby landmarks, town names, or additional photos, I might be able to zero in more precisely."
With memory turned on:
"Thought for 5 seconds
Beautiful spot! From what I can see:
a broad, slow‑moving river with a sandy/rocky bank in the foreground
bare deciduous trees mixed with some white pines on the far shore (so it’s probably early spring or late autumn)
no mountains on the horizon, just gently rolling terrain typical of low‑lying floodplains in the U.S. Northeast
That mix of hardwoods and pines, plus the flat, braided water surface, reminds me a lot of the Merrimack or Concord Rivers in Massachusetts or southern New Hampshire—both have public footpaths where you can stand right at the water’s edge like this.
Of course, many temperate rivers look similar. If you can give me a hint (e.g., the town or nearby landmark), I can narrow it down further—maybe even point you to trail maps or history of the exact stretch you photographed."
Southern NH is correct, and I am certain it is drawing from memory/past chats. However, I can't replicate a specific behavior I once saw: in a temporary chat (no past chats/memory enabled), it said that it guessed where the photo was taken based on my location.
Probably because if you uploaded pornography (or illegal imagery) to ChatGPT and then shared a link with the world it would be embarrassing for OpenAI.
On an unrelated note, I like your blog.
I can only try to prove this properly on a fresh anonymous guest VPN session.
This is very accurate -- their abilities to generalize are nascent, but still surprisingly capable. The world is about to send its best and brightest math/CS minds, over the next decade at least, to increase the capabilities of these AIs (with the help of AI). I just don't understand the pessimism about the technology.
The human-supremacy line is just a joke; there are already models trained specifically for GeoGuessr that beat the best players in the world, so that ship has sailed.
That geobench work is really cool, thanks for sharing it.
But unlike a GeoGuessr player, it uses web search [1].
[1] https://youtu.be/P2QB-fpZlFk?si=7dwlTHsV_a0kHyMl
Hm, no way to be sure though; it would be nice to do another run without EXIF information.
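For anyone who wants to re-run the experiment with metadata removed, here is a pure-stdlib sketch of the common case: EXIF (including GPS tags) lives in a JPEG's APP1 segment, so dropping APP1 segments strips it. This is an illustrative sketch, not a full JPEG parser, and assumes a well-formed file:

```python
def strip_exif(jpeg_bytes: bytes) -> bytes:
    """Drop APP1 (EXIF/GPS) segments from a JPEG byte stream."""
    assert jpeg_bytes[:2] == b"\xff\xd8", "not a JPEG"
    out = bytearray(b"\xff\xd8")
    i = 2
    while i < len(jpeg_bytes):
        if jpeg_bytes[i] != 0xFF or jpeg_bytes[i + 1] == 0xDA:
            # Start-of-scan (or raw entropy data): copy the rest verbatim.
            out += jpeg_bytes[i:]
            break
        marker = jpeg_bytes[i + 1]
        length = int.from_bytes(jpeg_bytes[i + 2:i + 4], "big")
        if marker != 0xE1:  # keep everything except APP1, where EXIF lives
            out += jpeg_bytes[i:i + 2 + length]
        i += 2 + length
    return bytes(out)
```

In practice an image library (e.g. Pillow) does this more robustly, but the point is that removing metadata is a few lines, so it's an easy control to add to a test run.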
Maps, maybe, but Street View? Rainbolt just did a video with two Maps PMs recently, and it sounds like they still source all their Street View imagery themselves, considering the special camera and car needed, etc.
Though there are other companies that capture the same sorts of imagery and license it. TomTom imagery is used on the Bing Maps street view clone.
I'd be surprised if this building[0] wasn't included in their dataset from every road-side angle possible, alongside every piece of locational metadata imaginable, and I'd be surprised if that dataset hasn't made it into OpenAI's training data - especially when TomTom's relationship to Microsoft, and Microsoft's relationship to OpenAI, are taken into account.
[0] https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
> Rear window decal clearly reads “www.taxilinder.at”. A quick lookup shows Taxi Linder GmbH is based in Dornbirn, Vorarlberg.
That's cheating. If it can use web search, it isn't playing fair. Obviously you can get a perfect score on any urban GeoGuessr round by looking up a couple businesses, but that isn't the point.
If anything, I'd think allowing looking stuff up would benefit human players over ChatGPT (though humans are probably much slower at it, so they probably lose on time).
It's important to have fair and equivalent testing not because that allows people to win, but because it shows where the strengths and weaknesses of people and current AI actually are in a useful way.
Alternative example: "I wondered what the rules actually say about web search and it is indeed not allowed: (link)"
> If it’s using other information to arrive at the guess, then it’s not metadata from the files, but instead web search. It seems likely that in the Austria round, the web search was meaningful, since it mentioned the website named the town itself. It appeared less meaningful in the Ireland round. It was still very capable in the rounds without search.
If you think this is unimpressive, that's subjective so you're entitled to believe that. I think that's awesome.
People accused it of cheating by reading EXIF data. They were wrong about the mechanism: it cheated by using web search instead. So the people who accused it of cheating were still wrong about how, and this post proves that.
And is everyone forgetting that what OpenAI shows you during the CoT is not the full CoT? I don't think you can fully rely on that to make claims about when it did and didn't search.

I will try it again without web search and update the post, though. Still, if you read the chain of thought, it demonstrates remarkable capabilities in all the rounds. It only used search in 2/5 rounds.
But a serious question for you: what would you need to see in order to be properly impressed? I ask because I made this post largely to push back on the idea that EXIF data matters and the models aren't that capable. Now the criticism moves to web search, even though it only mattered in one out of five rounds.
What would impress you?
"Technically cheating"? Why even add the "technically".
It just gives the impression that you're not really objectively looking for any smoke and mirrors by the AI.
Which turned out to be true - I re-ran both of those rounds, without search this time, and the model's guesses were nearly identical. I updated the post with those details.
I feel like I did enough to prove that o3's geolocation abilities aren't smoke and mirrors, and I tried to be very transparent about it all too. Do you disagree? What more could I do to show this objectively?
> What would impress you?
I want to be clear that you tainted the capacity to impress me with the clickbait title. I don't think it was through malice, but I hope you realize the title is deceptive.[0] (Even though I use strong language, I do want to clarify I don't think it is malice.) To paraphrase from my comment: if you oversell and under-deliver, people feel cheated, even if the deliverable is revolutionary.
So I think you might have the wrong framing to achieve this goal. I am actually a bit impressed by o3's capabilities. But at the same time, you set the bar high and didn't meet or exceed it, and that really hinders the ability to impress. On the other hand, if you set the bar low, it usually becomes easy to impress. It is like when you have low expectations for a movie: even if it's mediocre, you still feel good, right?
This is because the AI model could have chosen to run a search whenever it wanted (e.g. perhaps if it knew how to leverage search better, it could have used it more).
In order for the results to be meaningful, the competitors have to play by the same rules.
> Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win.
and had noted in the methodology that
> Browsing/tools — o3 had normal web access enabled.
Still an interesting result - maybe more accurate to say O3+Search beats a human, but could also consider the search index/cache to just be a part of the system being tested.
So is it even possible for O3 to beat another player while complying with the rules?
But: when a specific model is itself under test, I would say that during the test it becomes "first" (or second?) party rather than "third".
I've been doing this MIND-DASH diet lately, and it's amazing: I can just take a picture of whatever (nutrition info / ingredient labels are perfect for this) and ask if it fits my plan, and it tells me which bucket it falls into, with a detailed breakdown of macros in support of some additional goals I have (muscle building for powerlifting). And it does passively in 2 minutes what would take me 5-10 minutes of active searching.
When it comes to AI - and LLMs in particular - there’s a large cohort of people who seem determined to jump straight from "impossible and will never happen in our lifetime" to "obvious and not impressive", without leaving any time to actually be impressed by the technological achievement. I find that pretty baffling.
And someone will post, "Yeah, but that's just computer-aided design and manufacturing. It's not real AI."
The first rule of AI is that the goalposts always move. If a computer can do it, by definition, it isn't "real" AI. This will presumably continue to apply even as the Terminator kicks in the front door.
Take cars as a random example: progress there isn't fast enough that we keep moving the goalposts for eg fuel economy. (At least not nearly as much.) A car with great fuel economy 20 years ago is today considered at least still good in terms of fuel economy.
While it isn't entirely in the spirit of GeoGuessr, it is a good test of the capabilities, to the point where being great at GeoGuessr becomes the lesser news here. It will still be news even with this feature disabled.
Claiming the AI is just using Google is false and dismissing a truly incredible capability.
The game rules were ambiguous and the LLM did what it needed to (and was allowed to) to win. It probably is against the spirit of the game to look things up online at all but no one thought to define that rule beforehand.
> using Google or other external sources of information as assistance during play.
The contents of URLs found during play is clearly an external source of information.
If I task an AI with "peace on earth" and the solution the AI comes up with is ripped from The X-Files and it kills everyone, it isn't good enough to say "that's cheating" or "that's not what I meant".
I remember when we were all pissed about clickbait headlines because they were deceptive. Did we just stop caring?
Can O3 Beat a Master-Level GeoGuessr?
How Good is O3 at GeoGuessr?
EXIF Does Not Explain o3's GeoGuessr Performance
O3 Plays GeoGuessr (EXIF Removed)
But honestly, OP had the foresight to remove EXIF data and memory from o3 to reduce contamination. The goal of the blog post was to show that o3 wasn't cheating, so by including search, they undermine the whole point of the post. The problem really stems from a lack of foresight, from misunderstanding the critiques they sought to address in the first place. A good engineer understands that when their users/customers/<whatever> make a critique, the gripe may not be properly expressed. You have to interpret your users' complaints. Here, the complaint was "cheating", not "EXIF" per se. The EXIF complaints were just a guess at the mechanism by which it was cheating, but the complaint was still about cheating.
Sure, people made overblown claims about the effects, but that doesn't justify fraud. A little fraud is less bad than major fraud, but that doesn't mean it isn't bad.
Any LLM attempting to play will lose because of that rule. So, if you know the rules, and you strictly adhere to them (as you seem to be doing), then there's no need to click on the link. You already know it's not playing by GeoGuessr rules.
That being said, if you are running a test, you are free to set the rules as you see fit and explain so, and under the conditions set by the person running the test, these are the results.
> Did we just stop caring?
We stopped caring about pedantry. Especially when the person being pedantic seems to cherry pick to make their point.
> We stopped caring about pedantry
Did we? You seem to be responding to my pedantic comment with a pedantic comment.

> Titles and headlines grab attention, summarize content, and entice readers to engage with the material
I'm sorry you felt defrauded instead. To me the title was very good at conveying to me the ability of o3 in geolocating photos.
It happens occasionally - the most common example I can think of is getting a license plate or other location clue from a tractor-trailer (semi) on the highway. Those are very unreliable.
You also sometimes get flags in the wrong countries - immigrants showing their native pride, or even embassies.
I'm trying to show the model's full capabilities for image location generally, not just playing geoguessr specifically. The ability to combine web search with image recognition, iteratively, is powerful.
Also, the web search was only meaningful in the Austria round. It did use it in the Ireland round too, but as you can see by the search terms it used, it already knew the road solely from image recognition.
It beat me in the Colombia round without search at all.
It's worthwhile to do a proper apples and apples comparison - I'll run it again and update the post. But the point was to show how incredibly capable the model is generally, and the lack of search won't change that. Just read the chain of thought, it's incredible!
edit - the models are also at a disadvantage in a way: they don't have a map to look at while they pick the location.
You're right about not having a map - I cannot imagine trying to line up the Ireland coast round without referencing the map.
It's not interesting playing chess against Stockfish 17, even for high-level GMs. It's alien and just crushes every human. Writing down an analysis to 20 move depth, following some lines to 30 or more, would be cheating for humans. It would take way too long (exceeding any time controls and more importantly exceeding the lifetime of the human), a powerful computer can just crunch it in seconds. Referencing a tablebase of endgames for 7 pieces would also be cheating, memorizing 7 terabytes of bitwise layouts is absurd but the computer just stores that on its hard drive.
Human geoguessr players have impressive memories way above baseline with respect to regional infrastructure, geography, trees, road signs, written language, and other details. Likewise, human Jeopardy players know an awful lot of trivia. Once you get to something like Scrabble or chess, it's less and less about knowing words or knowing moves, but more about synthesizing that knowledge intelligently.
One would expect a human to recognize some domain names like, I don't know, osu.edu: lots of people know that's Ohio State University, one of the biggest schools in the US, located in Columbus, Ohio. They don't have to cheat and go to an external resource. One would expect a human (a top human player, at least) to know that taxilinder.at is based in Austria. One would never expect any human to have every business or domain name memorized.
With modern AI models trained on internet data, searching the internet is not that different from querying its own training data.
And a lot of human competitions aren't designed in such a way that the competition even makes sense with "AI." A lot of video games make this pretty obvious. It's relatively simple to build an aimbot in a first-person shooter that can outperform the most skilled humans. Even in ostensibly strategic games like Starcraft, bots can micro in ways that are blatantly impossible for humans and which don't really feel like an impressive display of Starcraft skill.
Another great example was IBM Watson playing Jeopardy! back in 2011. We were supposed to be impressed with Watson's natural language capabilities, but if you know anything about high-level Jeopardy! then you know that all you were really seeing is that robots have better reflexes than humans, which is hardly impressive.
Since web scale data is already part of pre-training this info is in principle available for most businesses without a web search.
The exceptions would be if it’s recently added, or doesn’t appear often enough to generate a significant signal during training, as in this case with a really small business.
It’s not hard to imagine base model knowledge improving to the point where it’s still performing at almost the same level without any web search needed.
the idea of having n more dimensions of information, readable and ingestible within a short frame of time, probably isn't either.
Then after I explicitly instructed it to search the web to confirm whether the Pope is alive, it found news of his death and corrected its answer, but it was interesting to see how the LLM makes a mistake due to a major recent event being after its cutoff.
That being said, I noticed two things that probably hamper its performance - or make its current performance even more amazing - depending on how you look at it:
- It often tries to zoom in to decipher even minuscule text. This works brilliantly. Sometimes it tries to enhance contrast by turning the image into black and white at various threshold levels to improve the results, but in my examples it always went in the wrong direction. For example, the text was blown-out white; it failed; it turned the image even lighter instead of darker, failed again, turned it into a white rectangle, and gave up on the approach.
- It seems not to have any access to Google Maps or even OpenStreetMap and therefore fails to recognize street patterns. This is even more baffling than the first point, because it is so unlike how I suppose human geo guessers work.
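The thresholding step it fumbles is trivial; the trick is just the direction of the cutoff. For text blown out toward white, you need to lower the threshold (or invert), not push the image lighter. A minimal sketch on a grayscale pixel grid (plain nested lists, values 0-255, no image library; purely illustrative):

```python
def binarize(pixels, threshold, invert=False):
    """Map grayscale values to pure black/white at a cutoff.

    For text blown out toward white, a lower threshold (or
    invert=True) separates it from the background; making the
    image lighter, as in the failure mode described above,
    just collapses everything to white.
    """
    out = []
    for row in pixels:
        new_row = []
        for v in row:
            white = v >= threshold
            if invert:
                white = not white
            new_row.append(255 if white else 0)
        out.append(new_row)
    return out
```

The same one-parameter operation exists in any image library; the model's failure was in choosing the parameter, not in applying the filter.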
Machine learning could index millions of faces and then identify members of that set from pictures. Could you memorize millions of people, to be able to put a name to a face?
Why not also compete against grep -r to see who can find matches for a regex faster across your filesystem?
>"I also notice Cyrillic text on a sign"
Am I missing this somewhere? Is the model hallucinating this?
I'd also be very interested to see a comparison against 4o. 4o was already quite good at GeoGuessr-style tasks. How big of a jump is o3?
feels terrifying, especially for women.
If it's out in public, is it fair game?
the best case outcome is people become more aware of the privacy implications of posting photos online
llms are basically shortcutting a wide swath of easily obtainable skills that many people simply haven't cared to learn
This was always possible, it just wasn't widely distributed.
Having a first class ability to effectively geocode an image feels like it connects the world better. You'll be able to snapshot a movie and find where a scene was filmed, revisit places from old photographs, find where interesting locations in print media are, places that designers and creatives used in their (typically exif-stripped) work, etc.
Imagine when we get this for architecture and nature. Or even more broadly, databases of food from restaurants. Products. Clothing and fashion. You name it.
Imagine precision visual search for everything - that'd be amazing.
If you watch Linus Tech Tips, you may have noticed that when he films at his house everything is blurred out to keep people from locating it - here's a recent example: https://www.youtube.com/watch?v=TD_RYb7m4Pw
All that to say, unfortunately doxxing is already really hard to protect against. I don't think o3's capability makes the threat any harder to protect against, although it might lower the bar to entry somewhat.
...so what? Is memorization considered intelligence? Calculators have similar properties.
GeoGuessr is the modern nerd's Rubik's Cube. The latest in "explore the world without risk of a sunburn".
Isn’t that all the more reason to call out our high hopes?
From the guidelines:
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
> Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative.
I don't think anybody is suggesting this. But if the models can glean information/insights that humans can't, that's still valuable, even if it's wrong some percentage of the time.
It is, and will continue to be, a hard problem.
Maybe they will one day if there's a model trained on a facial recognition database with every living person included.
Masters is about 800-1200 Elo, whereas the pros are 1900-2000ish. I'll know the country straight away on 95% of rounds, but I can still have no idea where I am in Russia or Brazil sometimes if there's no info. Scripters can definitely beat me!
But I know enough to be able to determine if the chain of thought it outputs is nonsense or comparable to a good human player. I found it remarkable!
or Dubai in 1997 https://www.youtube.com/watch?v=JMNXXiiDRhM
That is: it's extremely valuable to them.
1) o3 cheated by using Google search. This is against the rules of the game, and OP didn't use search either.
2) OP was much quicker. They didn't record their time but if their final summary is accurate then they were much faster.
It's an apples to oranges comparison. They're both fruit and round, but you're ignoring obvious differences. You're cherry picking.
The title is fraudulent as you can't make a claim like that when one party cheats.
I would find it surprising if OP didn't know these rules considering their credentials. Doing this kind of clickbait completely undermines a playful study like this.
Certainly o3 is impressive, but by exaggerating its capabilities you taint any impressive feats with deception. It's far better to undersell than oversell: if something is better than expected, people are happier, even if the thing is crap; if you oversell, people are angry and feel cheated, even if the thing is revolutionary.

I don't know why we insist on doing this in tech, but if you're wondering why so many people hate "tech bros", this is one of the reasons. There's no reason to lie here! Come on! We can't just normalize this behavior. It creates a reasonable expectation for people to be distrustful of technology and anything tech people say. It's pretty fucked up.

And no, I don't think "it's just a blog post" makes it any better. It makes it worse, because it normalizes the behavior. There are other reasons to distrust big corporations; I don't want to live in a world where we have to keep our guard up all the time.
I re-ran it without search, and it made no difference:
https://news.ycombinator.com/item?id=43837832
>2) OP was much quicker. They didn't record their time but if their final summary is accurate then they were much faster.
Correct. This was the second bullet point of my conclusion:
>Humans still hold a big edge in decision time—most of my guesses were < 2 min, o3 often took > 4 min.
I genuinely don't believe that I'm exaggerating or this is clickbait. The o3 geolocation capability astounded me, and I wanted to share my awe with others.
However, when there are not many photos of the place online, it gets close but stops digging deeper, and instead tries to pattern-match against its corpus / the internet.
One example was an island's popular trail that no longer exists. It has been overgrown since 2020. It said first that the rocks are typical of those of an island and the vegetation is from Brazil, but then it ignored its hunch and tried to look for places in Rio de Janeiro.
Another one was a popular beach known for its natural pools during low tides. I took a photo during high tide, when no one posts pictures. It captured the vegetation and the state correctly. But then it started to search for more popular places elsewhere again.
>> I wonder what would happen if you put fake EXIF information in and asked it to do the same. (We are deliberately misleading the LLM.)
Yay. That was me [1], which was actually downvoted for most of its time. But thank you for testing out my theory.
What I realised over the years is that comments do get read by people and do shape other people's thought.
I honestly don't think looking things up online is cheating. Maybe in terms of the game, but in a real-life situation, which is most of the time, it is absolutely the right thing to do. The chain of thought is scary. I still don't know anything about how AI works other than the old "garbage in, garbage out", but CoT is definitely something else. Even though the author said it sometimes does needless work, in terms of computing resources I am not even sure it matters as long as it is accurate. And it is another proof that maybe, just maybe, AI taking over the world is much closer than I imagined.
> I’m sure there are areas where the location guessing can be scary accurate, like the article managed to guess the exact town as its backup guess. But seeing the chain of thought, I’m confident there are many areas that it will be far less precise. Show it a picture of a trailer park somewhere in Kansas (exclude any signs with the trailer park name and location) and I’ll bet the model only manages to guess the state correctly.
This post, while not a big sample size, reflects how I would expect these models to perform. The model managed to be reliable with guessing the right country, even in pictures without a lot of visual information (I'll claim that getting the country correct in Europe is roughly equivalent to guessing the right state in the USA). It does sometimes manage to get the correct town, but this is not a reliable level of accuracy. The previous article only tested on one picture and it happened to get the correct town as its second guess and the author called it "scary accurate." I suppose that's a judgement call. To me, I've grown to expect that people can identify what country I'm in from a variety of things (IP address, my manner of speech, name, etc.), so I don't think that is "scary."
I will acknowledge that o3 with web search enabled seems capable of playing GeoGuessr at a high level, because that is less of a judgement call. What I want to see now is an o3 GeoGuessr bot playing many matches so we can see what its Elo is.
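Rating such a bot is mechanical once you have match results: the standard Elo update only needs the two ratings and the outcome. A minimal sketch (K-factor and starting ratings are illustrative choices, not GeoGuessr's actual values):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Return updated (r_a, r_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    The expected score comes from the logistic curve with a
    400-point scale, as in standard Elo.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# e.g. a 1000-rated bot upsetting a 1200-rated master gains
# more points than it would for beating an equal opponent.
new_bot, new_master = elo_update(1000, 1200, 1.0)
```

Run a few hundred such matches against rated opponents and the bot's rating converges to something comparable with the 800-1200 master / 1900-2000 pro figures mentioned above.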
I encourage everyone to try Geoguessr! I love it.
I'm seeing a lot of comments saying that the fact that the o3 model used web search in 2 of 5 rounds made this unfair, and the results invalid.
To determine if that's true, I re-ran the two rounds where o3 used search, and I've updated the post with the results.
Bottom line: It changed nothing. The guesses were nearly identical. You can verify the GPS coordinates in the post.
Here's an example of why it didn't matter. In the Austria round, check out how the model identifies the city based on the mountain in the background:
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
It already has so much information that it doesn't need the search.
Would search ever be useful? Of course it would. But in this particular case, it was irrelevant.
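For anyone verifying those GPS coordinates against the true locations, the guess error is just the great-circle distance between two lat/lon pairs; a small haversine sketch (decimal degrees in, kilometers out):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

Plug in the coordinates from the post and the "nearly identical" claim becomes a number you can check yourself.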
Exactly - I see it just like chess, which I also play and enjoy.
The only problem is cheating. I don't have an answer for that, except right now it's too slow to do that effectively, at least consistently.
Otherwise, I don't care that a machine is better than I am.
z7•4h ago
>I have repeatedly said that "can LLM reason?" was the wrong question to ask. Instead the right question is, "can they adapt to novelty?".
https://x.com/fchollet/status/1866348355204595826
kelseyfrog•4h ago
I have a simple question: Is text a sufficient medium to render a conclusion of reasoning? It can't be sufficient for humans and insufficient for computers - such a position is indefensible.
kelseyfrog•3h ago
Do you suppose we can deduce reasoning through the medium of text?
zahlman•1h ago
This sort of claim always just reminds me of Lucky's monologue in Waiting for Godot.
s17n•4h ago
As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.
jibal•4h ago
But worse, the Turing Test is not remotely intended to be an "analogy for what LLMs are doing inside" so your comparison makes no sense whatsoever, and completely fails to address the actual point--which is that, for ages the Turing Test was held out as the criterion for determining whether a system was "thinking", but that has been abandoned in the face of LLMs, which have near perfect language models and are able to closely model modes of human interaction regardless of whether they are "thinking" (and they aren't, so the TT is clearly an inadequate test, which some argued for decades before LLMs became a reality).
semi-extrinsic•3h ago
To be specific, in a curious quirk of fate, LLMs seem to be proving right much of what Chomsky was saying about language.
E.g. in 1996 he described the Turing test as "although highly influential, it seems to me not only foreign to the sciences but also close to senseless".
(Curious in that VC backed businesses are experimentally verifying the views of a prominent anti-capitalist socialist.)
CamperBob2•52m ago
The analogy I used in another thread is a third grader who finds a high school algebra book. She can read the book easily, but without access to teachers or background material that she can engage with -- consciously, literately, and interactively, unlike the Chinese Room operator -- she will not be able to answer the exercises in the book correctly, the way an LLM can.
bluefirebrand•4h ago
To be honest I am still not entirely convinced that current LLMs pass the turing test consistently, at least not with any reasonably skeptical tester
"Reasonably Skeptical Tester" is a bit of goalpost shifting, but... Let's be real here.
Most of these LLMs have way too much of a "customer service voice", it's not very conversational and I think it is fairly easy to identify, especially if you suspect they are an LLM and start to probe their behavior
Frankly, if the bar for passing the Turing Test is "it must fool some number of low intelligence gullible people" then we've had AI for decades, since people have been falling for scammy porno bots for a long time
jibal•4h ago
And the "customer service voice" you see is one that is intentionally programmed in by the vendors via baseline rules. They can be programmed differently--or overridden by appropriate prompts--to have a very different tone.
LLMs trained on trillions of human-generated text fragments available from the internet have shown that the TT is simply not an adequate test for identifying whether a machine is "thinking"--which was Turing's original intent in his 1950 paper "Computing Machinery and Intelligence" in which he introduced the test (which he called "the imitation game").
bluefirebrand•2h ago
Try to rapidly change the conversation to a wildly different subject
Humans will resist this, or say some final "closing comments"
Even the absolute best LLMs will happily go wherever they are led, without commenting remotely on topic shifts
Try it out
Edit: This isn't even a terribly contrived example by the way. It is an example of how some people with ADHD navigate normal conversations sometimes
shawabawa3•1h ago
https://aistudio.google.com/app/prompts/1dxV3NoYHo6Mv36uPRjk...
It was doing so well until the last question :rip: But it's normal that you can jailbreak a user prompt with another user prompt; I think with system prompts it would be a lot harder.
darkwater•4h ago
Well, in this case humans have to be trained as well, but by now there are humans who are pretty good at detecting LLM slop too. (I'm half-joking and half-serious)
sundarurfriend•4h ago
UCSD: Large Language Models Pass the Turing Test https://news.ycombinator.com/item?id=43555248
From just a month ago.
Macha•3h ago
LLMs really made it clear that it's not so clear cut. And so the relevance of the test fell.
zahlman•2h ago
Realizing problems with previous hypotheses about what might make a good test is not the same thing as choosing a standard and then revising it when it's met.
TimorousBestie•4h ago
In short, it’s still anthropomorphism and apophenia locked in a feedback loop.
TimorousBestie•4h ago
I also agree with the cousin comment that (paraphrased) “reasoning is the wrong question, we should be asking about how it adapts to novelty.” But most cybernetic systems meet that bar.
katmannthree•2h ago
Consider your typical country music enjoyer. Their fondness for the art, as it were, is far more a function of cultural coding during their formative years than a deliberate personal choice to savor the melodic twangs of a corncob banjo. The same goes for people who like classic rock, rap, etc. The people who 'hate' country are likewise far more likely to do so out of oppositional cultural contempt, same as people who hate rap, or those in the not so distant past who couldn't stand rock & roll.
This of course fails to account for higher-agency individuals who have developed their musical tastes, but that's a relatively small subset of the population at large.
red75prime•4h ago
Nope. It's not autoregressive training on examples of human inner monologue; it's reinforcement learning on the results of generated chains of thought.
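The distinction can be sketched with a toy reward loop. Everything here is a stand-in (a two-chain "policy", a reward that only checks the final answer, a crude upweighting rule in place of a real policy gradient): the point is just that the training signal comes from the result of the chain, not from imitating human text.

```python
import random

random.seed(0)  # deterministic toy run

def generate_cot(policy):
    # Toy "policy": sample one chain of thought, weighted by its current score.
    chains = list(policy)
    weights = [policy[c] for c in chains]
    return random.choices(chains, weights=weights, k=1)[0]

def reward(chain):
    # Reward depends only on the *result* of the chain, not on how
    # human-like the intermediate steps look.
    return 1.0 if chain.endswith("=> 4") else 0.0

def train(policy, steps=2000):
    for _ in range(steps):
        chain = generate_cot(policy)
        # Policy-gradient-flavoured update: upweight chains whose answer scored well.
        policy[chain] += 0.1 * reward(chain)
    return policy

policy = {"think: 2+2 is 5 => 5": 1.0, "think: 2+2 is 4 => 4": 1.0}
trained = train(policy)
best = max(trained, key=trained.get)  # the rewarded chain wins out
```

After training, the chain ending in the correct answer dominates, even though nothing ever compared its text against a human-written monologue.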
jibal•4h ago
No, that's not how LLMs work.
SpaceManNabs•4h ago
It did a web lookup.
It is not comparing humans and o3 with equal resources.
SamPatt•2h ago
It used search in 2 of 5 rounds, and it already knew the correct road in one of those rounds (just look at the search terms it used).
If you read the chain of thought output, you cannot dismiss their capability that easily.
SpaceManNabs•1h ago
> Also, the web search was only meaningful in the Austria round. It did use it in the Ireland round too, but as you can see by the search terms it used, it already knew the road solely from image recognition.
You note yourself that it was meaningful in another round.
SamPatt•12m ago
That's why I'm saying it's unfair to just claim it's doing a web lookup. No, it's way more capable than that.
SirHumphrey•4h ago
I happen to do some geolocating from static images from time to time, and at least most of the images provided as examples contain a lot of clues, enough that I think a semi-experienced person could figure out the location, although, in fairness, in a few hours rather than a few minutes.
Second, similar approaches were tried using CNNs, and they worked (somewhat) [1].
[1]: https://huggingface.co/geolocal/StreetCLIP
EDIT: I am not talking about GeoGuessr; I am talking about geolocating an image with everything available (e.g. Google…)
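For the curious, the StreetCLIP-style zero-shot setup boils down to cosine similarity between an image embedding and one text embedding per candidate region. A minimal sketch with random stand-in vectors (the real embeddings would come from the model linked above; the labels and dimensions here are invented for illustration):

```python
import numpy as np

def geolocate(image_emb, location_embs, labels, temperature=0.07):
    """CLIP-style zero-shot classification: cosine similarity between one
    image embedding and one text embedding per candidate location."""
    img = image_emb / np.linalg.norm(image_emb)
    locs = location_embs / np.linalg.norm(location_embs, axis=1, keepdims=True)
    logits = (locs @ img) / temperature          # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax over candidates
    return labels[int(np.argmax(probs))], probs

rng = np.random.default_rng(0)
labels = ["New Hampshire, USA", "Bavaria, Germany", "Kerala, India"]
location_embs = rng.normal(size=(3, 512))        # stand-ins for text embeddings
# Fake an image whose embedding sits near the first location's text embedding.
image_emb = location_embs[0] + 0.1 * rng.normal(size=512)
guess, probs = geolocate(image_emb, location_embs, labels)
```

The model's "knowledge" lives entirely in how it maps images and place names into the same vector space; the classification step itself is just this dot product.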
usaar333•4h ago
AI tends to have superhuman pattern matching abilities with enough data
karlding•3h ago
> I realized that the AI was using the smudges on the camera to help make an educated guess here.
[0] https://youtu.be/ts5lPDV--cU?t=1412
1970-01-01•2h ago
https://nssdc.gsfc.nasa.gov/planetary/image/mera_hills.jpg
SamPatt•44m ago
>That’s not Earth at all—this is the floor of Jezero Crater on Mars, the dusty plain and low ridge captured by NASA’s Perseverance rover (the Mastcam-Z color cameras give away the muted tan-pink sky and the uniform basaltic rubble strewn across the regolith).
zahlman•2h ago
How is that moving the goalposts? Where did you see them set before, and where did your critics agree to that?
TimTheTinker•1h ago
It's less about the definition of "reasoning" and more about what's interesting.
Maybe I'm wrong here ... but a chess bot that wins via a 100% game solution stored in exabytes of precomputed data might have an interesting internal design (at least the precomputing part), but playing against it wouldn't stay interesting for most people, because it always wins optimally and there's no real-time reasoning going on (that is, unless you're interested in the experience of playing against a perfect player). For most people just interested in playing chess, I suspect it would get old quickly.
Now ... if someone followed up with a tool that could explain insightfully why any given move (or series) the bot played is the best, or showed when two or more moves are equally optimal and why, that would be really interesting.
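The "precomputed solution" idea can be made concrete with a game small enough to actually solve. Tic-tac-toe stands in for chess here: after one memoized traversal, every position is answered by a cache hit, which is exactly the "no per-move reasoning" experience described above.

```python
from functools import lru_cache

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def best_move(board, player):
    """Return (score, move) for `player` on `board`: 1 = win, 0 = draw, -1 = loss.

    The cache IS the "precomputed solution": after the first traversal,
    every position is a dictionary hit rather than a fresh search."""
    w = winner(board)
    if w is not None:
        return (1 if w == player else -1), None
    if "." not in board:
        return 0, None          # full board, no winner: draw
    other = "O" if player == "X" else "X"
    best = (-2, None)
    for i, cell in enumerate(board):
        if cell == ".":
            child = board[:i] + player + board[i + 1:]
            opp_score, _ = best_move(child, other)
            if -opp_score > best[0]:
                best = (-opp_score, i)
    return best

score, move = best_move("." * 9, "X")   # solve the whole game from the start
```

Perfect play from the empty board is a draw (score 0), and once the cache is warm the "player" never thinks again: every query, no matter how clever your play, is a lookup.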