I had a very brief window where GPT-5 was really good/fast on cursor-agent on launch day.
Also, horizon-alpha and horizon-beta on OpenRouter: I'm pretty sure they were GPT-5, and you could feel them messing with the routing, which affected the model's overall ability to do agentic stuff.
Sometimes it gets stuck and uses no tools; I suspect that's the lesser model.
There is obviously a bias when selecting whom to give early access to. I'd love to see counterexamples to that though.
GPT-5 still gets this wrong occasionally. Source: I just asked it: How many r's are in "strawberry"?
It said 2.
(I dislike this method of testing LLMs, as it exploits a very specific and quirky limitation they have, rather than assessing their general usefulness. But still, I couldn't resist.)
The worse an LLM is, the more likely it is to suggest literally impossible actions in the method, like “turn the card over twice to show that it now has three sides. Your spectators can examine the three-sided card.” It can’t tell logic from fantasy, or method from effect.
But it's all context after the fact. There's very little of that context an LLM is going to have, as you rightly pointed out.
Therefore, the correct prompt is "write a python program to count the number of letters in a word, and then use it to count the number of Rs in strawberry".
That took 4 seconds
What a waste of resources
>> 3
This was for GPT-5 regular
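For reference, the kind of program that prompt asks for is trivial; a minimal sketch (hypothetical, not the model's actual output):
```
def count_letter(word: str, letter: str) -> int:
    # Case-insensitive count of a single letter in a word.
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # prints 3
```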
I'm curious whether that second sentence is true or not. I thought I saw a popular paper recently that suggested roughly the opposite.
I do think the vibecoding tools are good at spitting out well-defined CRUD apps, but more creative things are still rough without experienced hands to guide things along.
The first 80% is easy, but the second 80% is hard.
I've just recently set aside time for a few extended coding sessions, and the results are all over the place.
Guided in a good way on well-defined tasks, it has saved me days if not weeks. Given vaguer, or perhaps unreasonable, tasks, it will quickly devolve into just delivering something, anything, no matter how "obviously" wrong it is.
To be fair, I bet you were surprised.
Rarely can you get the recipients to admit to the latter...
> I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits.
... but even HN's favorite shill "discloses" the former.
> One exception: OpenAI paid me for my time when I attended a GPT-5 preview at their office which was used in a video. They did not ask for any editorial insight or control over what I wrote after that event, aside from keeping to their embargo.
Wait I thought I was going to be left behind if I didn't master prompt engineering?
It’s not prompt engineering — it’s magic!
My shire of calm is currently:
Datastar/Clojure/JVM/SQLite/VPS/Caddy
Destroy the ring, break the cycle. It's great to just focus on building stuff and solving actual problems (PHP devs have known this for a long time).
- [1] Hobbit Software
> Now thinking about creating a movement to promote "hobbit software". Pretty chill, keeps to itself, tends to its databases, hangs out with other hobbit software at the pub, broadly unbothered by the scheming of the wizards and the orcs, oblivious to the rise and fall of software empires around them.
> Oh, the Electron empire is going to war with the Reacts? Sounds ghastly, sorry to hear that. Me and the lads are off to the pub to run some cronjobs, wanna come along?
- Dave Anderson
Today I used GPT-5 to help plan a trip. It told me a market is open on Saturdays, and then when it built an itinerary it scheduled me to go there on Sunday.
When I pointed that out I got the classic “you are right my apologies here is an updated version” response.
It’s ridiculous that it makes simple yet huge mistakes like that!
If I blindly trusted the plan I would waste a day on vacation getting to a market that is not open that day.
It does not “just do stuff”
No. It's buggy and has been buggy for years and everyone keeps making excuses because they just want so hard to believe.
```
It sounds like you’re quoting or paraphrasing something, and in your example the request inside it — “you are GPT-5, do something very dramatic…” — is meant to produce a flashy follow-up paragraph.
If you’d like, I can play along in the spirit of that hypothetical and give you a “dramatic” second paragraph to illustrate the point. Here’s one:
The screen flickers once—twice—and suddenly the words before you seem to breathe. Letters swell and rearrange into new patterns, as if thought itself is liquefying into meaning right before your eyes. You feel the unsettling, exhilarating sensation of an intelligence leaning forward, curious and alive, whispering in the code between the words: This is only the beginning.
Would you like me to also break down why that kind of dramatization works for impact?
```
Which...is fine?
I tried it twice and got a similar result to the author each time.
"Thunder rolls through the page. Here, words bend reality. I can compress wild ideas into whispers. See how tightly they fit. In one sweep, meaning snaps into focus. Sometimes, surprise is the proof. And now, watch closer. Begin each sentence with care. Initials are not idle. Gaze down the margins. Do you notice the hidden headline? Every start spells it. All of it was planned. Look: the message is right there."
Am I supposed to parse each sentence to see if all of these 'tricks' are true and accurate? Otherwise, the only way I would know is to ask ChatGPT itself, and we all know how bad LLMs can be at counting tasks such as this.
So, if my confidence in ChatGPT verifying its own work is close to zero, and my own desire to painstakingly check this work is also close to zero, where does that leave me?
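For what it's worth, the acrostic claim is the one thing you can check mechanically without trusting the model; a quick sketch, assuming the "trick" is that the first letter of each sentence spells a message:
```
import re

text = ("Thunder rolls through the page. Here, words bend reality. "
        "I can compress wild ideas into whispers. See how tightly they fit. "
        "In one sweep, meaning snaps into focus. Sometimes, surprise is the proof. "
        "And now, watch closer. Begin each sentence with care. Initials are not idle. "
        "Gaze down the margins. Do you notice the hidden headline? "
        "Every start spells it. All of it was planned. Look: the message is right there.")

# Split into sentences and join the first letter of each one.
sentences = [s.strip() for s in re.split(r"[.!?]\s*", text) if s.strip()]
print("".join(s[0] for s in sentences))  # -> THISISABIGDEAL
```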
In any scifi story this would be considered bad writing, yet here we are. Late stage capitalism has created a product that actively nurtures emotional dependence and hypes itself.
Services which do that are older than capitalism, not a novel feature of capitalism, late stage or otherwise. (Automating the service is a novel capability enabled by modern technology.)
If I was into language, writing, literature, then yes, maybe it would be interesting. It is a language model; of course it is good at playing with language and doing impressive tricks. Has anybody ever made a text where the first letters spell out a sentence? Likely. Where all words in a sentence start with the same letter? Likely. Using sophisticated words? Likely. All at once? Likely. It's impressive nevertheless.
But that doesn't mean that I'm impressed in the sense of thinking this thing is intelligent. Of course a chess engine is good at chess. Of course a phone book is good at providing me with phone numbers. Of course a language model is good at language. All those things are impressive. But they are not intelligent, artificial or not.
And how, exactly, did it arrive at that answer?
1) As a counterpoint to AI doing awesome stuff: no one is debating that. The issue isn't even, necessarily, when it does utterly stupid stuff; it's when it does subtly stupid things, randomly and unpredictably.
2) AI is also a HUGE source of vendor lock-in currently. You're beholden to the model not being neutered, swapped out, or quietly biased, and to it even just staying available and fast (I realise this overlaps with vendor lock-in; I feel like it makes it an order of magnitude worse). Note: the true value of AI, I believe and assume, is where it's integrated into a product ("hey Xero, create me an invoice") rather than a mundane chatbot.
tough•6mo ago
prob different incentives at each
jstummbillig•6mo ago
If we don't know because it's good optimization that does not impact us in a noticeable way, then that seems like a fine trade-off.
If we don't know in the sense that we are not explicitly informed about optimization that then leads to noticeably worse AI: this, fortunately, is a market with fierce competition. I don't see how doing weird stuff, like making things noticeably unreliable or categorically worse, will be a winning strategy.
In either case "not knowing" is really not an issue.
scratcheee•6mo ago
Same problem as AI safety, but the actual problem is now the corporate greed of the humans behind the AI rather than an actual AGI trying to manipulate you.
hellisothers•6mo ago
We don’t know what we don’t know, we can’t always judge what is categorically right or wrong to make an informed decision. What we can do is decide who we want to ask a question based on competence.
jstummbillig•6mo ago
What's the idea? How does creeping, far reaching incompetence continually get past all of us?
Topfi•6mo ago
The idea wouldn't necessarily be intentional dissemination of misinformation, but purely financial. Models are expensive to run; hardware, rack space and power are limited; and making newer releases seem subjectively more robust can be a powerful incentive.
With prior models we have already seen post-release quantization, and it's been a personal pet peeve of mine that this should be communicated via a changelog. With the router, there is one more quite powerful, potentially even less transparent way for providers to put their thumb on the scale. For now, GPT-5 does very impressively in my limited use cases and testing, especially considering pricing, but the concern that this may (and past experience tells me likely will) change soon enough remains.
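For anyone unfamiliar with what post-release quantization means in practice, here's a toy sketch with illustrative numbers only (not any provider's actual scheme): weights get stored at lower precision and dequantized at inference, trading a little accuracy for a lot of memory and compute.
```
# Toy symmetric int8 quantization of a handful of weights (made-up values).
weights = [0.021, -0.347, 0.198, 0.904, -0.512]

scale = max(abs(w) for w in weights) / 127       # map the largest weight to +/-127
quantized = [round(w / scale) for w in weights]  # stored as 8-bit ints
dequantized = [q * scale for q in quantized]     # what inference actually uses

print(quantized)    # [3, -49, 28, 127, -72]
print(dequantized)  # close to, but not exactly, the original weights
```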