> The specifics are unknown, but they might...
Hold up.
> but some assume that they do it this way.
Come on now.
https://x.com/michpokrass/status/1869102222598152627
It says:
> hey aidan, not a miscommunication, they are different products! o1 pro is a different implementation and not just o1 with high reasoning.
Sounds like it is just o3 with higher thinking budget to me
So far I've only used o3-pro a bit today, and it's a bit too heavy to use interactively (fire it off, revisit in 10-15 minutes), but it seems to generate much cleaner/more well organized code and answers.
I feel like the benchmarks aren't really doing a good job at capturing/reflecting capabilities atm. eg, while Claude 4 Sonnet appears to score about as well as Opus 4, in my usage Opus is always significantly better at solving my problem/writing the code I need.
Besides especially complex/gnarly problems, I feel like a lot of the different models are all good enough and it comes down to reliability. For example, I've stopped using Claude for work, basically because multiple times now it's completely eaten my prompts and even artifacts it's generated. Also, it hits limits ridiculously fast (and usage counts toward them even on network/resource failures).
I use 4.1 as my workhorse for code interpreter work (creating graphs/charts w/ matplotlib, basic df stuff, converting tables to markdown) as it's just better integrated than the others and so far I haven't caught 4.1 transposing/having errors with numbers (which I've noticed w/ 4o and Sonnet).
Having tested most of the leading edge open and closed models a fair amount, 4.5 is still my current preferred model to actually talk to/make judgement calls (particularly with translations). Again, not reflected in benchmarks, but 4.5 is the only model that gives me the feeling I had when first talking to Opus 3 (eg, of actual fluid intelligence, and a pleasant personality that isn't overly sycophantic) - Opus 4 is a huge regression in that respect for me.
(I also use Codex, Roo Code, Windsurf, and a few other API-based tools, but tbh, OpenAI's ChatGPT UI is generally better for how I leverage the models in my workflow.)
For the past year or two, I've had my own personal 25 question vibe-check I've used on new models to kick the tires, but I think the future is something both a little more rigorous and a little more automated (something like LLM Jury w/ an UltraFeedback criteria based off of your own real world exchanges and then BTL ranked)? A future project...
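For what it's worth, the BTL (Bradley-Terry) aggregation step is simple enough to sketch. A minimal example in Python, assuming a judge model has already produced pairwise preferences over your own real-world exchanges; the models and win counts below are made up for illustration:

```
import numpy as np

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] = number of times model i was preferred over model j
    by the judge. Returns strengths normalized to sum to 1.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T          # total comparisons per pair
    p = np.ones(n)                 # initial strengths
    for _ in range(iters):         # standard MM / Zermelo iteration
        new_p = np.empty(n)
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            new_p[i] = wins[i].sum() / denom if denom else p[i]
        p = new_p / new_p.sum()
    return p

# Hypothetical judge verdicts over the same set of prompts;
# rows/cols = [model_a, model_b, model_c]
wins = [[0, 7, 9],
        [3, 0, 6],
        [1, 4, 0]]
print(bradley_terry(wins))  # higher = preferred more often
```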
At this point I don't need smarter general models for my work, I need models that don't hallucinate, that are faster/cheaper, and that have better taste in specific domains. I think that's where we're going to see improvements moving forward.
Also, does anybody know what limits o3-pro has under the team plan? I don't see it available in the model picker at all (on team).
OpenAI dropped the price of o3 by 80%
sama's highlight[0]:
> "The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future."
I kept nudging the team to go the whole way to just let o3 be their CEO but they didn't bite yet haha
Dangerous incentives IMO.
we are definitely not seeking to be openai sycophants, nor would they want us to be.
The technology needs to diffuse through and find its equilibrium within the market
You could say 3.5/3.7 Sonnet was good enough to replace some juniors, but the juniors didn't get replaced immediately - there's a time lag before it ripples through.
This announcement adds o3-pro, which pairs with o3 in the same way the o4 models go together.
It should be called o3-high, but to align with the $200 pro membership it’s called pro instead.
That said, o3 is already an incredibly powerful model. I prefer it over the new Anthropic 4 models and Gemini 2.5. Its raw power seems similar to those others, but it's so good at inline tool use it usually comes out ahead overall.
Any non-trivial code generation/editing should be using an advanced reasoning model, or else you’re losing time fixing more glitches or missing out on better quality solutions.
Of course the caveat is cost, but there’s value on the frontier.
o4-mini-high is the label on chatgpt.com for what in the API is called o4-mini with reasoning={"effort": "high"}. Whereas o4-mini on chatgpt.com is the same thing as reasoning={"effort": "medium"} in the API.
o3 can also be run via the API with reasoning={"effort": "high"}.
o3-pro is different than o3 with high reasoning. It has a separate endpoint, and it runs for much longer.
See https://platform.openai.com/docs/guides/reasoning?api-mode=r...
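A minimal sketch of what that looks like through the OpenAI Python SDK's Responses API (the prompt is just a placeholder):

```
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# o3 with effort pinned to "high"; o4-mini with the same setting is what
# chatgpt.com labels "o4-mini-high" (medium effort is the default).
resp = client.responses.create(
    model="o3",
    reasoning={"effort": "high"},
    input="Summarize the trade-offs between reasoning effort levels in two sentences.",
)
print(resp.output_text)
```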
Of course by now it'll be in-distribution. Time for a new benchmark...
E.g., the pelicans all look pretty cruddy including this one, but the fact that they are being delivered in .SVG is a bigger deal than the quality of the artwork itself, IMHO. This isn't a diffusion model, it's an autoregressive transformer imitating one. The wonder isn't that it's done badly, it's that it's happening at all.
The point is never the pelican. The point is that if a thing has information about pelicans, and has information about bicycles, then why can't it combine those ideas? Is it because it's not intelligent?
We now need to start using walruses riding rickshaws
And in ChatGPT Pro.
There are quite a few on Google Image search.
On the other hand they still seem to struggle!
https://road.cc/content/blog/90885-science-cycology-can-you-...
ChatGPT seems to perform better than most, but with notable missing elements (where's the chain or the handlebars?). I'm not sure if those are due to a lack of understanding, or artistic liberties taken by the model?
I think you meant to say:
And nobody knows how they work.
From the evidence we have so far, it does not look like there's any natural monopoly (or even natural oligopoly) in AI companies. Just the opposite. Especially with open weight models, or even more so complete open source models.
What I mean by that is: if a neuron implements a sigmoid function and its input weights are 10, 1, 2, 3, then once the first input is active, evaluating the other ones is mathematically pointless, since it doesn't change the result. Recursively, that means the upstream neurons feeding those skipped inputs don't need to be evaluated either.
I have no idea how feasible or practical it is to implement such an optimization at full network scale, but I think it's interesting to think about.
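For fun, a toy sketch of that idea in Python (my own illustration, not from any real framework), assuming inputs are bounded in [0, 1] and can be fetched lazily:

```
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lazy_neuron(weights, fetch_input, saturation=3.0):
    """Evaluate a sigmoid neuron, skipping remaining inputs once the partial
    sum is so far into the saturated region that the rest can't matter much.

    fetch_input(i) stands in for the (possibly expensive, recursive)
    evaluation of upstream neuron i; inputs are assumed to lie in [0, 1].
    """
    partial = 0.0
    remaining = sum(abs(w) for w in weights)  # max leftover influence
    for i, w in enumerate(weights):
        remaining -= abs(w)
        partial += w * fetch_input(i)
        # Whatever the unread inputs turn out to be, the pre-activation stays
        # beyond +/- saturation, so the output is already pinned near 0 or 1.
        if partial - remaining > saturation or partial + remaining < -saturation:
            break
    return sigmoid(partial)

# With weights [10, 1, 2, 3] and only the first input active, the loop stops
# after one fetch: 10 - (1 + 2 + 3) = 4 > 3, so no values of the other inputs
# can pull the output out of the saturated region.
print(lazy_neuron([10, 1, 2, 3], lambda i: 1.0 if i == 0 else 0.0))
```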
It would be interesting if there was a model that was specifically trained on task-oriented data. It's my understanding they're trained on all data available, but I wonder if it can be fine-tuned or given some kind of reinforcement learning on breaking down general tasks to specific implementations. Essentially an agent-specific model.
Nothing fancy. Visual Studio Code + Copilot, agent mode, a couple prompt files, and that's it.
Performance can improve linearly and utility can be massively jumpy. For some people/tasks performance can have improved but it'll have been "interesting but pointless" until it hits some threshold and then suddenly you can do things with it.
You could use Cursor, Windsurf, Q CLI, Claude Code, whatever else with Claude 3 or even an older model and you'd still get usable results.
It's not the models which have enabled "vibe coding", it's the tools.
An additional proof of that is that the new models focus more and more on coding in their releases, and other fields have not benefited at all from the supposed model improvements. That wouldn't be the case if improvements were really due to the models and not the tooling.
I have been using 'aider' as my go to coding tool for over a year. It basically works the same way that it always has: you specify all the context and give it a request and that goes to the model without much massaging.
I can see a massive improvement in results with each new model that arrives. I can do so much more with Gemini 2.5 or Claude 4 than I could do with earlier models and the tool has not really changed at all.
I will agree that for the casual user, the tools make a big difference. But if you took the tool of today and paired it with a model from last year, it would go in circles
There was always going to be diminishing returns in these benchmarks. It's by construction. It's mathematically impossible for that not to happen. But it doesn't mean the models are getting better at a slower pace.
Benchmark space is just a proxy for what we care about, but don't confuse it for the actual destination.
If you want, you can choose to look at a different set of benchmarks like ARC-AGI-2 or Epoch and observe greater than linear improvements, and forget that these easier benchmarks exist.
"ARC-AGI-1: * Low: 44%, $1.64/task * Medium: 57%, $3.18/task * High: 59%, $4.16/task
ARC-AGI-2: * All reasoning efforts: <5%, $4-7/task
Takeaways: * o3-pro in line with o3 performance * o3's new price sets the ARC-AGI-1 Frontier"
Given the models don't even see the versions we get to see, it doesn't surprise me they have issues with these. It's not hard to make benchmarks so hard that neither humans nor LLMs can do them.
LLMs have obviously reached the point where they are smarter than almost every person alive, better at maths, physics, biology, English, foreign languages, etc.
But because they can’t solve this honestly weird visual/spatial reasoning test they aren’t intelligent?
That must mean most humans on this planet aren’t generally intelligent too.
I don't think memorizing stuff is the same as being smart. https://en.wikipedia.org/wiki/Chinese_room
> But because they can’t solve this honestly weird visual/spatial reasoning test they aren’t intelligent?
Yes. Being intelligent is about recognizing patterns, and that's what ARC-AGI tests. It tests the ability to learn. A lot of people are not very smart.
I agree. The problem I have with the Chinese Room thought experiment is: just as the human who mechanically reads books to answer questions they don't understand does not themselves know Chinese, likewise no neuron in the human brain knows how the brain works.
The intelligence, such as it is, is found in the process that generated the structure — of the translation books in the Chinese room, of the connectome in our brains, and of the weights in an LLM.
What comes out of that process is an artefact of intelligence, and that artefact can translate Chinese or whatever.
Because all current AI take a huge number of examples to learn anything, I think it's fair to say they're not particularly intelligent — but likewise, they can to an extent make up for being stupid by being stupid very very quickly.
But: this definition of intelligence doesn't really fit "can solve novel puzzles", as there's a lot of room for getting good at that by memorising lots of things that puzzle-creators tend to do.
And any mind (biological or synthetic) must learn patterns before getting started: the problem of induction* is that no finite number of examples is ever guaranteed to be sufficient to predict the next item in a sequence, there is always an infinite set of other possible solutions in general (though in reality bounded by 2^n, where n = the number of bits required to express the universe in any given state).
I suspect, but cannot prove, that biological intelligence learns from fewer examples for a related reason, that our brains have been given a bias by evolution towards certain priors from which "common sense" answers tend to follow. And "common sense" is often wrong, c.f. Aristotelian physics (never mind Newtonian) instead of QM/GR.
I love how the bar for are LLMs smart just goes up every few months.
In a year it will be, well, LLMs didn't create totally breakthrough new Quantum Physics, it's still not as smart as us... lol
I agree things are looking up for LLMs, but the semantics do matter here. In my experience LLMs are still pretty bad at solving novel problems(like arc agi 2) which is why I do not believe they have much intelligence. They seem to have started doing it a little, but are still mostly regurgitating.
Certainly those non-representative humans are much better than current models, but they're also far from scoring 100%.
Also mammals? What mammals could even understand we were giving it a test?
Have you seen them or shown them to average people? I’m sure the people who write them understand them but if you show these problems to average people in the street they are completely clueless.
This is a classic case of some phd ai guys making a benchmark and not really considering what average people are capable of.
Look, these insanely capable ai systems can’t do these problems but the boys in the lab can do them, what a good benchmark.
---
> Have you seen them or shown them to average people? I’m sure the people who write them understand them but if you show these problems to average people in the street they are completely clueless.
I can show them to people on my family, I'll do it today and come back with the answer, it's the best way of testing that out.
Are you thinking of a different set? Arc-agi-2 has average 60% success for a single person and questions require only 2 out of 9 correct answers to be accepted. https://docs.google.com/presentation/d/1hQrGh5YI6MK3PalQYSQs...
> and even some other mammals to do.
No, that's not the case.
Either way, there's something fishy about this presentation, it says: "ARC-AGI-1 WAS EASILY BRUTE-FORCIBLE", but when o3 initially "solved" most of it, the co-founder of ARC Prize said: "Despite the significant cost per task, these numbers aren't just the result of applying brute force compute to the benchmark. OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain." He was saying confidently that it was not a result of brute-forcing the problems. And it was not the first time: "ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training."
Now they are saying ARC-AGI-2 is not brute-forcible, so what is happening there? They didn't provide any reasoning for why one was brute-forcible and the other not, nor for how they are so sure about that. They had "recognized" before that it could be brute-forced, but only in a much weaker sense, by stating it would need "unlimited resources and time" to solve. And now they are using non-brute-forceability in this presentation as a selling point.
---
Also, I mentioned mammals because those problems are of a kind that mammals and even other animals would need to solve in reality in a variety of cases. I'm not saying that they would literally be able to take the test and solve it, nor understand that it is a test, but that they would need to solve problems of a similar nature in reality. Naturally this point has its own limits, but it's not as easily discarded as you tried to do.
You told someone that their reasoning is so bad they should get checked by a doctor. Because they didn't find the test easy, even though it averages 60% score per person. You've been a dick to them while significantly misrepresenting the numbers - just stop digging.
So, at the extreme: if he was not able to understand them at all, and this was not just a problem of grasping the format, my point was that this could possibly indicate a neurological or developmental problem, given the nature of the tests. It's not a question of "you need to get all of them right"; his point was that he was unable to understand them at all, that they confused him at the level of basic comprehension.
I don't know about that. I think it's mainly because nowadays LLMs can output very inconsistent results. In some applications they can generate surprisingly good code, but during the same session they can also do missteps and shit the bed while following a prompt to make small changes. For example, sometimes I still get prompt responses that outright delete critical code. I'm talking about things like asking "extract this section of your helper method into a new method" and in response the LLM deletes the app's main function. This doesn't happen all the time, or even in the same session for the same command. How does one verify these things?
Even though it's a large 10% increase first and then only a 0.999% increase.
From 90% to 99% is a 10x reduction in error rate, but 99% to 99.999% is a 1000x decrease in error rates.
90% -> 1 error per 10
99% -> 1 error per 100
99.99% -> 1 error per 10,000
That can help to see the growth in accuracy, when the numbers start getting small (and why clocks are framed as 1 second lost per…).
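A quick sanity check of those factors in code (treating accuracy as 1 minus the error rate):

```
# improvement factor = old error rate / new error rate
for old, new in [(0.90, 0.99), (0.99, 0.99999)]:
    factor = (1 - old) / (1 - new)
    print(f"{old * 100:g}% -> {new * 100:g}%: errors shrink {factor:,.0f}x")
```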
I guess it's the same problem with the mind not intuitively grasping the concept of exponential growth and how fast it grows.
Humans struggle with understanding exponential growth due to a cognitive bias known as *Exponential Growth Bias (EGB)*—the tendency to underestimate how quickly quantities grow over time. Studies like Wagenaar & Timmers (1979) and Stango & Zinman (2009) show that even educated individuals often misjudge scenarios involving doubling, such as compound interest or viral spread. This is because our brains are wired to think linearly, not exponentially, a mismatch rooted in evolutionary pressures where linear approximations were sufficient for survival.
Further research by Tversky & Kahneman (1974) explains that people rely on mental shortcuts (heuristics) when dealing with complex concepts. These heuristics simplify thinking but often lead to systematic errors, especially with probabilistic or nonlinear processes. As a result, exponential trends—such as pandemics, technological growth, or financial compounding—often catch people by surprise, even when the math is straightforward.
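A two-line illustration of the bias with compound growth (numbers purely illustrative):

```
rate, days = 0.03, 100
print(f"3% per day for 100 days: {(1 + rate) ** days:.1f}x")  # ~19.2x; linear intuition says 4x
```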
Imo we got to the current state by harnessing GPUs for a 10-20x boost over CPUs. Well, and cloud parallelization, which is ?100x?
ASIC is probably another 10x.
But the training data may need to vastly expand, and that data isn't going to 10x. It's probably going to degrade.
This kind of expectations explains why there hasn't been a GPT-5 so far, and why we get a dumb numbering scheme instead for no reason.
At least Claude eventually decided not to care anymore and release Claude 4 even if the jump from 3.7 isn't particularly spectacular. We're well into the diminishing returns at this point, so it doesn't really make sense to postpone the major version bump, it's not like they're going to make a big leap again anytime soon.
Scaling laws, by definition have always had diminishing returns because it's a power law relationship with compute/params/data, but I am assuming you mean diminishing beyond what the scaling laws predict.
Unless you know the scale of e.g. o3-pro vs GPT-4, you can't definitively say that.
Because of that power law relationship, it requires adding a lot of compute/params/data to see a big jump, rule of thumb is you have to 10x your model size to see a jump in capabilities. I think OpenAI has stuck with the trend of using major numbers to denote when they more than 10x the training scale of the previous model.
* GPT-1 was 117M parameters.
* GPT-2 was 1.5B params (~10x).
* GPT-3 was 175B params (~100x GPT-2 and exactly 10x Turing-NLG, the biggest previous model).
After that it becomes more blurry as we switched to MoEs (and stopped publishing); scaling laws for parameters apply to monolithic models, not really to MoEs.
But looking at compute we know GPT-3 was trained on ~10k V100, while GPT-4 was trained on a ~25k A100 cluster, I don't know about training time, but we are looking at close to 10x compute.
So to train a GPT-5-like model, we would expect ~250k A100, or ~150k B200 chips, assuming same training time. No one has a cluster of that size yet, but all the big players are currently building it.
So OpenAI might just be reserving GPT-5 name for this 10x-GPT-4 model.
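To make the diminishing-returns shape concrete, here's a tiny illustration using roughly the Chinchilla-style parameter term (the constants are approximate and used only for the shape of the curve, not as a claim about any specific GPT):

```
# L(N) ~= E + A / N**alpha; constants roughly from the Chinchilla fit,
# used here only to show why each visible jump needs ~10x the parameters.
E, A, alpha = 1.69, 406.4, 0.34

def loss(n_params):
    return E + A / n_params ** alpha

for label, n in [("GPT-1-ish", 117e6), ("GPT-2-ish", 1.5e9),
                 ("GPT-3-ish", 175e9), ("10x GPT-3", 1.75e12)]:
    print(f"{label:>10}: {n:9.2e} params -> predicted loss {loss(n):.2f}")
```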
You're assuming wrong; in fact, focusing on scaling laws underestimates the rate of progress, as there is also a steady stream of algorithmic improvements.
But still, even though hardware and software progress, we are facing diminishing returns and that means that there's no reason to believe that we will see another leap as big as GPT-3.5 to GPT-4 in a single release. At least until we stumble upon radically new algorithms that reset the game.
I don't think it makes any economic sense to wait until you have your “10x model” when you can release 2 or 3 incremental models in the meantime, at which point your “10x” becomes an incremental improvement in itself.
Besides, I do think that Google Gemini 2.0 and its massively increased token memory was another "big leap". And that was released earlier this year, so I see no sign of development slowing down yet.
https://www.svgviewer.dev/s/c3j6TEAP
in case anyone is interested
Have completed around a dozen chats with o3-pro so far. Can't say I'm impressed, output feels qualitatively very similar to regular o3.
Tried feeding in loads of context as suggested in the article but generally feels like a miss.
I have dreamed of having powerful AI ever since I read Bertram Raphael's great book Mind Inside Matter around 1978, getting hooked on AI research and sometimes practical applications for my life since then.
I can easily afford $200 for a Pro account but I get this nagging feeling that LLMs are not the final path to the powerful AI I have always dreamed of and I don't want to support this level of hype.
I have lived through a few AI winters and I worry that accountants will tally up the costs, environmental and money, versus the benefits and that we collectively have an 'oh shit' moment.
If we froze LLM technology at present-day capabilities and spent the next 20 years on that, I'd expect it to ultimately look transformative in a similar way to the Internet. I mean if you told me in fall 2022 that 2.5 years later I'd be building software by meta-prompting and meta-meta-prompting AI agents to write code overnight while I slept, I'd assume that we were fictional characters in a Black Mirror episode.
It created the image showing each month but when you looked at each month it was so janky ... February 31st and other huge errors!
I'm not using image creation to create 3D art for fun or art's sake; I'm trying to use it to create utility images to share for discussion with friends & co-workers. The above is just one of many ways it fails when creating utility images!
I shouldn't need to know how to do that as a GPT and/or AI user ... the AI should just do it for the user via their request in the text box. That's the magic of AI to me.
Does anyone know what it did or returned? I had not seen anything, nor have I read anything, about issues here.
DeepSeek isn't bad either (especially given its age now), and Claude is great for coding and tool use but too damn expensive.
PS: Thinking about it... that is a very specific kind of disturbing feeling that only prolific online commenters can experience...
There's a soulless machine someone made that -- out of billions of people on the planet -- specifically knows me by name and at some level understands how I think and see the world.
That's not even its explicit purpose! It and its maker have never met me, interacted with me, or singled me out in any way. Yet... it knows my voice and can copy it on demand.
The machine does not understand you. The machine can match your flavor of textual communication.
This can be done for audio with a relatively small number of samples. Your iPhone has a feature called Personal Voice which claims it can do it with 150 phrases/15 minutes of your time.
---
Here's an algorithmic problem:
You are getting a stream of n unsorted numbers (say over a network socket, but it doesn't matter). You don't know n upfront. We want to find the k largest numbers in that stream.
You can use O(n) time and O(k) space. We are in the comparison model.
The items arrive one by one, if you want to refer to any earlier item, you need to store it yourself (and it counts against your O(k) budget.)
Is this possible? If yes, please give me the algorithm. If not, please sketch a proof showing that it's not possible.
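For reference, one standard way to hit those bounds, sketched in Python with randomized pivots for brevity (a deterministic median-of-medians pivot makes the selection, and hence the whole thing, worst-case linear); an illustration, not necessarily the algorithm the author has in mind:

```
import random

def top_k(buf, k):
    """Return k largest elements of buf via quickselect-style partitioning.
    Expected O(len(buf)) comparisons with random pivots."""
    if len(buf) <= k:
        return list(buf)
    pivot = random.choice(buf)
    greater = [x for x in buf if x > pivot]
    if len(greater) >= k:
        return top_k(greater, k)
    equal = [x for x in buf if x == pivot]
    if len(greater) + len(equal) >= k:
        return greater + equal[:k - len(greater)]
    less = [x for x in buf if x < pivot]
    return greater + equal + top_k(less, k - len(greater) - len(equal))

def k_largest_stream(stream, k):
    """O(k) extra space; O(n) expected time overall.
    Buffer up to 2k items, then prune back to the best k: each prune costs
    O(k) and can happen at most n/k times, so pruning totals O(n)."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) >= 2 * k:
            buf = top_k(buf, k)
    return top_k(buf, k)

print(k_largest_stream(iter([5, 1, 9, 3, 7, 7, 2, 8, 6, 4]), 3))  # 9, 8, 7 in some order
```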
---
The above is indeed solvable in linear time; it's fairly easy for a human to figure out. Another one (and this one is rather hard, took me a few years and I'm writing up a paper):
---
Here's an algorithmic problem:
You are given a sequence of opening and closing parens. Each item in the sequence has a positive weight. We want to find the _heaviest_ balanced subsequence in linear time in the comparison model, or prove that this task is not possible.
I'm ok with randomised algorithms. In that case, I want expected worst-case linear time, where the expectation is taken over the random bits and the worst-case over the inputs.
---
The above task is really solvable in linear time, and even deterministically. But so far no AI model has beaten it. As far as I can tell, it's a new result, despite looking fairly elementary.
Osyris•8mo ago
However, the "plus" plan absolutely could use some trimming.
CamperBob2•8mo ago
Sounds like o3-pro is even slower, which is fine as long as it's better.
o4-mini-high is my usual go-to model if I need something better than the default GPT-4 du jour. I don't see much point in the others and don't understand why they remain available. If o3-pro really is consistently better, it will move o1-pro into that category for me.
moomin•7mo ago
It's short for XBOX.
There's three Xs.
They're all short for XBOX.
simonw•8mo ago
> how about we fix our model naming by this summer and everyone gets a few more months to make fun of us (which we very much deserve) until then?
transcriptase•8mo ago
GPT-4o
o3
o4-mini
o4-mini-high
GPT-4.5
GPT-4.1
GPT-4.1-mini
occamschainsaw•7mo ago
```
Below is one straightforward, user-friendly approach you could adopt. It keeps two dimensions only—generation and tier—and reserves an optional “optimisation” suffix for special-purpose variants (e.g. vision, coding, long-context).
⸻
1. Core conventions
| Element | Purpose | Example values |
|---|---|---|
| Generation | Major architectural release. Keep a whole number; use “.1”, “.2”… for mid-cycle improvements. | 4, 4.1, 4.5 |
| Tier | Rough capability / cost band, easy to interpret. | Lite, Standard, Pro, Ultra |
| Suffix (optional) | Special optimisation or domain specialisation. | -LongCtx, -Vision, -Code |
Why this works
• No ambiguous letters or numerics – “o3” can be read as “03” or “oz”; avoid that entirely.
• Self-explanatory language – non-technical users recognise “Lite” versus “Pro” instantly.
• Scalable – new minor rev? bump the generation (4.2). Need a cheaper size? add a Nano tier without disturbing the rest.
⸻
2. Applying it to your current list
| Current name | Proposed new name | Rationale |
|---|---|---|
| GPT-4o | GPT-4 Standard | Baseline flagship of the 4-series. |
| o3 | GPT-4 Lite | Same generation, lowest tier. |
| o4-mini | GPT-4 Lite+ (or GPT-4 Lite LongCtx if that’s the point) | Indicates “Lite” family but a bit more capable; “+” or a suffix clarifies how. |
| o4-mini-high | GPT-4 Standard LongCtx (or GPT-4 Lite Pro) | Pick one dimension: either it’s still “Lite” but higher context, or it has moved into “Standard”. |
| GPT-4.5 | GPT-4.5 Standard | Mid-cycle architectural upgrade, default tier. |
| GPT-4.1 | GPT-4.1 Standard | Ditto. |
| GPT-4.1-mini | GPT-4.1 Lite | Same generation, smaller/cheaper option. |
⸻
3. Quick style guide for future models
1. Stick to two words (or two words + optional suffix) – GPT-5 Pro, GPT-5 Lite-Vision – still readable at a glance.
2. Reserve extra punctuation for special cases only – hyphens or the “+” symbol should signal meaning, not be decorative.
3. Publish a public matrix – a small table in docs or the dashboard that maps Generation × Tier → context length, cost, latency eliminates guesswork.
⸻
One-line summary
GPT-<Generation> <Tier> [-Specialisation] keeps names short, descriptive and future-proof—so even non-technical users can tell instantly which model suits their needs.
```
aetherspawn•8mo ago
For example the other day they released a supposedly better model with a lower number..
paxys•8mo ago
Those users go to chat.openai.com (or download the app), type text in the box and click send.
AtlasBarfed•8mo ago
Port unix sed from C to Java with a full test suite and all options supported.
Somewhere between "it answers questions of life" and "it beats PhDs at math questions", I'd like to see one LLM take this (IMO rather "pure") language task and succeed.
It is complicated, but it isn't complex. It's string operations with a deep but not that deep expression system and flag set.
It is well-described and documented on the internet, and presumably training sets. It is succinctly described as a problem that virtually all computer coders would understand what it entailed if it were assigned to them. It is drudgerous, showing the opportunity for LLMs to show how they would improve true productivity.
GPT fails to do anything other than the most basic substitute operations. Claude was only slightly better, but to its detriment hallucinated massive amounts and made fake passing test cases that didn't even test the code.
The reaction I get to this test is ambivalence, but IMO if LLMs could help port entire software packages between languages with similar feature sets (aside from Turing Completeness), then software cross-use would explode, and maybe we could port "vulnerable" code to "safe" Rust en masse.
I get it, it's not what they are chasing customer-wise. They want to write (in n-gate terms) webcrap.
nipah•8mo ago
The normal o3 also managed to break 3 isolated installations of Linux I was trying it with a few days ago. The task was very simple: set up Ubuntu with btrfs, timeshift and grub-btrfs, and it managed to fail every single time (even when searching the web), so it was not impressive either.
jiggawatts•7mo ago
.NET Framework 4.x to .NET 10, Python 2 to 3, Java 8 to <current version>, etc...
The advantage the LLMs have here is that staying within the same programming language and its paradigm is dramatically simpler than converting a "procedural" language like C to an object-oriented language like Java that has a wildly different standard library.
resters•8mo ago
I think the naming scheme is just fine and is very straightforward to anyone who pays the slightest bit of attention.