Well, that didn't last long.
I wish OpenAI had invented this but it’s not that uncommon.
claude 3.7 32k thinking tokens (diff) - 64.9%
GPT-4.1 (diff) - 52.9% (stat is from the blog post)
They are reporting that GPT-4.1 gets 55%.
Andrej Karpathy famously quipped that he only trusts two LLM evals: Chatbot Arena (which has humans blindly compare and score responses), and the r/LocalLLaMA comment section.
In practice you have to evaluate the models yourself for any non-trivial task.
This is pretty common across industries. The leader doesn’t compare themselves to the competition.
[1] https://blog.google/technology/google-deepmind/gemini-model-...
[2] https://ai.meta.com/blog/llama-4-multimodal-intelligence/
I don't understand the constant complaining about naming conventions. The number system differentiates the models based on capability; any other method would not do that. After ten models with random names like "gemini" and "nebula" you would have no idea which is which. It's a low-IQ take. You don't name new versions of software as if they were completely different software.
Also, yesterday, using v0, I replicated a full Next.js UI copying a major SaaS player. No backend integration, but the design and UX were stunning, and better than I could do if I tried. I have 15 years of backend experience at FAANG. Software will get automated, and it already is; people just haven't figured it out yet.
Exactly. Those who do frontend or focus on pretty much anything Javascript are, how should I say it? Cooked?
> Software will get automated
The first to go are those that use JavaScript / TypeScript; those engineers have already been automated out of a job. It is all over for them.
In general I use Cursor in manual mode asking it to make very well scoped small changes (e.g. “write this function that does this in this exact spot”). Yesterday I needed to make a largely mechanical change (change a concept in the front end, make updates to the corresponding endpoints, update the data access methods, update the database schema).
This is something very easy I would expect a junior developer to be able to accomplish. It is simple, largely mechanical, but touches a lot of files. Cursor agent mode puked all over itself using Gemini 2.5. It could summarize what changes would need to be made, but it was totally incapable of making the changes. It would add weird hard coded conditions, define new unrelated files, not follow the conventions of the surrounding code at all.
TL;DR: I think LLMs right now are good for greenfield development (create this front end from scratch following common patterns) and small scoped changes to a few files. If you have any kind of medium-sized refactor on an existing codebase, forget about it.
Gemini 2.5 is currently broken with the Cursor agent; it doesn't seem to be able to issue tool calls correctly. I've been using Gemini to write plans, which Claude then executes, and this seems to work well as a workaround. Still unfortunate that it's like this, though.
AI is amazing, now all you need to create a stunning UI is for someone else to make it first so an AI can rip it off. Not beating the "plagiarism machine" allegations here.
https://a16z.com/the-future-of-work-cars-and-the-wisdom-in-s...
Please rank GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.1-nano, GPT-4.1-mini, GPT-4.1, GPT-4.5, o1-mini, o1, o1 pro, o3-mini, o3-mini-high, o3, and o4-mini in terms of capability without consulting any documentation.
Then, some are not available yet: o3 and o4-mini. GPT-4.1 I haven't played with enough to give you my opinion on.
Among the rest, it depends on what you're looking for:
Multi-modal: GPT-4o > everything else
Reasoning: o1-pro > o3-mini-high > o3-mini
Speed: GPT-4o > o3-mini > o3-mini-high > o1-pro
(My personal favorite is o3-mini-high for most things, as it has a good tradeoff between speed and reasoning. Although I use 4o for simpler queries.)
Chronologically:
GPT-4, GPT-4 Turbo, GPT-4o, o1-preview/o1-mini, o1/o3-mini/o3-mini-high/o1-pro, gpt-4.5, gpt-4.1
Model iterations, by training paradigm:
SGD pretraining with RLHF: GPT-4 -> turbo -> 4o
SGD pretraining w/ RL on verifiable tasks to improve reasoning ability: o1-preview/o1-mini -> o1/o3-mini/o3-mini-high (technically the same product with a higher reasoning token budget) -> o3/o4-mini (not yet released)
reasoning model with some sort of Monte Carlo Search algorithm on top of reasoning traces: o1-pro
Some sort of training pipeline that does well with sparser data, but doesn't incorporate reasoning (I'm positing here, training and architecture paradigms are not that clear for this generation): gpt-4.5, gpt-4.1 (likely fine-tuned on 4.5)
By performance: hard to tell! Depends on what your task is, just like with humans. There are plenty of benchmarks. Roughly, for me, the top 3 by task are:
Creative Writing: gpt-4.5 -> gpt-4o
Business Comms: o1-pro -> o1 -> o3-mini
Coding: o1-pro -> o3-mini (high) -> o1 -> o3-mini (low) -> o1-mini-preview
Shooting the shit: gpt-4o -> o1
This isn't to dismiss that their marketing nomenclature is bad, just to point out that it's not that confusing for people who are actively working with these models and have a reasonable memory of the past two years.
- Is Ford Better than Chevy? (Comparison across providers) It depends on what you value, but I guarantee there's tribes that are sure there's only one answer.
- Is the 6th gen 2025 4Runner better than 5th gen 2024 4Runner? (Comparison of same model across new releases) It depends on what you value. It is a clear iteration on the technology, but there will probably be more plastic parts that will annoy you as well.
- Is the 2025 BMW M3 base model better than the 2022 M3 Competition (Comparing across years and trims)? Starts to depend even more on what you value.
Providers need to delineate between releases, and years, models, and trims help do this. There are companies that will try to eschew this and go the Tesla route without model years, but they still can't get away from it entirely. To a certain person, every character in "2025 M3 Competition xDrive Sedan" matters immensely; to another person it's just gibberish.
But a pure ranking isn't the point.
However, it's still not as bad as Intel CPU naming in some generations or USB naming (until very recently). I know, that's a very low bar... :-)
4.0.5.worsethan4point5
Oh man. Unfolding my lawn chair and grabbing a bucket of popcorn for this discussion.
macOS releases would like a word with you.
https://en.wikipedia.org/wiki/MacOS#Timeline_of_releases
Technically they still have numbers, but Apple hides them in marketing copy.
Though they still have “macOS” in the name. I’m being tongue-in-cheek.
To be honest I think this is most AI labs' (particularly the American ones') not-so-secret goal now, for a number of strong reasons. You can see it in this announcement, Anthropic's recent Claude 3.7 announcement, OpenAI's first planned agent (SWE-Agent), etc. They have to justify their worth somehow and they see it as a potential path to do that. It remains to be seen how far they will get - I hope I'm wrong.
The reasons however for picking this path IMO are:
- Their usage statistics show coding as the main use case: Anthropic recently released their stats. It's become the main usage of these models, with other usages at best being novelties or conveniences, in relative terms. Without this market, IMO the hype would have already fizzled a while ago, remaining at best a novelty given the size of the rest of the user base.
- They "smell blood" to disrupt and fear is very effective to promote their product: This IMO is the biggest one. Disrupting software looks to be an achievable goal, but it also is a goal that has high engagement compared to other use cases. No point solving something awesome if people don't care, or only care for awhile (e.g. meme image generation). You can see the developers on this site and elsewhere in fear. Fear is the best marketing tool ever and engagement can last years. It keeps people engaged and wanting to know more; and talking about how "they are cooked" almost to the exclusion of everything else (i.e. focusing on the threat). Nothing motivates you to know a product more than not being able to provide for yourself, your family, etc to the point that most other tech topics/innovations are being drowned out by AI announcements.
- Many of them are losing money and need a market to disrupt: Currently the existing use cases of a chatbot are not yet impressive enough (or haven't been till very recently) to justify the massive valuations of these companies. It's coding that is allowing them to bootstrap into other domains.
- It is a domain they understand: AI dev's know models, they understand the software process. It may be a complex domain requiring constant study, but they know it back to front. This makes it a good first case for disruption where the data, and the know how is already with the teams.
TL;DR: They are coming after you, because it is a big fruit that is easier to pick for them than other domains. It's also one that people will notice either out of excitement (CEOs, VCs, management, etc.) or out of fear (tech workers, academics, other intellectual workers).
But the price is what matters.
A massive transformer-based language model requiring:
- 128 Xeon server-grade CPUs
- 25,000MB RAM minimum (40,000MB recommended)
- 80GB hard disk space for model weights
- Dedicated NVIDIA Quantum Accelerator Cards (minimum 8)
- Enterprise-grade cooling solution
- Dedicated 30-amp power circuit
- Windows NT Advanced Server with Parallel Processing Extensions
~
Features:
- Natural language understanding and generation
- Context window of 8,192 tokens
- Enterprise security compliance module
- Custom prompt engineering interface
- API gateway for third-party applications
*Includes 24/7 on-call Microsoft support team and requires dedicated server room with raised floor cooling
The lack of availability in ChatGPT is disappointing, and they're playing on ambiguity here. They are framing this as if it were unnecessary to release 4.1 on ChatGPT, since 4o is apparently great, while simultaneously showing how much better 4.1 is relative to GPT-4o.
One wager is that the inference cost is significantly higher for 4.1 than for 4o, and that they expect most ChatGPT users not to notice a marginal difference in output quality. API users, however, will notice. Alternatively, 4o might have been aggressively tuned to be conversational while 4.1 is more "neutral"? I wonder.
Versus in the API, where I want to have very strict versioning of the models I'm using, so it lets me run my own evals and pick the model that works best.
Supposedly that’s coming with GPT 5.
They still have a mess of models in ChatGPT for now, and it doesn't look like this is going to get better immediately (even though for GPT-5, they ostensibly want to unify them). You have to choose among all of them anyway.
I'd like to be able to choose 4.1.
Does it, though? They said that "many" have already been incorporated. I simply don't buy their vague statements there. These are different models. They may share some training/post-training recipe improvements, but they are still different.
gpt-4.1
- Input: $2.00
- Cached Input: $0.50
- Output: $8.00
gpt-4.1-mini
- Input: $0.40
- Cached Input: $0.10
- Output: $1.60
gpt-4.1-nano
- Input: $0.10
- Cached Input: $0.025
- Output: $0.40
It's still not as notable as Claude, where cached input is 1/10th the cost of raw input, but it shows OpenAI's making improvements in this area.
I'm not as concerned about nomenclature as other people, who I think are too often reacting to a headline as opposed to the article. But in this case, I'm not sure if I'm supposed to understand nano as categorically different from mini in terms of what it means as a variation from a core model.
gpt-4o-mini for comparison:
- Input: $0.15
- Cached Input $0.075
- Output: $0.60
I was using gpt-4o-mini with the batch API, which I recently replaced with the mistral-small-latest batch API, which costs $0.10/$0.30 (or $0.05/$0.15 when using the batch API). I may change to 4.1-nano, but I'd have to be overwhelmed by its performance in comparison to Mistral.
> Qodo tested GPT‑4.1 head-to-head against other leading models [...] they found that GPT‑4.1 produced the better suggestion in 55% of cases
The linked blog post goes 404: https://www.qodo.ai/blog/benchmarked-gpt-4-1/
I don't understand why the comparison in the announcement talks so much about comparing with 4o's coding abilities to 4.1. Wouldn't the relevant comparison be to o3-mini-high?
4.1 costs a lot more than o3-mini-high, so this seems like a pertinent thing for them to have addressed here. Maybe I am misunderstanding the relationship between the models?
Pricing wise the per token cost of o3-mini is less than 4.1 but keep in mind o3-mini is a reasoning model and you will pay for those tokens too, not just the final output tokens. Also be aware reasoning models can take a long time to return a response... which isn't great if you're trying to use an API for interactive coding.
There are tons of comparisons to o3-mini-high in the linked article.
>Note that GPT‑4.1 will only be available via the API. In ChatGPT, many of the improvements in instruction following, coding, and intelligence have been gradually incorporated into the latest version
If anyone here doesn't know, OpenAI does offer the ChatGPT model version in the API as chatgpt-4o-latest, but it's bad because they continuously update it, so businesses can't rely on it being stable; that's why OpenAI made GPT-4.1.
A version explicitly marked as "latest" being continuously updated? Crazy.
Model          SWE-bench  Aider  Cost   Tok/s  Cutoff
Claude 3.7     70%        65%    $15    77     8/24
Gemini 2.5     64%        69%    $10    200    1/25
GPT-4.1        55%        53%    $8     169    6/24
DeepSeek R1    49%        57%    $2.2   22     7/24
Grok 3 Beta    ?          53%    $15    ?      11/24
I'm not sure this is really an apples-to-apples comparison as it may involve different test scaffolding and levels of "thinking". Tokens per second numbers are from here: https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr... and I'm assuming 4.1 is the speed of 4o given the "latency" graph in the article putting them at the same latency.

Is it available in Cursor yet?
[1] https://twitter.com/cursor_ai/status/1911835651810738406
[2] https://twitter.com/windsurf_ai/status/1911833698825286142
Edit: Now also in Cursor
Looks like they also added the cost of the benchmark run to the leaderboard, which is quite cool. Cost per output token is no longer representative of the actual cost when the number of tokens can vary by an order of magnitude for the same problem just based on how many thinking tokens the model is told to use.
[1] https://aider.chat/docs/more/edit-formats.html#diff-fenced:~...
This benchmark has an authoritative source of results (the leaderboard), so it seems obvious that it's the number that should be used.
Interesting data point about the model's behavior, but even more so it's a recommendation of which way to configure the model for optimal performance.
I do consider this to be an apples-to-apples benchmark since they're evaluating real-world performance.
Based on some DMs with the Gemini team, they weren't aware that aider supports a "diff-fenced" edit format. And that it is specifically tuned to work well with Gemini models. So they didn't think to try it when they ran the aider benchmarks internally.
Beyond that, I spend significant energy tuning aider to work well with top models. That is in fact the entire reason for aider's benchmark suite: to quantitatively measure and improve how well aider works with LLMs.
Aider makes various adjustments to how it prompts and interacts with most every top model, to provide the very best possible AI coding results.
Results, with other models for comparison:
Model Score Cost
Gemini 2.5 Pro Preview 03-25 72.9% $ 6.32
claude-3-7-sonnet-20250219 64.9% $36.83
o3-mini (high) 60.4% $18.16
Grok 3 Beta 53.3% $11.03
* gpt-4.1 52.4% $ 9.86
Grok 3 Mini Beta (high) 49.3% $ 0.73
* gpt-4.1-mini 32.4% $ 1.99
gpt-4o-2024-11-20 18.2% $ 6.74
* gpt-4.1-nano 8.9% $ 0.43
Aider v0.82.0 is also out with support for these new models [1]. Aider wrote 92% of the code in this release, a tie with v0.78.0 from 3 weeks ago.

I get asked this often enough that I have a FAQ entry with automatically updating statistics [0].
Model Tokens Pct
Gemini 2.5 Pro 4,027,983 88.1%
Sonnet 3.7 518,708 11.3%
gpt-4.1-mini 11,775 0.3%
gpt-4.1 10,687 0.2%
[0] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-bui...

Deepseek for general chat and research, Claude 3.7 for coding, Gemini 2.5 Pro experimental for deep research.
In terms of price Deepseek is still absolutely fire!
OpenAI is in trouble honestly.
GPT 4.1 is the first model that has provided a human-quality answer to these questions. It seems to be the first model that can follow plotlines, and character motivations accurately.
I'd say since text processing is a very important use case for LLMs, that's quite noteworthy.
Gemini was drastically cheaper for image/video analysis, I'll have to see how 4.1 mini and nano compare.
(Direct Link) https://raw.githubusercontent.com/KCORES/kcores-llm-arena/re...
I found from my experience with Gemini models that after ~200k the quality drops and it basically doesn't keep track of things. But I don't have any numbers or a systematic study of this behavior.
I think all providers who announce increased max token limit should address that. Because I don't think it is useful to just say that max allowed tokens are 1M when you basically cannot use anything near that in practice.
But I'd love to see one specifically for "meaningful coding." Coding has specific properties that are important such as variable tracking (following coreference chains) described in RULER[1]. This paper also cautions against Single-Needle-In-The-Haystack tests which I think the OpenAI one might be. You really need at least Multi-NIAH for it to tell you anything meaningful, which is what they've done for the Gemini models.
I think something a bit more interpretable like `pass@1 rate for coding turns at 128k` would be so much more useful than "we have 1m context" (with the acknowledgement that good-enough performance is often domain-dependent).
[0] https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...
IMO this is the best long context benchmark. Hopefully they will run it for the new models soon. Needle-in-a-haystack is useless at this point. Llama-4 had perfect needle in a haystack results but horrible real-world-performance.
Novels are usually measured in terms of words; and there's a rule of thumb that four tokens make up about three words. So that 200k token wall you're hitting is right when most authors stop writing. 150k is already considered long for a novel, and to train 1M properly, you'd need not only a 750k book, but many of them. Humans just don't write or read that much text at once.
To get around this, whoever is training these models would need to change their training strategy to either:
- Group books in a series together as a single, very long text to be trained on
- Train on multiple unrelated books at once in the same context window
- Amplify the gradients by the length of the text being trained on so that the fewer long texts that do exist have greater influence on the model weights as a whole.
I suspect they're doing #2, just to get some gradients onto the longer end of the context window, but that also is going to diminish long-context reasoning because there's no reason for the model to develop a connection between, say, token 32 and token 985,234.
How many tokens is a 100-page PDF? 10k to 100k?
For a 100 page book, that translates to around 50,000 tokens. For 1 mil+ tokens, we need to be looking at 2000+ page books. That's pretty rare, even for documentation.
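Quick back-of-the-envelope in Python, using the ~4-tokens-per-3-words rule of thumb from upthread and assuming a dense ~375-word page (both are rough assumptions):

    WORDS_PER_PAGE = 375          # assumption: a dense page of prose
    TOKENS_PER_WORD = 4 / 3       # ~4 tokens per 3 words rule of thumb

    for pages in (100, 2_000):
        tokens = round(pages * WORDS_PER_PAGE * TOKENS_PER_WORD)
        print(f"{pages} pages ~= {tokens:,} tokens")
    # 100 pages ~= 50,000 tokens
    # 2000 pages ~= 1,000,000 tokens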
It doesn't have to be text-based, though. I could see films and TV shows becoming increasingly important for long-context model training.
https://en.wikipedia.org/wiki/List_of_chiropterans
Despite listing all presently known bats, the majority of "list of chiropterans" byte count is code that generates references to the IUCN Red List, not actual text. Most of Wikipedia's longest articles are code.
LLMs process tokens sequentially, first in a prefilling stage, where the model reads your input, then in the generation stage, where it outputs response tokens. The attention mechanism is what allows the LLM, as it is ingesting or producing tokens, to "notice" that a token it has seen previously (your instruction) is related to a token it is now seeing (the code).
Of course this mechanism has limits (correlated with model size), and if the LLM needs to take the whole input into consideration to answer the question, the results wouldn't be too good.
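For intuition, here's the core of that mechanism in a few lines of numpy (a toy single-head sketch, not any production implementation):

    import numpy as np

    def attention(q, k, v):
        # Each query token takes a weighted average of all value vectors,
        # with weights given by a softmax over query-key dot products.
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)                       # (n_query, n_key)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the context
        return weights @ v

    rng = np.random.default_rng(0)
    context = rng.normal(size=(5, 8))     # 5 already-processed tokens, dim 8
    new_token = rng.normal(size=(1, 8))   # the token currently being generated
    print(attention(new_token, context, context).shape)     # (1, 8)

The longer the context, the more tokens are competing inside that one softmax, which is one way to see why recall degrades as the window fills up.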
RoPE (Rotary Positional Embeddings; think modulo or periodic arithmetic) scaling is key, whereby the model is trained on content 16k tokens long and then scaled up to 100k+ [0]. Qwen 1M (which has near-perfect recall over the complete window [1]) and Llama 4 10M pushed the limits of this technique, with Qwen reliably training with a much higher RoPE base, and Llama 4 coming up with iRoPE, which claims scaling to extremely long contexts, up to infinity.
[0]: https://arxiv.org/html/2310.05209v2
[1]: https://qwenlm.github.io/blog/qwen2.5-turbo/#passkey-retriev...
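To make the idea concrete, here's a toy numpy sketch of the rotation itself. The base=500_000 value in the long run below is purely illustrative; real long-context schemes (NTK-aware scaling, YaRN, iRoPE, etc.) adjust the frequencies more carefully than just raising the base:

    import numpy as np

    def rope(x, positions, base=10_000):
        # Rotate each (even, odd) feature pair by an angle that grows with position;
        # low-index pairs rotate quickly, high-index pairs rotate slowly.
        d = x.shape[-1]
        inv_freq = base ** (-np.arange(0, d, 2) / d)       # one frequency per pair
        angles = positions[:, None] * inv_freq[None, :]    # (seq_len, d/2)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = np.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    x = np.random.randn(16_000, 64)                        # pretraining-length content
    short_ctx = rope(x, np.arange(16_000))

    # "Scaling up": same function and weights, longer positions, larger base so the
    # slow frequencies still distinguish far-apart positions.
    long_ctx = rope(np.random.randn(100_000, 64), np.arange(100_000), base=500_000)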
Also, I don't know about Qwen, but I know Llama 4 has severe performance issues, so I wouldn't use that as an example.
Re: Llama 4, please see the sibling comment.
* information from the entire context has to be squeezed into an information channel of a fixed size; the more information you try to squeeze the more noise you get
* selection of what information passes through is done using just dot-product
Training data isn't the problem.

In principle, as you scale a transformer you get more heads and more dimensions in each vector, so the bandwidth of the attention data bus goes up, and thus the precision of recall goes up too.
Updated results from the authors: https://github.com/adobe-research/NoLiMa
It's the best known performer on this benchmark, but still falls off quickly at even relatively modest context lengths (85% perf at 16K). (Cutting edge reasoning models like Gemini 2.5 Pro haven't been evaluated due to their cost and might outperform it.)
> For instance, the NoLiMa benchmark revealed that models like GPT-4o experienced a significant drop from a 99.3% performance rate at 1,000 tokens to 69.7% at 32,000 tokens. Similarly, Llama 3.3 70B's effectiveness decreased from 97.3% at 1,000 tokens to 42.7% at 32,000 tokens, highlighting the challenges LLMs face with longer contexts.
Is there a reliable method for pruning, summarizing, or otherwise compressing context to overcome such issues?
- Coding accuracy improved dramatically
- Handles 1M-token context reliably
- Much stronger instruction following
Which means that these models are _absolutely_ not SOTA, and Gemini 2.5 pro is much better, and Sonnet is better, and even R1 is better.
Sorry Sam, you are losing the game.
Won't the reasoning models of OpenAI benchmarked against these be a test of whether Sam is losing?
With Gemini (current SOTA) and Sonnet (great potential, but tends to overengineer/overdo things) it is debatable, they are probably better than R1 (and all OpenAI models by extension).
It would be incredible to be able to feed an entire codebase into a model and say "add this feature" or "we're having a bug where X is happening, tell me why", but then you are limited by the output token length
As others have pointed out too, the more tokens you use, the less accuracy you get and the more it gets confused, I've noticed this too
We are a ways away yet from being able to input an entire codebase, and have it give you back an updated version of that codebase.
Why not use Gemini?
All the solutions are already available on the internet on which various models are trained, albeit in various ratios.
Any variance could likely be due to the mix of the data.
If you care about understanding relative performance between models for solving known problems and producing correct output format, it's pretty useful.
- Even for well-known problems, we see a large distribution of quality between models (5 to 75% correctness)
- Additionally, we see a large distribution of models' ability to produce responses in the formats they were instructed to use
At the end of the day, benchmarks are pretty fuzzy, but I always welcome a formalized benchmark as a means to understand model performance over vibe checking.
why would they deprecate when it's the better model? too expensive?
Too expensive, but not for them - for their customers. The only reason they'd deprecate it is if it wasn't seeing usage worth keeping it up, and that probably stems from it being insanely more expensive and slower than everything else.
I'm guessing the (API) demand isn't there to saturate them fully
As opposed to Gemini 2.5 Pro having cutoff of Jan 2025.
Honestly this feels underwhelming and surprising. Especially if you're coding with frameworks with breaking changes, this can hurt you.
100% backwards compatibility and well represented in 15 years worth of training data, hah.
(I did use Spring, once, ages ago, and we deployed the app to a local Tomcat server in the office...)
As you are new in the field, it kinda doesn't make sense to pick an older version. It would be better if there was no data than incorrect data. You literally have to include the version number on every prompt and even that doesn't guarantee a right result! Sometimes I have to play truth or dare three times before we finally find the right names and instructions. Yes I have the version info on all custom information dialogs, but it is not as effective as including it in the prompt itself.
Searching the web feels like an on-going "I'm feeling lucky" mode. Anyway, I still happen to get some real insights from GPT4o, even though Gemini 2.5 Pro has proven far superior for larger and more difficult contexts / problems.
The best storytelling ideas have come from GPT 4.5. Looking forward to testing this new 4.1 as well.
Are you doing 3D? The 3D tutorial ecosystem is very GUI-heavy and I have had major problems trying to get Godot to do anything 3D.
I strongly recommend giving Gemini 2.5 Pro a shot. Personally I don't like their bloated UI, but you can set the temperature value, which is especially helpful when you are more certain what and how you want, then just lower that value. If you want to get some wilder ideas, turn it up. Also highly recommend reading the thought process it does! That was actually key in having very complex ideas working. Just spotting couple of lines there, that seem too vague or even just a little bit inaccurate ... then pasting them back, with your own comments, have helped me a ton.
Is there a specific part in which you struggle? And FWIW, I've been on a heavy learning spree for 2 weeks. I feel like I'm starting to see glimpses of the barrel's bottom ... it's not so deep, you just gotta hang in there and bombard different LLMs with different questions, different angles, stripping away most and trying the simplest variation, for both prompt and Godot. Or sometimes by asking more general advice like "what is the current Godot best practice for doing x".
And YouTube has also been helpful source, by listening how more experienced users make their stuff. You can mostly skim through the videos with doublespeed and just focus on how they are doing the basics. Best of luck!
E.g.: If context windows get big and cheap enough (as things are trending), hopefully you can just dump the entire docs, examples, and more in every request.
nice to see that we aren't stuck in october of 2023 anymore!
> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better suggestion in 55% of cases. Notably, they found that GPT‑4.1 excels at both precision (knowing when not to make suggestions) and comprehensiveness (providing thorough analysis when warranted).
55% to 45% definitely isn't a blowout but it is meaningful — in terms of ELO it equates to about a 36 point difference. So not in a different league but definitely a clear edge
Um, isn't that just a fancy way of saying it is slightly better
>Score of 6.81 against 6.66
So very slightly better
55% vs. 45% equates to about a 36-point difference in Elo. In chess, that would be two players in the same league, but with one having a clear edge.
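That figure is just the logistic Elo relation applied to the win rate; a one-liner if anyone wants to plug in other win rates (my calculation rounds to ~35):

    import math

    def elo_gap(win_rate):
        # Elo difference implied by a head-to-head win rate (logistic Elo model).
        return 400 * math.log10(win_rate / (1 - win_rate))

    print(round(elo_gap(0.55)))   # ~35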
They didn't say it is better than Claude at precision etc. Just that it excels.
Unfortunately, AI has still not concluded that manipulations by the marketing dept is a plague...
- for N=100, worst case standard error of the mean is ~5% (it shrinks parabolically the further p gets from 50%)
- multiply by ~2 to go from standard error of the mean to 95% confidence interval
- for other sample sizes, the error scales as 1/sqrt(N)
So:
- N=100: +/- 10%
- N=1000: +/- 3%
- N=10000: +/- 1%
(And if comparing two independent distributions, multiply by sqrt(2). But if they’re measured on the same problems, then instead multiply by between 1 and sqrt(2) to account for them finding the same easy problems easy and hard problems hard - aka positive covariance.)
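A quick sanity check of those heuristics in Python (worst case p = 0.5, normal approximation):

    import math

    def ci95_halfwidth(n, p=0.5):
        # Worst-case 95% confidence half-width for a measured pass rate.
        return 1.96 * math.sqrt(p * (1 - p) / n)

    for n in (100, 1_000, 10_000):
        print(f"N={n}: +/- {100 * ci95_halfwidth(n):.1f}%")
    # N=100: +/- 9.8%
    # N=1000: +/- 3.1%
    # N=10000: +/- 1.0%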
the p-value for GPT-4.1 having a win rate of at least 49% is 4.92%, so we can say conclusively that GPT-4.1 is at least (essentially) evenly matched with Claude Sonnet 3.7, if not better.
Given that Claude Sonnet 3.7 has been generally considered to be the best (non-reasoning) model for coding, and given that GPT-4.1 is substantially cheaper ($2/million input, $8/million output vs. $3/million input, $15/million output), I think it's safe to say that this is significant news, although not a game changer
Specifically, the results from the blog post are impossible: with 200 samples, you can't possibly have the claimed 54.9/45.1 split of binary outcomes. Either they didn't actually make 200 tests but some other number, they didn't actually get the results they reported, or they did some kind of undocumented data munging like excluding all tied results. In any case, the uncertainty about the input data is larger than the uncertainty from the rounding.
[0] In R, binom.test(110, 200, 0.5, alternative="greater")
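For anyone on Python rather than R, the analogous call (this just mirrors the footnote's 50% null; I haven't re-derived the 4.92% figure quoted above, which is stated against a 49% win rate):

    from scipy.stats import binomtest

    # 110 wins out of 200, one-sided test against a 50% null win rate.
    print(binomtest(110, n=200, p=0.5, alternative="greater").pvalue)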
Now you can imagine introducing a newer "type" of model like 4.1, better at following instructions and better at coding, adding to a model-picking overhead that's already too much with the given options.
OpenAI confirmed somewhere that they have already incorporated the enhancements made in 4.1 to 4o model in ChatGPT UI. I assume they would delegate to 4.1 model if the prompt doesn't require specific 4o capabilities.
Also one of the improvements made to 4.1 is following instructions. This type of thing is better suited for agentic use cases that are typically used in the form of an API.
> GPT‑4.5 Preview will be turned off in three months, on July 14, 2025
Tool use ability feels better than gemini-2.5-pro-exp [2], which struggles with JSON schema understanding sometimes.
Llama 4 has surprising agentic capabilities, better than both of them [3], but isn't as intelligent as the others.
[1] https://github.com/rusiaaman/chat.md/blob/main/samples/4.1/t...
[2] https://github.com/rusiaaman/chat.md/blob/main/samples/gemin...
[3] https://github.com/rusiaaman/chat.md/blob/main/samples/llama...
I think it did very well - it's clearly good at instruction following.
Total token cost: 11,758 input, 2,743 output = 4.546 cents.
Same experiment run with GPT-4.1 mini: https://gist.github.com/simonw/325e6e5e63d449cc5394e92b8f2a3... (0.8802 cents)
And GPT-4.1 nano: https://gist.github.com/simonw/1d19f034edf285a788245b7b08734... (0.2018 cents)
[1] https://llm.datasette.io/en/stable/plugins/directory.html#fr...
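For what it's worth, the 4.546-cent figure checks out against the GPT-4.1 list prices posted upthread ($2/M input, $8/M output):

    input_tokens, output_tokens = 11_758, 2_743
    input_price, output_price = 2.00, 8.00    # USD per million tokens for gpt-4.1

    cost = input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price
    print(f"{cost * 100:.3f} cents")          # 4.546 cents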
- telling the model to be persistent (+20%)
- dont self-inject/parse toolcalls (+2%)
- prompted planning (+4%)
- JSON BAD - use XML or arxiv 2406.13121 (GDM format)
- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD
- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
source: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...
If the instructions are at the top, the KV cache entries can be precomputed and cached.
If they’re at the bottom the entries at the lower layers will have a dependency on the user input.
[Long system instructions - 200 tokens]
[Very long document for reference - 5000 tokens]
[User query - 32 tokens]
The key-values for the first 5200 tokens can be cached, and it's efficient to swap out the user query for a different one: you only need to prefill 32 tokens and generate the output.

But the recommendation is to use the following, where you can only cache the first 200 tokens and need to prefill 5264 tokens every time the user submits a new query:
[Long system instructions - 200 tokens]
[User query - 32 tokens]
[Very long document for reference - 5000 tokens]
[Long system instructions - 200 tokens]
[User query - 32 tokens]
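A little sketch of the trade-off (my own illustration, assuming the provider can only reuse the cached prefix up to the first segment that changed between requests):

    SYSTEM, QUERY, DOC = 200, 32, 5000

    # Two consecutive requests that differ only in the user query, for each layout.
    doc_first      = [("system", SYSTEM), ("doc", DOC), ("query_v1", QUERY)]
    doc_first_next = [("system", SYSTEM), ("doc", DOC), ("query_v2", QUERY)]

    sandwich      = [("system", SYSTEM), ("query_v1", QUERY), ("doc", DOC),
                     ("system", SYSTEM), ("query_v1", QUERY)]
    sandwich_next = [("system", SYSTEM), ("query_v2", QUERY), ("doc", DOC),
                     ("system", SYSTEM), ("query_v2", QUERY)]

    def prefill_tokens(prev, new):
        # Only the prefix up to the first changed segment can come from the KV cache;
        # everything after it has to be prefilled again.
        cached = 0
        for (a, a_len), (b, _) in zip(prev, new):
            if a != b:
                break
            cached += a_len
        return sum(t for _, t in new) - cached

    print(prefill_tokens(doc_first, doc_first_next))   # 32
    print(prefill_tokens(sandwich, sandwich_next))     # 5264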
and we'll be publishing our 4.1 pod later today https://www.youtube.com/@latentspacepod
It's just not how I like to work.
As an aside, I was working in the games industry when multi-core was brand new. Maybe Xbox-360 and PS3? I'm hazy on the exact consoles but there was one generation where the major platforms all went multi-core.
No one knew how to best use the multi-core systems for gaming. I attended numerous tech talks by teams that had tried different approaches and were give similar "maybe do this and maybe see x% improvement?". There was a lot of experimentation. It took a few years before things settled and best practices became even somewhat standardized.
Some people found that era frustrating and didn't like to work in that way. Others loved the fact it was a wide open field of study where they could discover things.
Although it took many, many more years until games started to actually use multi-core properly. With rendering being on a 16.67ms / 8.33ms budget and rendering tied to world state, it was just really hard not to tie everything into each other.
Even today you'll usually only see 2-4 cores actually getting significant load.
Meanwhile, nobody can agree on what a "good" LLM in, let alone how to measure it.
I mean that seems wild to say to me. Those architectures have documentation and aren't magic black boxes that we chuck inputs at and hope for the best: we do pretty much that with LLMs.
If that's how you optimise, I'm genuinely shocked.
And while the kind of errors hasn’t changed, the quantity and severity of the errors has dropped dramatically in a relatively short span of time.
In my experience, even simple CRUD apps generally have some domain-specific intricacies or edge cases that take some amount of experimentation to get right.
From my experience, even building on popular platforms, there are many bugs or poorly documented behaviors in core controls or APIs.
And performance issues in particular can be difficult to fix without trial and error.
The advantage is that humans are probabilistic, mercurial and unreliable, and LLMs are a way to bridge the gap between humans and machines that, while not wholly reliable, makes the gap much smaller than it used to be.
If you're not making software that interacts with humans or their fuzzy outputs (text, images, voice etc.), and have the luxury of well defined schema, you're not going to see the advantage side.
But unfortunately for us, clean and logical classical methods suck ass in comparison so we have no other choice but to deal with the uncertainty.
And yet, all function calling and MCP is done through JSON...
What is meant by this?
Challenge accepted.
That said, the exact quote from the linked notebook is "It’s generally not necessary to use all-caps or other incentives like bribes or tips, but developers can experiment with this for extra emphasis if so desired.", but the demo examples OpenAI provides do like using ALL CAPS.
Not all systems upgrade every few months. A major question is when we reach step-improvements in performance warranting a re-eval, redesign of prompts, etc.
There's a small bleeding edge, and a much larger number of followers.
Lies, damn lies and statistics ;-)
First it was that the models stopped putting in effort and felt lazy; tell one to do something and it would tell you to do it yourself. Now it's the opposite, and the models go ham changing everything they see: instead of changing one line, SOTA models would rather rewrite the whole project and still not fix the issue.
Two years back I totally thought these models were amazing. I always would test out the newest models and would get hyped up about them. Every problem I had, I thought if I just prompt it differently I can get it to solve this. Oftentimes I have spent hours prompting, starting new chats, adding more context. Now I realize it's kinda useless and it's better to just accept the models where they are, rather than try to make them a one-stop shop, or try to stretch their capabilities.
I think this release I won't even test it out; I'm not interested anymore. I'll probably just continue using DeepSeek free and Gemini free. I canceled my OpenAI subscription like 6 months ago, and canceled Claude after the 3.7 disappointment.
I probably spend $100 a month on AI coding, and it's great at small straightforward tasks.
Drop it into a larger codebase and it'll get confused. Even if the same tool built it in the first place due to context limits.
Then again, the way things are rapidly improving I suspect I can wait 6 months and they'll have a model that can do what I want.
Similar to the function documentation provides to developers today, I suppose.
This helps the model focus on a subset of the codebase that is relevant to the current task.
(I built it)
But I guess it could be interpreted differently like you said.
I believe this. I've been having the forgetting problem happen less with Gemini 2.5 Pro. It does hallucinate, but I can get far just pasting all the docs and a few examples, and asking it to double check everything according to the docs instead of relying on its memory.
The promise of AI is that I can spend $100 to get 40 hours or so of work done.
• GPT-4.1-mini: balances performance, speed & cost
• GPT-4.1-nano: prioritizes throughput & low cost with streamlined capabilities
All share a 1 million‑token context window (vs 128–200k on 4o/o1/o3), excelling in instruction following, tool calls & coding.
Benchmarks vs prior models:
• AIME ’24: 48.1% vs 13.1% (~3.7× gain)
• MMLU: 90.2% vs 85.7% (+4.5 pp)
• Video‑MME: 72.0% vs 65.3% (+6.7 pp)
• SWE‑bench Verified: 54.6% vs 33.2% (+21.4 pp)
It seems like OpenAI keeps changing its plans. Deprecating GPT-4.5 less than 2 months after introducing it also seems unlikely to be the original plan. Changing plans isn't necessarily a bad thing, but I wonder why.
Did they not expect this model to turn out as well as it did?
[1] https://x.com/sama/status/1889755723078443244
[2] https://github.com/openai/openai-cookbook/blob/6a47d53c967a0...
There doesn't appear to be anything that these AI models cannot do, in principle, given sufficient data and compute. They've figured out multimodality and complex integration, self play for arbitrary domains, and lots of high-cost longer term paradigms that will push capabilities forwards for at least 2 decades in conjunction with Moore's law.
Things are going to continue getting better, faster, and weirder. If someone is making confident predictions beyond those claims, it's probably their job.
Maybe
1. he's just doing his job and hyping OpenAI's competitive advantages (afair most of the competition didn't have decent COT models in Feb), or
2. something changed and they're releasing models now that they didn't intend to release 2 months ago (maybe because a model they did intend to release is not ready and won't be for a while), or
3. COT is not really as advantageous as it was deemed to be 2+ months ago and/or computationally too expensive.
(Not to say that it takes openai years to train a new model, just that the timeline between major GPT releases seems to double... be it for data gathering, training, taking breaks between training generations, ... - either way, model training seems to get harder not easier).
GPT Model | Release Date | Months Since Previous Model
GPT-1 | 11.06.2018
GPT-2 | 14.02.2019 | 8.16
GPT-3 | 28.05.2020 | 15.43
GPT-4 | 14.03.2023 | 33.55
[1]https://www.lesswrong.com/posts/BWMKzBunEhMGfpEgo/when-will-...
I'm talking more broadly, as well, including consideration of audio, video, and image modalities, general robotics models, and the momentum behind applying some of these architectures to novel domains. Protocols like MCP and automation tooling are rapidly improving, with media production and IT work rapidly being automated wherever possible. When you throw in the chemistry and materials science advances, protein modeling, etc - we have enormously powerful AI with insufficient compute and expertise to apply it to everything we might want to. We have research being done on alternate architectures, and optimization being done on transformers that are rapidly reducing the cost/performance ratio. There are models that you can run on phones that would have been considered AGI 10 years ago, and there doesn't seem to be any fundamental principle decreasing the rate of improvement yet. If alternate architectures like RWKV get funded, there might be several orders of magnitude improvement with relatively little disruption to production model behaviors, but other architectures like text diffusion could obsolete a lot of the ecosystem being built up around LLMs right now.
There are a million little considerations pumping transformer LLMs right now because they work and there's every reason to expect them to continue improving in performance and value for at least a decade. There aren't enough researchers and there's not enough compute to saturate the industry.
I love this. Especially the weirder part. This tech can be useful in every crevice of society and we still have no idea what new creative use cases there are.
Who would’ve guessed phones and social media would cause mass protests because bystanders could record and distribute videos of the police?
That would have been quite far down on my list of "major (unexpected) consequences of phones and social media"...
Not necessarily progress or benchmarks that as a broader picture you would look at (MMLU etc)
GPT-3 was an amazing step up from GPT-2, something scientists in the field really thought was 10-15 years out at least, done in 2; instruct/RLHF for GPTs was a similarly massive splash, making the second half of 2021 equally amazing.
However nothing since has really been that left-field or unpredictable from then, and it's been almost 3 years since RLHF hit the field. We knew good image understanding as input, longer context, and improved prompting would improve results. The releases are common, but the progress feels like it has stalled for me.
What really has changed since Davinci-instruct or ChatGPT to you? When making an AI-using product, do you construct it differently? Are agents presently more than APIs talking to databases with private fields?
Image generation suddenly went from gimmick to useful now that prompt adherence is so much better (eagerly waiting for that to be in the API)
Coding performance continues to improve noticeably (for me). Claude 3.7 felt like a big step from 4o/3.5, Gemini 2.5 in a similar way. Compared to just 6 months ago I can give bigger and more complex pieces of work to it and get relatively good output back. (Net acceleration)
Audio-2-audio seems like it will be a big step as well. I think this has much more potential than the STT-LLM-TTS architecture commonly used today (latency, quality)
Well, they actually hinted already at possible deprecation in their initial announcement of GPT-4.5 [0]. Also, as others said, this model was already offered in the API as chatgpt-latest, but there was no checkpoint, which made it unreliable for actual use.
[0] https://openai.com/index/introducing-gpt-4-5/#:~:text=we%E2%...
While their competitors have made fantastic models, at the time I perceived ChatGPT-4 as the best model for many applications. COT was often tricked by my prompts, assuming things to be true, when a non-COT model would say something like 'That isn't necessarily the case'.
I use both COT and non when I have an important problem.
Seeing them keep a non-COT model around is a good idea.
@sama: underrated tweet
Source: https://x.com/stevenheidel/status/1911833398588719274
*Then fix all your prompts over the next two weeks.
https://platform.openai.com/docs/models/gpt-4.1
https://platform.openai.com/docs/models/gpt-4.1-mini
https://platform.openai.com/docs/models/gpt-4.1-nano
I will check again the prompt, maybe 4o-mini ignores some instructions that 4.1 doesn't (instructions which might result in the LLM returning zero data).
1. To win consumer growth they have continued to benefit from hyper-viral moments; lately that was image generation in 4o, which was likely technically possible long before it launched.
2. For enterprise workloads and large API use they seem to have focused less lately, but the pricing of 4.1 is clearly an answer to Gemini, which has been winning on ultra-high volume and consistency.
3. For full frontier benchmarks they pushed out 4.5 to stay SOTA and attract the best researchers.
4. On top of all that, they had to, and did, quickly answer the reasoning promise and the DeepSeek threat with faster and cheaper o-models.
They are still winning many of these battles but history highlights how hard multi front warfare is, at least for teams of humans.
4.1 is 26.6% better at coding than 4.5. Got it. Also…see the em dash
> You're eligible for free daily usage on traffic shared with OpenAI through April 30, 2025.
> Up to 1 million tokens per day across gpt-4.5-preview, gpt-4.1, gpt-4o and o1
> Up to 10 million tokens per day across gpt-4.1-mini, gpt-4.1-nano, gpt-4o-mini, o1-mini and o3-mini
> Usage beyond these limits, as well as usage for other models, will be billed at standard rates. Some limitations apply.
I just found this option in https://platform.openai.com/settings/organization/data-contr...

Is this just something I haven't noticed before? Or is this new?
o1 is $15 in, $60 out.
So you could easily get 75+$ per day free from this.
"o" means "omni", which means its multimodal.
{"error":
{"message":"Quasar and Optimus were stealth models, and
revealed on April 14th as early testing versions of GPT 4.1.
Check it out: https://openrouter.ai/openai/gpt-4.1","code":404}
- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)
- o3-mini (web search, CoT, canvas, but no image generation)
- o1 (CoT, maybe better than o3, but no canvas or web search and also no images)
- Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
- 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
- 4o "with scheduled tasks" (why on earth is that a model and not a tool that the other models can use!?)
Why do I have to figure all of this out myself?
o1-pro: anything important involving accuracy or reasoning. Does the best at accomplishing things correctly in one go even with lots of context.
deepseek R1: anything where I want high quality non-academic prose or poetry. Hands down the best model for these. Also very solid for fast and interesting analytical takes. I love bouncing ideas around with R1 and Grok-3 bc of their fast responses and reasoning. I think R1 is the most creative yet also the best at mimicking prose styles and tone. I've speculated that Grok-3 is R1 with mods and think it's reasonably likely.
4o: image generation, occasionally something else but never for code or analysis. Can't wait till it can generate accurate technical diagrams from text.
o3-mini-high and grok-3: code or analysis that I don't want to wait for o1-pro to complete.
claude 3.7: occasionally for code if the other models are making lots of errors. Sometimes models will anchor to outdated information in spite of being informed of newer information.
gemini models: occasionally I test to see if they are competitive, so far not really, though I sense they are good at certain things. Excited to try 2.5 Deep Research more, as it seems promising.
Perplexity: discontinued subscription once the search functionality in other models improved.
I'm really looking forward to o3-pro. Let's hope it's available soon as there are some things I'm working on that are on hold waiting for it.
I'm not really certain a text output model can ever do well here.
1. User provides information
2. LLM generates structured output for whatever modeling language
3. Same or other multimodal LLM reviews the generated graph for styling / positioning issues and ensure its matches user request.
4. LLM generates structured output based on the feedback.
5. etc...
But you could probably fine-tune a multimodal model to do it in one shot, or way more effectively.
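A minimal sketch of that loop with the OpenAI Python client. The model choice, the iteration budget, and especially the render_mermaid() helper are all placeholders I'm assuming, not a tested pipeline:

    import base64
    from openai import OpenAI

    client = OpenAI()

    def render_mermaid(source: str) -> bytes:
        """Placeholder: render Mermaid source to a PNG (e.g. via mermaid-cli)."""
        raise NotImplementedError

    def diagram_from_description(description: str, rounds: int = 3) -> str:
        # Step 2: generate structured output (Mermaid source) from the user's request.
        diagram = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content":
                       f"Write a Mermaid diagram for: {description}. Reply with Mermaid source only."}],
        ).choices[0].message.content

        for _ in range(rounds):
            # Step 3: let a multimodal model look at the rendered image and critique it.
            image_b64 = base64.b64encode(render_mermaid(diagram)).decode()
            review = client.chat.completions.create(
                model="gpt-4.1",
                messages=[{"role": "user", "content": [
                    {"type": "text", "text":
                     f"Does this rendering match the request '{description}'? "
                     "If yes, reply OK. Otherwise reply with corrected Mermaid source only."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ]}],
            ).choices[0].message.content
            if review.strip() == "OK":
                break
            diagram = review   # Step 4: regenerate from the feedback and loop.
        return diagram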
Has become my go to for use in Cursor. Claude 3.7 needs to be restrained too much.
And it often just stops like “ok this is still not working. You fix it and tell me when it’s done so I can continue”.
But for coding: Gemini Pro 2.5 > Sonnet 3.5 > Sonnet 3.7
In my experience whenever these models solve a math or logic puzzle with reasoning, they generate extremely long and convoluted chains of thought which show up in the solution.
In contrast a human would come up with a solution with 2-3 steps. Perhaps something similar is going on here with the generated code.
Sadly I stopped my subscription when you removed the ability to weight my own domains...
Otherwise the fine-tune of your output format for technical questions is great, with the options, the pros and cons, and the mermaid diagrams. Just way better for technical searches than what all the generic services can provide.
Same here, which is a real shame. I've switched to DeepResearch with Gemini 2.5 Pro over the last few days where paid users have a 20/day limit instead of 10/month and it's been great, especially since now Gemini seems to browse 10x more pages than OpenAI Deep Research (on the order of 200-400 pages versus 20-40).
The reports are too verbose but having it research random development ideas, or how to do something particularly complex with a specific library, or different approaches or architectures to a problem has been very productive without sliding into vibe coding territory.
My son and I go to a lot of concerts and collect patches. Unfortunately we started collecting long after we started going to concerts.
I had a list of about 30 bands I wanted patches for.
I was able to give precise instructions on what I wanted. Deep research came back with direct links for every patch I wanted.
It took me two minutes to write up the prompt and it did all the heavy lifting.
I'm all-in on Deep Research. It can conduct research on niche historical topics that have no central articles in minutes, which typically were taking me days or weeks to delve into.
What I love most about history is it has lots of irreducible complexity and poring over the literature, both primary and secondary sources, is often the only way to develop an understanding.
For example when I've wanted to understand an unfolding story better than the news, I've told it to ignore the media and go only to original sources (e.g. speech transcripts, material written by the people involved, etc.)
Because it's quite long, if I asked Perplexity* to remind me what something meant, it would very rarely return something helpful, but, to be fair, I can't really fault it for being a bit useless with a very difficult-to-comprehend text, where there are several competing styles of reading, many of whose proponents are convinced they are correct.
But I started to notice a pattern where it would pull answers from some weird spots, especially when I asked it to do deep research. Like, a paper from a university's server that's using concepts in the book to ground qualitative research, which is fine, and practical explications are often useful ways into a dense concept, but it's kinda a really weird place to be the first initial academic source. It'll draw on Reddit a weird amount too, or it'll somehow pull a page of definitions from a handout for some university tutorial. And it won't default to the peer-reviewed free philosophy encyclopedias that are online and well known.
It's just weird. I was just using it to try and reinforce my actual reading of the text but I more came away thinking that in certain domains, this end of AI is allowing people to conflate having access to information, with learning about something.
*it's just what I have access to.
So something like this: "Here's a PDF file containing Being and Time. Please explain the significance of anxiety (Angst) in the uncovering of Being."
Is that an LLM hallucination?
Ha! That's the funniest and best description of 4.5 I've seen.
I prefer to use only open source models that don't have the possibility to share my data with a third party.
Fully private and local inference is indeed great, but of the centralized players, Google, Microsoft, and Apple are leagues ahead of the newer generation in conservatism and care around personal data.
If they abstract all this away into one interface I won't know which model I'm getting. I prefer reliability.
GPT‑4.1, GPT‑4.1 mini, GPT‑4.1 nano
I'll start with
800 bn MoE (probably 120 bn activated), 200 bn MoE (33 bn activated), and 7 bn parameters for nano
My take aways:
- This is the first model from OpenAI that feels relatively agentic to me (o3-mini sucks at tool use, 4o just sucks). It seems to be able to piece together several tools to reach the desired goal and follows a roughly coherent plan.
- There is still more work to do here. Despite OpenAI's cookbook[0] and some prompt engineering on my side, GPT-4.1 stops quickly to ask questions, getting into a quite useless "convo mode". Its tool calls fail way too often as well, in my opinion.
- It's also able to handle significantly less complexity than Claude, resulting in some comical failures. Where Claude would create server endpoints, frontend components and routes and connect the two, GPT-4.1 creates simplistic UI that calls a mock API despite explicit instructions. When prompted to fix it, it went haywire and couldn't handle the multiple scopes involved in that test app.
- With that said, within all these parameters, it's much less unnerving than Claude and it sticks to the request, as long as the request is not too complex.
My conclusion: I like it, and totally see where it shines: narrow, targeted work, alongside Claude 3.7 for creative work and Gemini 2.5 Pro for deep, complex tasks. GPT-4.1 does feel like a smaller model compared to these last two, but maybe I just need to use it for longer.
0: https://cookbook.openai.com/examples/gpt4-1_prompting_guide
I hope they release a distillation of 4.5 that uses the same training approach; that might be a pretty decent model.
For me, it was jaw dropping. Perhaps he didn't mean it the way it sounded, but seemed like a major shift to me.
Their value is firmly rooted in how they wrap ux around models.
We are in a race to make a new God, and the company that wins the race will have omnipotent power beyond our comprehension.
After everyone else caught up: The models come and go, some are SOTA in evals and some not. What matters is our platform and market share.
We will also begin deprecating GPT‑4.5 Preview in the API, as GPT‑4.1 offers improved or similar performance on many key capabilities at much lower cost and latency. GPT‑4.5 Preview will be turned off in three months
Here's something I just don't understand: how can GPT-4.5 be worse than 4.1? Or is the only bad thing OpenAI's naming ability?

- It's basically GPT-4o level on average.
- More optimized for coding, but slightly inferior in other areas.
It seems to be a better model than 4o for coding tasks, but I'm not sure if it will replace the current leaders -- Gemini 2.5 Pro, o3-mini / o1, Claude 3.7/3.5.
Broad Knowledge: 25.1
Coder: Larger Problems: 25.1
Coder: Line focused: 25.1
Sam acknowledged this a few months ago, but with another release not really bringing any clarity, this is getting ridiculous now.
Wait, wouldn’t this be a decent test for reasoning ?
Every patch changes things, and there’s massive complexity with the various interactions between items, uniques, runes, and more.
Getting better at code is something you can verify automatically, same for diff formats and custom response formats. Instruction following is also either automatically verifiable, or can be verified via LLM as a judge.
I strongly suspect that this model is a GPT-4.5 (or GPT-5???) distill, with the traditional pretrain -> SFT -> RLHF pipeline augmented with an RLVR stage, as described in Lambert et al[1], and a bunch of boring technical infrastructure improvements sprinkled on top.
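For anyone unfamiliar with RLVR, the "verifiable" part just means the reward comes from a program rather than a human or a reward model. A toy sketch of what such a reward could look like for code edits (entirely illustrative; not claiming this is what OpenAI does):

    import subprocess

    def verifiable_reward(candidate_patch: str, repo_dir: str) -> float:
        # Half the reward for a diff that applies cleanly (format/diff adherence),
        # the other half for the test suite passing afterwards (correctness).
        reward = 0.0
        check = subprocess.run(["git", "apply", "--check", "-"], input=candidate_patch,
                               text=True, cwd=repo_dir, capture_output=True)
        if check.returncode != 0:
            return reward                      # malformed diff: zero reward
        reward += 0.5
        subprocess.run(["git", "apply", "-"], input=candidate_patch,
                       text=True, cwd=repo_dir)
        tests = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)
        if tests.returncode == 0:
            reward += 0.5
        return reward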
It is just such a big failure of OpenAI not to include smart routing on each question and hide the complexity of choosing a model from users.
Would be nice if there was at least some hint as to where T3 Tools Inc. is located and what jurisdiction applies.
https://platform.openai.com/docs/models/gpt-4.1
As far as I can tell there's no way to discover the details of a model via the API right now.
Given the announced adoption of MCP and MCP's ability to perform model selection for Sampling based on a ranking for speed and intelligence, it would be great to have a model discovery endpoint that came with all the details on that page.
The graphs presented don't even show a clear winner across all categories. The one with the biggest "number", GPT-4.5, isn't even the best in most categories; actually it's like 3rd in a lot of them.
This is quite confusing as a user.
Otherwise big fan of OAI products thus far. I keep paying $20/mo, they keep improving across the board.
and it ties on a lot of benchmarks
> One last note: we’ll also begin deprecating GPT-4.5 Preview in the API today as GPT-4.1 offers improved or similar performance on many key capabilities at lower latency and cost. GPT-4.5 in the API will be turned off in three months, on July 14, to allow time to transition (and GPT 4.5 will continue to be available in ChatGPT).
https://x.com/OpenAIDevs/status/1911860805810716929