Gemini 3.0 Pro – early tests

https://twitter.com/chetaslua/status/1973694615518880236

217•ukuina•4mo ago

Comments

simonw•4mo ago

I've seen a bunch of tweets like this recently, as far as I can tell they're all from people using https://aistudio.google.com/ who got served an A/B test.

A few more in this genre:

https://x.com/cannn064/status/1973818263168852146 - "Make a SVG of a PlayStation 4 controller"

https://x.com/cannn064/status/1973415142302830878 "Create a single, self-contained HTML5 file that mimics a macOS Sonoma-style desktop: translucent menu bar with live clock, magnifying dock, draggable/resizable windows, and a dynamic wallpaper. No external assets; use inline SVG for icons."

https://x.com/synthwavedd/status/1973405539708056022 "Write full HTML, CSS and Javascript for a very realistic page on Apple's website for the new iPhone 18"

I've not seen it myself so I'm not sure how confident they are that it's Gemini 3.0.

ceejayoz•4mo ago

> a very realistic page on Apple's website…

Is this supposed to be a good example?

It looks like something I'd put together, and you don't want me doing design work.

ajcp•4mo ago

At this point until I see one run through the Pelican Benchmark I can't really take a new model seriously.

diggan•4mo ago

Unfortunately, as with every public benchmark, once it ends up in the training sets and/or the developers are aware of it, it stops being effective, and I think we've started to reach that point.

The only thing I've found to give me some sort of quantitative idea of how good a new model is, is my own private benchmarks. It doesn't cover everything I want to use LLMs for, and only has 20-30 tests per "category", but at least I'm 99% sure it isn't in the training datasets.

simonw•4mo ago

I have a few "SVG of an X riding a Y" tests that I don't publish online which I run occasionally to see if a model is suspiciously better at drawing a pelican riding a bicycle than some other creature on some other form of transport.
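A rough sketch of how such held-out prompts could be generated. The word lists, the prompt wording, and the function name are made up for illustration and are not the actual private tests:

```python
import itertools
import random

# Hypothetical word lists; the real private combinations are not public.
CREATURES = ["pelican", "platypus", "tardigrade", "chicken"]
VEHICLES = ["bicycle", "unicycle", "surfboard", "skateboard"]

# The one combination that is widely published and may be in training data.
PUBLIC_PAIR = ("pelican", "bicycle")


def held_out_prompts(n, seed=0):
    """Return n distinct 'X riding a Y' prompts, excluding the public pair."""
    pairs = [
        pair
        for pair in itertools.product(CREATURES, VEHICLES)
        if pair != PUBLIC_PAIR
    ]
    rng = random.Random(seed)  # seeded so the same set can be rerun later
    chosen = rng.sample(pairs, n)
    return [f"Generate an SVG of a {x} riding a {y}" for x, y in chosen]


if __name__ == "__main__":
    for prompt in held_out_prompts(5):
        print(prompt)
```

If a model does noticeably better on the public pair than on prompts like these, that gap is the suspicious signal.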

I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

ajcp•4mo ago

-> I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

Cue intro: "The gang wastes their time cheating on a dumb benchmark"

mcny•4mo ago

A shower thought I just had: there must be some AI training company somewhere that has ingested all of It's Always Sunny in Philadelphia, not just the text but all the video from all episodes somehow...

Imustaskforhelp•4mo ago

Please do let us know through your blog post if you ever find AI labs to cheat on your benchmark.

But now I am worried that since you have shared that you do the SVG of an X riding a Y thing, maybe these models will try to cheat on the whole SVG of X riding Y thing instead of hyper-focusing on the pelican.

So now I suppose you might need to come up with an entirely new thing though :)

throwup238•4mo ago

There are so many X and Y combinations that I find it hard to believe they could realistically train for even a small fraction of them. Someone has to generate the graphics output for the training.

A duck billed platypus riding a unicycle? A man o' war riding a pyrosome? A chicken riding a Quetzalcoatlus? A tardigrade riding a surf board?

gnatolf•4mo ago

You're assuming that given the collection of simonw's publicly available blog posts, the creativity of those combinations can't be narrowed down. Simply reverse engineer his brain this way and you'll get your Xs and Ys ;)

throwup238•4mo ago

I feel like that would over fit on various snakes like pythons.

fragmede•4mo ago

If we accept ChatGPT telling me that there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per, that's ~1270 years, but then if we parallelize it on a supercomputer that can do 100,000 per second that would only take about 4.6 days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that still seems infeasible.
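The arithmetic is easy to check with a quick sketch. The 200k noun count and the 100,000-per-second throughput are the assumptions stated above, and the function name is just for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def generation_time(nouns=200_000, per_second=100_000):
    """Return (combinations, serial years at 1/s, days at per_second/s)."""
    combinations = nouns ** 2  # every ordered "X riding Y" pair
    serial_years = combinations / SECONDS_PER_YEAR
    parallel_days = combinations / per_second / SECONDS_PER_DAY
    return combinations, serial_years, parallel_days


if __name__ == "__main__":
    combos, years, days = generation_time()
    print(f"{combos:,} combinations")
    print(f"{years:,.0f} years at one per second")
    print(f"{days:.1f} days at 100,000 per second")
```

That comes to 40 billion combinations, roughly 1,270 years done serially, and about 4.6 days at 100,000 per second.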

throwup238•4mo ago

It still can't satisfactorily draw a pelican on a bicycle because that's either not in the training data or the signal is too weak, so why would it be able to satisfactorily draw every random noun-riding-noun combination just because you threw a for loop at it?

The point is that in order to cheat on @simonw's benchmark across any arbitrary combination, they'd have to come up with an absurd number of human crafted input-output training pairs with human produced drawings. You can't just ask ChatGPT to generate every combination because all it'll produce is garbage that gets a lot worse the further from a pelican riding a bicycle.

It might work at first for the pelican and a few other animal/transport combinations, but what does it even mean for a man o' war to ride a pyrosome? I asked every model I have access to to generate an SVG for a "man o' war riding a pyrosome" and not a single one managed to draw anything resembling a pyrosome. Most couldn't even produce something resembling a man o' war except as a generic ellipsoid-shaped jellyfish with a few tentacles.

Expand that to every weird noun-noun combination and it's just not practical to train even a tiny fraction of them.

fragmede•4mo ago

https://chatgpt.com/share/68def5c5-8ca4-8009-bbca-feabbe0651...

Man'o'war on a pyrosome. I don't know what you expected it to look like, maybe it could be more whiteish translucent instead of orange, but it looks fairly reasonable to me. Took a bit over a minute with the ChatGPT app.

Simonw's test is for the text-only output from an LLM to write an SVG, not "can a multimodal AI in 2025" generate a PNG. By having pictures of pelicans on bicycles in the training data in PNG format, from people wanting to see one, after reading his blog, there are now raster-based images from an image generation model that fairly convincingly look as described in the training data. Now that there's PNGs of pelicans on bicycles, we would expect GPT-6 to be better at generating SVGs of something it's already "seen".

We don't know what simonw's secret combo X and Y is, nor do I want to know, because that would ruin the benchmark (if it isn't ruined already by virtue of him having asked it). 200k nouns is definitely high though. A bit of thought could cut it down to exclude concepts and a lot of other things. How much spare GPU capacity OpenAI has, I have no idea. But if I were there, I'd want the GPUs to be running as hot as the cloud provider would let me run them, because they're paying per hour, not per watt, and have a low-priority queue of jobs for employees to generate whatever extra training data they can think of on their off hours.

Oh and here's the pelican PNG so the other platforms can crawl this comment and slurp it up.

https://chatgpt.com/share/68def958-3008-8009-91fa-99127fc053...

brianjking•4mo ago

I must say that I loved the idea of a tardigrade riding a surfboard. You're welcome.

Granted not an SVG, but still awesome.

https://imgur.com/a/KsbyVNP

diggan•4mo ago

> I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!

I don't think it's necessarily "cheating", it just happens as they're discovering and ingesting large ranges of content. A problem of public content, it's bound to be included sooner or later, directly or indirectly.

Nice to hear you're doing some sort of contingency though, and looking forward to the inevitable blog post announcing the change to a different bird and vehicle :)

svachalek•4mo ago

The thing is most of the discussion about it is embarrassingly bad SVGs so training on them would actually hurt their performance.

JSR_FDED•4mo ago

Regrettably AI is still better at SVG than I am

reissbaker•4mo ago

I doubt they'd cheat that obviously... But "SVG of X" has become common enough that I suspect most frontier labs train on it, especially since the models are multimodal now anyway.

Not that I mind; I want models to be good at generating SVG! Makes icons much simpler.

fragmede•4mo ago

But how would you know it's from what you would consider cheating as opposed to pelicans on bicycles existing in the latest training data? Obviously your blog gets fed into the training set for GPT-6, as well as everyone else talking about your test, so how would the comparison to a secret X riding a Y tell you if an AI lab is cheating as opposed to merely there being more examples in the training data?

simonw•4mo ago

Mainly because if they train on the pelican on bicycle SVGs from my blog they are going to get some very weird looking pelicans riding some terrible looking bicycles.

fragmede•4mo ago

It's not that I'm claiming they're training on SVG pelicans on bicycles from your blog, it's that thanks to your popularity, there are simply now more pictures of pelicans on bicycles floating around on the Internet and thus ChatGPT's training data. Eg https://www.reddit.com/r/ColoredPencils/comments/1l9l4fq/pel...

How would you determine that improvements to SVG pelicans on bicycles (and not your secret X on Ys) are from an OpenAI employee cheating your benchmark vs being an improvement on pelicans on bicycles thanks to that picture from Reddit and everywhere else in the training data?

simonw•4mo ago

See comment here: https://news.ycombinator.com/item?id=45454269

jgalt212•4mo ago

Your benchmark may or may not be dumb, but it is definitely widely followed. So much so this is what Bing AI has to say on the matter.

> Absolutely — the “pelican riding a bicycle” SVG test is a quirky but clever benchmark created by Simon Willison to evaluate how well different large language models (LLMs) can generate SVG (Scalable Vector Graphics) images from a prompt that’s both unusual and unlikely to be in their training data.

ajcp•4mo ago

That's the move right there.

latemedium•4mo ago

We need to know if big AI labs are explicitly training models to generate SVGs of pelicans on bicycles. I wouldn't put it past them. But it would be pretty wild if they did!

londons_explore•4mo ago

As soon as you use your private tests, all the AI companies vacuum up the input to use to train the next model.

Obviously they're only getting the question and not a perfect answer, but with today's process of generating hundreds of potential answers and getting another model to choose the best/correct one for training, I don't think that matters.
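As a toy sketch of that generate-many, pick-the-best loop: `generate` and `judge` stand in for real model calls, and the fake versions below are made up purely so the example runs. Nothing here reflects how any lab actually does it.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], str],
    judge: Callable[[str, str], float],
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n candidate answers and keep the one the judge scores highest."""
    candidates = [generate(prompt, i) for i in range(n)]
    scored = [(judge(prompt, candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-ins for the generator model and the judge model.
def fake_generate(prompt: str, attempt: int) -> str:
    # Attempt i "draws" i circles, so later attempts look more detailed.
    return "<svg>" + "<circle/>" * attempt + "</svg>"


def fake_judge(prompt: str, candidate: str) -> float:
    # Toy scoring rule: more shapes means a higher score.
    return float(candidate.count("<circle"))


if __name__ == "__main__":
    answer, score = best_of_n(
        "pelican riding a bicycle", fake_generate, fake_judge, n=5
    )
    print(score, answer)
```

The winning (prompt, answer) pair is what would be kept as a training example, which is why only having the question is not much of an obstacle.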

astrange•4mo ago

Are the models capable of judging a good SVG? They can't read ASCII art.

londons_explore•4mo ago

If you give the 'judge' models tool use, they could easily fire up a web browser to render an SVG and then use imagenet or something to see how 'pelican-y' the result is.
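Before any rendering or image classification, a pipeline like that could at least throw away outputs that aren't well-formed SVG. A minimal sketch of just that pre-filter, using only the standard library; it does not render anything or measure how 'pelican-y' the result is, and the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def is_well_formed_svg(text: str) -> bool:
    """True if text parses as XML and its root is a namespaced <svg> element."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # ElementTree reports namespaced tags as "{namespace}localname".
    return root.tag == f"{{{SVG_NS}}}svg"


if __name__ == "__main__":
    good = f'<svg xmlns="{SVG_NS}"><circle r="10"/></svg>'
    broken = "<svg><circle>"
    print(is_well_formed_svg(good), is_well_formed_svg(broken))
```

This version is deliberately strict: an `<svg>` root without the SVG namespace is rejected, since it would not display as a standalone image file.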

Workaccount2•4mo ago

I honestly think people really blow out of proportion the effect of "being in the training set". The internet is ridden with examples of problem/solution posts that many models definitely trained on, but still get wrong.

More important would be post training, where the labs specifically train on the exact question. But it doesn't seem like this is happening for most amateur benchmarks at least. All the models that are good at pelican bike have been good at whatever else you throw at them to SVG.

epolanski•4mo ago

Jm2c I couldn't care less about these vibecode style benchmarks and I'm sick of them.

The things that I see represented here, may or may not be impressive, but sure as hell have never been the major blockers in achieving progress on complex tasks and software.

I understand you're merely reporting, thank you for that, not criticizing you, but those tests are absolutely irrelevant.

mnk47•4mo ago

In my experience, the model's performance in silly tasks like these is usually (not always) correlated with its performance in other areas except tool use/agent stuff.

esafak•4mo ago

We can't see the code and the challenge is pedestrian. Nothing to see here.

Oras•4mo ago

These tests mean nothing; I have yet to see a model that is better than Sonnet 4 for coding. I tried many, all of them are sub-par, even with a small code base.

nnevatie•4mo ago

Well, Codex with GPT5 High beats Claude Sonnet 4.5 - this is anecdotal, but I've used both extensively.

solarkraft•4mo ago

At what speed? At some point you’ll have to compare to Opus.

adastra22•4mo ago

And Sonnet 4.5 is better than Opus.

Bolwin•4mo ago

Well yeah no surprise. You should try glm 4.6

Oras•4mo ago

I tried it, and it was shockingly bad compared to their benchmarks and to Claude Sonnet 4.

I tried it with Claude Code CLI, it didn't follow instructions correctly (I had a Claude.md file with clear instructions), stopped after a few implementations (less than 3 minutes), and produced code that does not work.

For the benefit of the doubt, I changed instructions to be NextJS platform as I thought it's a known framework and it might do better, but still, same quality issues.

adastra22•4mo ago

Well, Sonnet 4.5 is better.

strongpigeon•4mo ago

Google's biggest problem in my opinion (and I'm saying that as an ex-googler) is that Google doesn't have a product culture. Google had the tech for something like ChatGPT for a long time, but couldn't come up with that product. Instead it had to rely on another company showing it the way and then copy them and try to out-engineer them...

I still think ultimately (and somewhat sadly) Google will win the AI race due to its engineering talent and the sheer amount of data it has (and Android integration potential).

sho_hn•4mo ago

To be fair, according to OpenAI they started ChatGPT as a demo/experiment and were taken by surprise when it went viral.

It may well be that they also didn't have a product culture as an organization, but were willing to experiment or let small teams do so.

It's still a lesson, but maybe a different one.

With organizational scale it becomes harder and harder to launch experiments under the brand. Red tape increases, outside scrutiny increases. Retaining the ability to do that is difficult.

Google does experiment a fair bit (including in AI, e.g. NotebookLM and its podcast feature are I think a standout example of trying to see what sticks) but they also tend to try to hide their experiments in developer portals nowadays, which makes it difficult to get a signal from a general consumer audience.

strongpigeon•4mo ago

Google is definitely good at experimenting (and yeah NotebookLM is really cool), which is a product of the bottom-up culture. The lack of a consistent story with regard to AI products however is a testament to the lack of product vision from the top.

ajcp•4mo ago

NotebookLM came out of Google Labs though, and in collaboration with outside stakeholders. I'm not sure I would call it a success of "bottom-up" culture, but a well realized idea from a dedicated incubator. That doesn't necessarily mean the rest of the company is so empowered or product oriented.

ajcp•4mo ago

-> With organizational scale it becomes harder and harder to launch experiments under the brand

I feel like Google tried to solve for this with their `withgoogle.com` domain and it just ends up being confusing or worse still, frustrating when you see something awesome and then nothing ever comes of it.

thereitgoes456•4mo ago

According to Karen Hao's Empire of AI, this is only half accurate. And I trust what Karen Hao says a lot more.

OpenAI mistakenly thought Anthropic was about to launch a chatbot, and ChatGPT was a scrappy, rushed-out-the-door product made from an intermediate version of GPT-4, meant to one-up them. Of course, they were surprised at how popular it became.

kristianp•4mo ago

Do you mean an intermediate version of GPT-3? That's more the timeline I'm thinking.

dudeinhawaii•4mo ago

If I can take a slight tangent. This is what I will remember OpenAI for. Not the Closed vs Open debate. They caused the democratization of access to AI models. Prior to ChatGPT, I would hear about these great models Deep Mind and Google were developing. They'd always stay closed behind the walls of Google.

OpenAI forced Google to release and as a result, we have all of the AI tooling, integrations, and models. Meta's leaning into the stolen Llama code took this further and sparked the Open Source LLM revolution (in addition to the myriad contributors and researchers who built on that).

If we had left it to Google, I suspect they'd release tooling (as they did with TensorFlow) but not an LLM that might compete with their core product..

byefruit•4mo ago

And even when it does copy other products, it seems to be doing a terrible job of them.

Google's AI offering is a complete nightmare to use. Three different APIs, at least two different subscriptions, documentation that uses them interchangeably.

For Gemini's API it's often much simpler to actually pay OpenRouter the 5% surcharge to BYOK than deal with it all.

I still can't use my Google AI Pro account with gemini-cli..

cshores•4mo ago

As of this week you can use gemini-cli with Google AI Pro

gardnr•4mo ago

Then there's the billing dashboards...

It's amazing how they can show useless data while completely obfuscating what matters.

ur-whale•4mo ago

Yeah, the whole billing death march is what ended up making me pick OpenAI as my main workhorse instead of GOOG.

Not enough brain cycles to figure out a way to give Google money, whereas the OpenAI subscription was basically a no-brainer.

specproc•4mo ago

I had great fun this week with the batch API. A good morning lost trying to work out how to do a not particularly complex batch request via JSONL.

The python library is not well documented, and has some pretty basic issues that need looking at. Terrible, unhelpful errors, and "oh, so this works if I put it in camel-case" sort of stuff.

leobg•4mo ago

litellm + gemini API key?

I find Gemini is their first API that works like that. Not like their pre-Gemini vision, speech recognition, sheets etc.. Those were/are a nightmare to set up indeed.

xnx•4mo ago

> Google doesn't have a product culture

Fair criticism that it took someone else to make something of the tech that Google initially invented, but Google is furiously experimenting with all their active products since Sundar's "code red" memo.

adventured•4mo ago

Along with its engineering talent and resource scale, I think their in-house chips are one of their core advantages. They can scale in a way that their peers are going to struggle to match, and at much lower cost. Nvidia's extreme margins are Google's opportunity.

renewiltord•4mo ago

Well, they had an internal ethics team that told them that their technology was garbage. That can't help. The other guys' ethics teams are all like "Our stuff is too awesome for people to use. No one should have this kind of unbridled power. We must muzzle the beast before a tourist rides him" and Google's ethics team was like "our shit sucks lol this is just a Markov chain parrot doesn't do shit it's garbage".

Filligree•4mo ago

Which, to be fair—we're talking about the pre-GPT-3.5 era—it kind of was?

renewiltord•4mo ago

The unfortunate truth when you're on the cusp of a new technology: it isn't good yet. Keeping a team of guys around whose sole job it is to tell you your stuff sucks is probably not aligned with producing good stuff.

elcritch•4mo ago

There's almost like an "uncanny valley" type situation with good products. As in, new technologies start out promising but less than okay. Then as they get better and come close to being a "good project", the more it feels like it's not there yet. In that way it could feel sort of worse than a mediocre project. Until it's done.

nicr_22•4mo ago

There's a world of difference between saying "our stuff sucks" vs "here are the specific ways our stuff isn't ready for launch". The former is just whining, the latter is what a good PM does.

charcircuit•4mo ago

Don't you remember all of the scaremongering around how unethical it would be to release a GPT3 model publicly?

Google personally reached out to someone trying to reproduce GPT3 and convinced him to abandon his plan of releasing it to the public.

Imustaskforhelp•4mo ago

And here we are after deepseek and the qwen models and so so much more like glm 4.6 which are reaching sota of sorts.

mlsu•4mo ago

There was scaremongering about releasing GPT-2.

GPT-2!!

charcircuit•4mo ago

You're right. I was remembering gpt2 and it was OpenAI that reached out. He was in contact with Google to get the training compute.

https://medium.com/@NPCollapse/the-hacker-learns-to-trust-62...

pixl97•4mo ago

I mean, the level of scams that have occurred since that time due to LLMs has increased so it's not exactly wrong.

thewebguyd•4mo ago

> is that Google doesn't have a product culture.

This is evident in Android and the pixel lineup, which could be my favorite phone if not for some of the most baffling and frustrating decisions that lead to a very weirdly disjointed app experience (comparing to something like iOS's first party tools).

Like removing location based reminders from google tasks, for some reason? Still no apple shortcuts-like automation built-in, keep can still do location based reminders but it's a notes app so which am I supposed to use? Google tasks or keep? Well, gemini adds reminders to google tasks and not keep if I wanted to use keep primarily.

If they just spent some time polishing and integrating these tools, and add some of their ML magic to it they'd blow Apple out of the park.

All of Google's tech is cool and interesting, from a tech standpoint but it's not well integrated for a full consumer experience.

xooooogler•4mo ago

Google recently let go ALL -- EVERY SINGLE -- L3/L4/L5 UX Researcher

https://www.thevoiceofuser.com/google-clouds-cuts-and-the-bi...

Could it be argued that perhaps UX Research was not working at all? Or that their recommendations were not being incorporated? Or that things will get even worse now without them?

seemaze•4mo ago

Maybe Apple should follow suit.. I jest, but I’m still processing the liquid glass debacle.

thewebguyd•4mo ago

At least it's uniform. Unlike Material 3 expressive which might look different depending on the app, or not be implemented at all, or only half implemented in some of Google's own apps even, much like with every other Android redesign.

I get Google can't force it on all the OEMs with their custom skins, but they can at least control their own PixelOS and their own apps.

layer8•4mo ago

It’s not uniform at all. Some parts of the interface and of their apps get it, others don’t. Some parts look more glassy, some more frosty. It’s all over the place in terms of consistency. It’s also quite different between Apple’s various OSs, although allegedly the purpose was to unify their look.

tanaros•4mo ago

The link says:

> Some teams in the Google Cloud org just laid off all UX researchers below L6

That’s not all UX researchers below L6 in the entire company. It doesn’t even sound like it’s all UX researchers below L6 in Google Cloud.

rubslopes•4mo ago

I still can't fathom how one of my favorite Android features simply disappeared years ago: the 'time to leave' notification for calendar appointments with address info.

killerstorm•4mo ago

ChatGPT-3.5 was more of a novelty than a product.

It would be weird to release that as a serious company. They tried making a deliberately-wacky chatbot but it was not fun.

Letting OpenAI release it first was the right move.

Imustaskforhelp•4mo ago

To me, I want openai to release the Chatgpt 3 and chatgpt 3.5 as the phenomenal leap of intelligence and even I appreciated the Chatgpt 3 a lot, more so than even now like It had its quirks but it was such a good model man.

I remember forming a really simple dead simple sveltekit website during Chatgpt 3. It was good, it was mind blowing and I was proud of it.

The only interactivity was a button which would go from one color to other and it would then lead to a pdf.

If I am going to be honest, the UI was genuinely good. It was great tho and still gives me more nostalgia and good vibes than current models. Em-dashes weren't that common in Chatgpt 3 iirc but I have genuinely forgotten what it was like to talk to it

wmf•4mo ago

Didn't Google have Bard internally around the same time as ChatGPT?

eternal_braid•4mo ago

Search for Meena from Google.

gardnr•4mo ago

Most people might remember it from the headlines:

> In June 2022, LaMDA gained widespread attention when Google engineer Blake Lemoine made claims that the chatbot had become sentient. The scientific community has largely rejected Lemoine's claims...

From https://en.wikipedia.org/wiki/LaMDA

FergusArgyll•4mo ago

Yeah, that was my introduction to LLMs!

Workaccount2•4mo ago

https://research.google/blog/towards-a-conversational-agent-...

Damn, that's crazy. Or at least in hindsight it is. I don't remember anything big deal being made about it back then.

blueg3•4mo ago

Bard came out shortly after ChatGPT as a prototype of what would become Gemini-the-chatbot.

There were other, less-available prototypes prior to that.

Rebelgecko•4mo ago

Meena/Lamda were around the same time as gpt-2

londons_explore•4mo ago

> Android integration potential

Nearly all the people that matter use iPhone... Yet Apple really hasn't had much success in the AI world, despite being in a position to win if their product is even only vaguely passable.

dyauspitr•4mo ago

Why sadly? I’d rather the originators of the technology win.

lurking_swe•4mo ago

It's a different skillset, and also partially company culture.

For example does a CSS expert know how to design a great website? _maybe_…but knowing the CSS spec in its entirety doesn’t (by itself) help you understand how to make useful or delightful products.

jvolkman•4mo ago

I don't think Google was ever going to be the first to productize an LLM. LLMs say stupid shit - especially in the early days - and would've just attracted even more bad press if Google had been the front runner. OpenAI came along as a small, move-fast-and-break-things entity and introduced this tech to the public, and Google (and others) was able to join the fray after that seal was broken.

elcritch•4mo ago

Good point, if Google had released the first version of Bard or whatnot as the first LLM it probably would've received some good press but also a lot of "eh just another Google toy side project". I could've seen myself saying that.

danielbln•4mo ago

It would've joined the Google graveyard for sure.

stingraycharles•4mo ago

This has plagued Google internally for decades. I’m reminded of Steve Yegge’s Google rant [1] from 14 years ago, and ChatGPT is evidence that they still haven’t fixed it.

It’s amazing how pervasive company cultures can be, and how this comes from the top, and can only be fixed with replacing leadership with an extremely talented CEO that knows the company inside out and can change its course. Nadella from Microsoft comes to mind, although that was more about Microsoft going back to its roots (replace sales oriented leadership with product oriented leadership again).

Google never had product oriented leadership in the same way that Amazon, Apple and Microsoft had.

I don’t think this will ever change at this point.

For those who haven’t read it, Steve Yegge’s rant about Google is worth your time:

1 https://gist.github.com/chitchcock/1281611

HarHarVeryFunny•4mo ago

OpenAI were the ones that came up with RLHF, which is what made ChatGPT viable.

Without RLHF, LLM-based chat was a psychotic liability.

raincole•4mo ago

And we (average users) are really lucky for that. Imagine a world where Google had been pushing AI products in the first place. OpenAI and other competitors would not stand a chance and it would have ads in 2024. They'd have captured hundreds of billions of value by now.

The fact alone that Attention Is All You Need was freely available online was unbelievably fortunate in hindsight.

maerch•4mo ago

I still have a bad taste in my mouth after all those GPT-5 hype articles that claimed the model was just one step away from AGI.

gardnr•4mo ago

TBF, they all believed that scaling reinforcement learning would achieve the next level. They had planned to "war-dial" reasoning "solutions" to generate synthetic datasets which achieved "success" on complex reasoning tasks. This only really produced incremental improvements at the cost of test-time compute.

Now Grok is publicly boasting PhD level reasoning while Surge AI and Scale AI are focusing on high quality datasets curated by actual PhD humans.

Surge AI is boasting $1B in revenue, and I am wondering how much of that was paid in X.ai stock: https://podcasts.apple.com/us/podcast/the-startup-powering-t...

In my opinion the major advancements of 2025 have been more efficient models. They have made smaller models much, much better (including MoE models) but have failed to meaningfully push the SoTA on huge models; at least when looking at the USA companies.

svachalek•4mo ago

Same, qwen3 omni blows my mind for what a 30b-A3b model can do. I had a video chat with it and it correctly identified plant species I showed it.

ACCount37•4mo ago

Raw model size is still pegged by the hardware.

You can try to build a monster the size of GPT-4.5, but even if you could actually make the training stable and efficient at this scale, you still would suffer trying to serve it to the users.

Next generation of AI hardware should put them in reach, and I expect that model scale would grow in lockstep with new hardware becoming available.

adastra22•4mo ago

Without defining “AGI” that’s always true, and trivially so.

vunderba•4mo ago

Outside of the aesthetic, the very first example on that twitter post is "balls bouncing around a constrained rotating rigid physics environment" which has been trivially one-shottable since Claude Code was first announced.

It was one of the first things I tried when Claude Code went GA:

https://gondolaprime.pw/hex-balls

Synaesthesia•4mo ago

They have differing degrees of fidelity to the simulation; this one looks pretty good and it's got parameters, but yes the LLMs are really advanced now in what they can do. I was actually blown away during the Gemini 2.5 announcement with some of the demos people came up with.

ACCount37•4mo ago

I hope this is the one that unfucks the multi-turn instruction following.

One of the biggest issues holding Gemini back, IMO, compared to the competitors.

Many LLMs are still plagued by "it's easier to reset the conversation than to unfuck the conversation", but Gemini 2.5 is among the worst.

solarkraft•4mo ago

Gemini's loops are a real problem. Within a few minutes of using it in the CLI it happened to me („I can verify that I fulfilled the user’s request, I can verify that I fulfilled the user’s request …“). It’s telling that the CLI has a detection for this.

The other day I asked 2.5 Pro for suggestions. It would provide one, which I rejected with some reasoning. It would provide another, which I also rejected. Asked for more it would then loop between the two, repeating the previous suggestions verbatim. It went on for 3-4 times, even after being told to reflect on it and it being able to recite the rejection reasons.

renewiltord•4mo ago

Every three months there's some mind blowing hype around a Google product, lots of people talk about it, and then when I use it it's not nearly as good.

robots0only•4mo ago

In all of these posts there is someone claiming Claude is the best, then somebody else claiming they have tried a bunch of times and for them Gemini is the best while others find GPT-5 is supreme. Obviously, all of these are subjective narrow experiences. My conclusion is that all frontier models are both good and bad with no clear winner and making good evals is really hard.

Robdel12•4mo ago

Yeah, my take is it’s sort of up to the person using the LLM and maybe how they match to that LLM. That’s my hunch as to why we hear wildly different takes on these LLMs working for people. Gemini can be the most productive model for some while others find it entirely unworkable.

jiggawatts•4mo ago

Not just personalities and preferences, but the purpose for which the AI is being used also affects the results. I primarily use AIs for complex troubleshooting along the lines of: "Here's a megabyte of logs, an IaC template, and a gibberish error code. What's the reason?" Right now, only Gemini Pro 2.5 has any chance of providing a useful output given those inputs, because its long-context attention is better than any other model's.

binary132•4mo ago

The fact that there is so much astroturf out there also makes it difficult to evaluate these claims

SkyPuncher•4mo ago

I'll be that person:

* Gemini has the highest ceiling out of all of the models, but has consistently struggled with token-level accuracy. In other words, its conceptual thinking is well beyond other models, but it sometimes makes stupid errors when talking. This makes it hard to reliably use for tool calling or structured output. Gemini is also very hard to steer, so when it's wrong, it's really hard to correct.

* Claude is extremely consistent and reliable. It's very, very good at the details - but will start to forget things if things get too complex. The good news is Claude is very steerable and will remember those details if you remind it.

* GPT-5 seems to be completely random for me. It's so inconsistent that it's extremely hard to use.

I tend to use Claude because I'm the most familiar with it and I'm confident that I can get good results out of it.

Alex-Programs•4mo ago

Personally I prefer Gemini because I still use AI via chat windows, and it can do a good ~90k tokens before it starts getting stupid. I'm yet to find an agent that's actually useful, and doesn't constantly fuck up everywhere while burning money.

bcrosby95•4mo ago

GPT-5 seems best at analyzing the codebase for me. It can pick up nuances and infer strategies Claude and Gemini seem to fail at.

artdigital•4mo ago

I’d say GPT-5 is the best in following and remembering instructions. After an initial plan it can easily continue with said plan for the next 30-60 minutes without human intervention, and come back with a complete working finished feature/product.

It’s honestly crazy how good it is, coming from Claude. I never thought I could already pass something a design doc and have it one-shot the entire thing with such level of accuracy. Even with Opus, I always need to either steer it, or fix the stuff it forgot by hand / have another phase afterwards to get it from 90% to 100%.

Yes the Codex TUI sucks but the model with high reasoning is an absolute beast, and convinced me to switch from Claude Max to ChatGPT Pro

Workaccount2•4mo ago

Gemini is also the best for staying on the ball (when it does) over long contexts.

It's really the only model that can do large(er) codebase work.

brulard•4mo ago

Claude can do large code bases too, you just need to make it focus on parts that matter. Most of the coding tasks should not involve all parts of the code, right?

qaq•4mo ago

In my experience gemini is good at writing specs, it's hit or miss at reviewing code, and it's not really usable for iterating on code. Codex is slow but can crack issues that Claude Code struggles with. So my workflow has been to use all three to iterate on specs, have claude code work on implementation, and have Codex review claude code's work (sometimes have gemini double check it).

smoe•4mo ago

Capability wise, they seem close enough that I don’t bother re-evaluating them against each other all the time.

One advantage Gemini had (or still has, I’m not sure about the other providers) was its large context window combined with the ability to use PDF documents. It probably saved me weeks of work on an integration with a government system: I could upload hundreds of pages of documentation and immediately start asking questions, generating rules, and troubleshooting payloads that were leading to generic, computer-says-no errors.

No need to go through RAG shenanigans, and all of it within the free token allowance.

Keyframe•4mo ago

Answer is a classic programming one - it depends? There are definitely differences in strength and weaknesses among them.

I run claude CLI as a primary and just ask it nicely to consult gemini cli (but not let it do any coding). It works surprisingly well. OpenAI just fell out of my view. Even cancelled ChatGPT subscription. Gemini is leaping forward and _feels like_ ChatGPT-5 is a regression.. I can't put my finger on it tbh.

mlsu•4mo ago

Because how good a model is is mostly just what the training data is at this point.

It's like the personality of a person. Employee A is better at talking to customers than Employee B, but Employee B is better at writing code than Employee A. Is one better than the other? Is one smarter than the other? Nope. Different training data.

nharada•4mo ago

Gemini has always been the leader in multimodal work like images and video, I expect this won't be any different but am interested to see how it is

whywhywhywhy•4mo ago

These influencer tests are so pointless and don't represent the reality of model use at all when things are constantly being downgraded when people actually use the thing.

Not to mention every team will have the bouncing balls in the polygon in their dataset now.

geraldalewis•4mo ago

This seems like a parody, but I think it's not.

baxuz•4mo ago

Is there a source that isn't Twitter?

alberth•4mo ago

Am I the only one who struggles to even find where to use Google's AI offerings?

It took me way too long to figure out how to even access & use Veo 3.

It’s like Google doesn’t know how to package a product.

seandoe•4mo ago

gemini.google.com

Incipient•4mo ago

All these AI reviews seem to be following the axiom(?) "proof of the pudding is in the eating" but frankly I don't think that applies to code.

I can't get even gpt5 to create a new feature without generating completely awful code - making up facts where it can't find how it fits into the rest of the code - and functionality spawning error ridden unmaintainable mess.

I've spent this whole week debugging AI trash. And it's not fun.

daft_pink•4mo ago

Did they fix the fact that they train on your data on personal plans that you pay for unless you disable chat history?

They are literally the worst major provider in terms of privacy for consumer paid service.

theknarf•4mo ago

The current problem with Gemini 2.5 Pro is not that it's not intelligent or can't one-shot problems, the problem is that it's _terrible_ at tool calling and wastes most of its context on trying to correct itself after mistaken tool calls. If they can solve that with 3.0 then they may have a useful model for agentic coding; if not, it's not keeping up with Anthropic and OpenAI.

XCSme•4mo ago

I wanted to use Gemini 2.5 Pro, but couldn't, because their structured_data response is broken (doesn't support all JSON properties, or often simply returns garbage).
