Maybe it's just the kind of work I'm doing (a lot of web development with HTML/SCSS), and Google has crawled the internet, so they have more data to work with.
I reckon different models are better at different kinds of work, but Gemini is pretty excellent at UI/UX web development, in my experience
Very excited to see what 3.0 is like
You need to give it detailed instructions and be willing to do the plumbing yourself, but we've found it to be very good at it
I default to using ChatGPT since I like the Projects feature (missing from Gemini I think?).
I occasionally run the same prompts in Gemini to compare. A couple notes:
1) Gemini is faster to respond in 100% of cases (most of my prompts kick ChatGPT into thinking mode). ChatGPT is slow.
2) The longer thinking time doesn’t seem to correlate with better quality responses. If anything, Gemini provides better quality analyses despite shorter response time.
3) Gemini (and Claude) are more censored than ChatGPT. Gemini/Claude often refuse medical-related prompts, while ChatGPT will answer.
I went back to the censored chat I mentioned earlier, and got it to give me an answer when adding "You are a lifestyle health coach".
* Creative writing: Gemini is the unmatched winner here by a huge margin. I would personally go so far as to say Gemini 2.5 Pro is the only borderline kinda-sorta usable model for creative writing if you squint your eyes. I use it to criticize my creative writing (poetry, short stories) and no other model understands nuances as much as Gemini. Of course, all models are still pretty much terrible at this, especially in writing poetry.
* Complex reasoning (e.g. undergrad/grad level math): Gemini is the best here imho by a tiny margin. Claude Opus 4.1 and Sonnet 4.5 are pretty close but imho Gemini 2.5 writes more predictably correct answers. My bias is algebra stuff, I usually ask things about commutative algebra, linear algebra, category theory, group theory, algebraic geometry, algebraic topology etc.
On the other hand Gemini is significantly worse than Claude and GPT-5 when it comes to agentic behavior, such as searching a huge codebase to answer an open ended question and write a refactor. It seems like its tool calling behavior is buggy and doesn't work consistently in Copilot/Cursor.
Overall, I still think Gemini 2.5 Pro is the smartest overall model, but of course you need to use different models for different tasks.
It doesn't perform nearly as well as Claude or even Codex for my programming tasks though
The other big use-case I like Gemini for is summarizing papers or teaching me scholarly subjects. Gemini's more verbose than GPT-5, which feels nice for these cases. GPT-5 strikes me as terrible at this, and I'd also put Claude ahead of GPT-5 in terms of explaining things in a clear way (maybe GPT-5 could meet what I expect better though with some good prompting)
Joking obviously but I've noticed this too, I put up with it because the output is worth it.
But yeah it does do that otherwise. At one point it told me I'm a genius.
It isn't Gemini (the product, those are different orgs) though there may (deliberately left ambiguous) be overlap in LLM level bytes.
My recommendation for you in this use-case comes from the fact that AI Mode is a product built to be a good search engine first, presented to you in the interface of an AI chatbot, rather than Gemini (the app/site), which is an AI chatbot that had search tooling added to it later (like its competitors).
AI Mode does many more searches (in my experience) for grounding and synthesis than Gemini or ChatGPT.
How often do you encounter loops?
I have used Pro Mode in ChatGPT since it became available, and have tried Claude, Gemini, DeepSeek and more from time to time, but none of them ever get close to Pro Mode; it's just insanely better than everything.
So when I hear people comparing "X to ChatGPT", are you testing against the best ChatGPT has to offer, or are you comparing it to "Auto" and calling it a day? I understand people not testing their favorite models against Pro Mode as it's kind of expensive, but it would really help if people actually gave some more concrete information when they say "I've tried all the models, and X is best!".
(I mainly do web dev, UI and UX myself too)
I am, continuously, and have been since ChatGPT Pro appeared.
- Convert the whole codebase into a string
- Paste it into Gemini
- Ask a question
People seem to be very taken with "agentic" approaches where the model selects a few files to look at, but I've found it very effective and convenient just to give the model the whole codebase, and then have a conversation with it, get it to output code, modify a file, etc.
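For what it's worth, the "convert the whole codebase into a string" step can be a tiny script. Here's a minimal sketch in TypeScript/Node; the extension list, the skipped directories, and the path-header format are my own assumptions, not something from the comment above:

    // Walk the repo and concatenate source files into one big string for pasting.
    import { readdirSync, readFileSync, statSync } from "fs";
    import { join } from "path";

    function collectSource(dir: string, exts = [".ts", ".js", ".html", ".scss"]): string {
      let out = "";
      for (const name of readdirSync(dir)) {
        const path = join(dir, name);
        if (statSync(path).isDirectory()) {
          if (name === "node_modules" || name === ".git") continue; // skip vendored/VCS dirs
          out += collectSource(path, exts);
        } else if (exts.some((e) => name.endsWith(e))) {
          // Prefix each file with its path so the model can refer to files by name.
          out += `\n===== ${path} =====\n` + readFileSync(path, "utf8");
        }
      }
      return out;
    }

    console.log(collectSource(process.cwd())); // pipe to your clipboard, then paste into Gemini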
Then, for each subsequent conversation, I would ask the model to use this summary file as a reference.
The overall idea is the same, but going through an intermediate file allows for manual amendments in case the model consistently forgets some things; it also gives the model an easier time finding information and reasoning about the codebase in a pre-summarized format.
It's sort of like giving the model rich metadata and an index of the codebase instead of dumping the raw data on it.
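A rough sketch of what "use this file as reference" can look like in practice; the file name CODEBASE_SUMMARY.md and the prompt wording are illustrative assumptions, not the commenter's actual setup:

    import { readFileSync } from "fs";

    // Model-generated index/summary of the codebase, amended by hand wherever the
    // model kept forgetting details.
    const summary = readFileSync("CODEBASE_SUMMARY.md", "utf8");
    const question = process.argv[2] ?? "Where is the session token validated?";

    // Each new conversation is seeded with the summary instead of the raw source dump.
    const prompt =
      "Use the following codebase summary as your reference.\n\n" +
      summary +
      "\n\nQuestion: " + question;
    console.log(prompt);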
Also, use Google AI Studio, not the regular Gemini plan, for the best results; you'll have more control over the output.
I "grew up", as it were, on StackOverflow, when I was in my early dev days and didn't have a clue what I was doing I asked question after question on SO and learned very quickly the difference between asking a good question vs asking a bad one
There is a great Jon Skeet blog post from back in the day called "Writing the perfect question" - https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-...
I think this is as valid as ever in the age of AI, you will get much better output from any of these chatbots if you learn and understand how to ask a good question.
For writing and editorial work, I use Gemini 2.5 Pro (Sonnet seems simply worse, while GPT-5 is too opinionated).
For coding, Sonnet 4.5 (usually).
For brainstorming and background checks, GPT-5 via ChatGPT.
For data extraction, GPT-5. (Seems to be the best at this "needle in a haystack" kind of task.)
However if you get the hang of it, it can be very powerful
Between the two, 100% of my code is written by AI now, and has been since early July. Total gamechanger vs. earlier models, which weren't usable for the kind of code I write at all.
I do NOT use either as an "agent." I don't vibe code. (I've tried Claude Code, but it was terrible compared to what I get out of GPro 2.5.)
But the past few days I started getting an "AI Mode" in Google Search that rocks. Way better than GPT-5 or Sonnet 4.5 for figuring out things and planning. And I've been using it without my account (weird, but I'm not complaining). Maybe this is Gemini 3.0. I would love for it to be good at coding. I'm near limits on my Anthropic and OpenAI accounts.
Somewhat amusing 4th wall breaking if you open Python from the terminal in the fake Windows. Examples:
1. If you try to print something using the "Python" print keyword, it opens a print dialog in your browser.
2. If you try to open a file using the "Python" open keyword, it opens a new browser tab trying to access that file.
That is, it's forwarding the print and open calls to your browser.
    } else if (mode === 'python') {
      if (cmd === 'exit()') {
        mode = 'sh';
      } else {
        try {
          // Safe(ish) eval for demo purposes.
          // In production, never use eval. Use a JS parser library.
          // Mapping JS math to appear somewhat pythonesque
          let result = eval(cmd);
          if (result !== undefined) output(String(result));
        } catch (e) {
          output(`Traceback (most recent call last):\n File "<stdin>", line 1, in <module>\n${e.name}: ${e.message}`, true);
        }
      }
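Concretely, since the "Python" input is just eval'd as JavaScript in the page (as the snippet above shows), Python built-ins resolve to browser globals. A tiny illustration of the effect, not taken from the site's code:

    // "print" resolves to window.print and "open" to window.open, hence the 4th-wall break.
    eval('print("hello")');    // opens the browser's print dialog
    eval('open("notes.txt")'); // opens a new tab trying to load notes.txt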
In the Gemini app 2.5 Pro also regularly repeats itself VERBATIM after explicitly being told not to multiple times to the point of uselessness.
It's my go-to coder; it just jibes better with me than Claude or GPT. Better than my home hardware can handle.
What I really hope for in 3.0: that their context length is really 1 million. In my experience, 256k is the real limit.
Based on what I'm hearing from friends who work at Google and are using it for coding, we're all going to be very disappointed.
Edit: It sounds like they don't actually have Gemini 3 access, which would explain why they aren't happy with it.
Source: I work at Google (on payments, not any AI teams). Opinions mine not Google's.
So I get ChatGPT to spec out the work as a developer brief, including suggested code, then I give it to Gemini to implement.
This has been the same for every single LLM I've used, ever, they're all terrible at that.
So terrible that I've stopped going beyond two messages in total. If it doesn't get it right on the first try, it's less and less likely to get it right with every message you add.
Better to always start fresh, iterate on the initial prompt instead.
More importantly, because of the way AI Studio does A/B testing, the only output we can get is for a single prompt, and I personally maintain that, beyond a basic read on speed, latency, and prompt adherence, output from a single prompt is not a good measure of day-to-day performance. It also, naturally, cannot tell us a thing about handling multi-file ingest and tool calls, but hype will be hype.
That there are people who are ranking alleged performance solely by one-prompt A/B testing output says a lot about how unprofessionally some evaluate model performance.
Not saying the Gemini 3.0 models couldn't be competitive; I just want to caution against getting caught up in over-excitement and possible disappointment. It's the same reason I dislike speculative content in general: it rarely gets put into proper context, because that isn't as eye-catching.