It addresses the “longer thinking leads to worse results” problem by exploring multiple paths of thinking in parallel, with each path not thinking as long.
Isn’t the compute effort N times as expensive, where N is the number of agents? Unless you meant in terms of time (and even then, I guess it’d be the slowest of the N agents).
You also run into context issues and quality degradation the longer you go.
(this is assuming gemini uses a traditional arch, and not something special regarding attention)
Before this, even the best "math" models were RL'd to death to only solve problems. If you wanted one to explore "method_a" of solving a problem you'd be SoL. The model would start like "ok, the user wants me to explore method_a, so here's the solution: blablabla", doing whatever it wanted, unrelated to method_a.
Similar things for gathering multiple sources. Only recently can models actually pick the best thing out of many instances, and work effectively at large context lengths. The previous tries with 1M context lengths were at best gimmicks, IMO. Gemini 2.5 seems the first model that can actually do useful stuff after 100-200k tokens.
I don't think this is correct, especially given MoE. You can save some memory bandwidth by reusing model parameters, but that's about it. It's not giving you the same speed as a single query.
Just running the model multiple times on the same input and selecting the best response (according to some judgement) seems a bit of a haphazard way of getting much diversity of response, if that is really all it is doing.
There are multiple alternate approaches to sampling different responses from the model that come to mind, such as:
1) "Tree of thoughts" - generate a partial response (e.g. one token, or one reasoning step), then generate branching continuations of each of those, etc, etc. Compute would go up exponentially according to number of chained steps, unless heavy pruning is done similar to how it is done for MCTS.
2) Separate response planning/brainstorming from response generation by first using a "tree of thoughts"-like process just to generate some shallow (e.g. depth < 3) alternate approaches, then use each of those approaches as additional context to generate one or more actual responses (to then evaluate and choose from). Hopefully this would result in some high-level variety of response without the cost of just generating a bunch of responses and hoping that they are usefully diverse. (A rough sketch of this is below.)
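A minimal sketch of idea (2), purely to make the shape concrete. The generate() and judge() helpers here are hypothetical stand-ins for whatever chat API and scoring you'd actually use; they are not part of any real library.

    # Sketch: brainstorm shallow approaches, then generate and judge full responses.
    # generate() and judge() are hypothetical placeholders, not a real API.

    def generate(prompt: str, n: int = 1) -> list[str]:
        """Hypothetical wrapper around a chat model: return n sampled completions."""
        return [f"<completion {i} for: {prompt[:40]}...>" for i in range(n)]

    def judge(question: str, answer: str) -> float:
        """Hypothetical scorer, e.g. another model call returning a quality score."""
        return len(answer) / 1000.0  # placeholder heuristic

    def answer_with_brainstorming(question: str, n_approaches: int = 3, n_per_approach: int = 2) -> str:
        # Step 1: shallow brainstorming - ask only for distinct high-level approaches.
        approaches = generate(
            f"List {n_approaches} distinct high-level approaches to: {question}. "
            "One short paragraph each, no full solutions.",
            n=1,
        )[0].split("\n")[:n_approaches]

        # Step 2: generate full responses conditioned on each approach.
        candidates = []
        for approach in approaches:
            candidates += generate(
                f"Question: {question}\nFollow this approach strictly: {approach}\nAnswer:",
                n=n_per_approach,
            )

        # Step 3: pick the best candidate according to the judge.
        return max(candidates, key=lambda ans: judge(question, ans))

    print(answer_with_brainstorming("Prove that the sum of two even numbers is even."))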
> Deep Think pushes the frontier of thinking capabilities by using parallel thinking techniques. This approach lets Gemini generate many ideas at once and consider them simultaneously, even revising or combining different ideas over time, before arriving at the best answer.
This doesn't exclude the possibility of using multiple agents in parallel, but to me it doesn't necessarily mean that this is what's happening, either.
So if someone is cool enough, they could actually give us a DeepThought model?
Please, let that happen.
Vendor-DeepThought-42B maybe?
Yes, but the response time is terrible. 7.5 million years
If fixed set means fixed number, it would be nice to know how many.
Otherwise I would like to know what fixed set means here.
Apparently the model will think for 30+ minutes on a given prompt. So it seems it's more for research or dense multi-faceted problems than for general coding or writing fan fic.
OpenAI is the only one that says "practically unlimited" and I have never hit any limit on my ChatGPT Pro plan. I hit limits on Claude Max (both plans) several times.
Why are these companies not upfront about what the limits are?
Most likely because they reserve the right to dynamically alter the limits in response to market demands or infrastructure changes.
See, for instance, the Ghibli craze that dominated ChatGPT a few months ago. At the time OpenAI had no choice but to severely limit image generation quotas, yet today there are fewer constraints.
tldr: We can't have nice things, because we are assholes.
A fair pricing model would be token-based, so that a user can see for each query how much it costs, and only pay for what they actually used. But AI companies want a steady stream of income, and they want users to pay as much as possible while using as little as possible. Therefore they ask for a monthly or even yearly price with an unknown number of tokens included, such that you will always pay more than with token-based payments.
In most cases, at least Claude does for sure. So yeah, for now, they're losing money anyway.
Claude Code gets the most out of Anthropic’s models, that’s why people love it.
Conversely, Gemini CLI makes Gemini Pro 2.5 less capable than the model itself actually is.
It’s such a stark difference that I’ve given up on Gemini CLI even though it’s free, but I still use the model on a regular basis for situations amenable to a prompt interface. It’s a very strong model.
But if I give it structure so it can write its own context, it is truly astonishing.
I'll describe my big, general task and tell it to first read the codebase and then write a detailed requirements document, and not to change any code.
Then I'll tell it to read the codebase and the detailed requirements document it just wrote, and then write a detailed technical spec with API endpoints, params, pseudocode for tricky logic, etc.
Then I'll tell it to read the codebase, and the requirements document it just wrote, and the tech spec it just wrote, and decomp the whole development effort into weekly, daily and hourly tasks to assign to developers and save that in a dev plan document.
Only then is it ready to write code.
And I tell it to read the code base, requirements, tech spec and dev plan, all of which it authored, and implement Phase 1 of the dev plan.
It's not all mechanical and deterministic, or I could just script the whole process. Just like with a team of junior devs, I still need to review each document it writes, tweak things I don't like, or give it a better prompt to reflect my priorities that I forgot to tell it the first time, and have it redo a document from scratch.
But it produces 90% or more of its own context. It ingests all that context that it mostly authored, and then just chugs along for a long time, rarely going off the rails anymore.
It's not yet available via an API.
It’s just bizarrely uncompetitive with o3-pro and Grok 4 Heavy. Anecdotally (from my experience) this was the one feature that enthusiasts in the AI community were interested in to justify the exorbitant price of Google’s Ultra subscription. I find it astonishing that the same company providing free usage of their top models to everybody via AI Studio is nickel-and-diming their actual customers like that.
Performance-wise, so far I couldn’t even tell. I provided it with a challenging organizational problem that my business was facing, with the relevant context, and it proposed a lucid and well-thought-out solution that was consistent with our internal discussions on the matter. But o3 came to an equally effective conclusion for a fraction of the cost, even if it was a less “cohesive” report. I guess I’ll have to wait until tomorrow to learn more.
If it's CapEx it's -- by definition -- not a cost to run. Energy costs will trend to zero.
After that it's bumped down to Flash, which is surprisingly effective in Gemini CLI.
If I need Pro, I just swap in an API from an account with billing enabled, but usually 100 requests is enough for a day of work.
Underpriced for consumers, overpriced for businesses.
I agree that’s not a good posture, but it is entirely unsurprising.
Google is probably not profiting from AI Ultra customers either, and grabbing all that sweet usage data from the free tier of AI Studio is what matters most to improve their models.
Giving free access to the best models allows Google to capture market share among the most demanding users, which are precisely the ones that will be charged more in the future. In a certain sense, it’s a great way for Google to use its huge idle server capacity nowadays.
I doubt I'm an isolated case. This Gemini gig will cost Google a lot; they pushed it on all Android phones around the globe. I can't wait to see what happens when they have to admit that not many people will pay over 20 bucks for "AI", and I would pay well over 20 bucks just to see the face of the C-suite next year when someone dares to explain in simple terms that there is absolutely no way to recoup the DC investment and that powering the whole thing will cost the company 10 times that.
FWIW, Google seems to be having some severe issues with oddball, perhaps malfunctioning quota systems. I regularly find that extraordinarily little use of gemini-cli hits the purported 1,000-request limit, when in reality I've made fewer than 10 requests.
Also, I noticed Gemini (even Flash) has Google Search support, but only via the web UI or the native mobile app. Via the API, that would require SERP via an MCP of some sort, even with Gemini Pro.
Oh, and some models regularly face outages; 503s are not uncommon. No SLA page, no alerts, nothing whatsoever.
The reasoning feature is buggy: even when disabled, it sometimes triggers anyway.
It occurred to me the other day that Google probably has the best engineers, given how well Gemini performs and where it's coming from, and the context window that is uniquely large compared to any other model. But it is likely operated by managers coming from AWS, where shipping half-baked, barely tested software was all it took to get a bonus.
But yes, Google should have figured that out and used a less expensive mode of reasoning.
And a question to the knowledgeable: does a simple/stupid question cost more in terms of resources than a complex problem, in terms of power consumption?
"Here I am, brain the size of a planet, and they ask me to ..."
Even proof mining and the Harrop formula have to exclude disjunction and existential quantification to stay away from intuitionist math.
IID in PAC/ML implies PEM which is also intentionally existential quantification.
This is the most gentle introduction I know of, but remember LLMs are fundamentally set shattering, and produce disjoint sets also.
We are just at reactive model based systems now, much work is needed to even approach this if it ever is even possible.
[0] https://www.cmu.edu/dietrich/philosophy/docs/tech-reports/99...
I see this semi-regularly: futile attempts at handwaving away the obvious intelligence by some formal argument that is either irrelevant or inapplicable. Everything from thermodynamics — which applies to human brains too — to information theory.
Grey-bearded academics clinging to anything that might float to rescue their investment into ineffective approaches.
PS: This argument seems to be that LLMs “can’t think ahead” when all evidence is that they clearly can! I don’t know exactly what words I’ll be typing into this comment textbox seconds or minutes from now but I can — hopefully obviously — think intelligent thoughts and plan ahead.
PPS: The em-dashes were inserted automatically by my iPhone, not a chat bot. I assure you that I am a mostly human person.
Well, I don't think it's easy or even generally possible to recognize a problem's complexity. Imagine you ask for a solution to a simply expressed statement like: find an n > 2 where z^n = x^n + y^n. The answer you receive will be based on a model trained on this well-known problem, but if it's not in the model it could be impossible to measure its complexity.
This is why model pickers persist despite no one liking them.
Once you've given the model your prompt and are reading the first output token for classification, you've already paid most of the cost of just prompting it directly.
That said, there could definitely be exceptions for short prompts where output costs dominate input costs. But these aren't usually the interesting use cases.
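To put illustrative numbers on it (assumed prices, not any provider's actual rates): a 20,000-token prompt at $1.25 per million input tokens costs about 2.5 cents to read, whether you're classifying it or answering it, while a 1,000-token answer at $10 per million output tokens adds about 1 cent. Running a classification pass first therefore nearly doubles the total for long prompts; only when prompts are tiny and outputs dominate does the pre-classification pass become cheap by comparison.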
I don't disagree that there are hard problems which use short prompts, like math homework problems etc., but they mostly aren't what I would categorize as "real work". But of course I can only speak to my own experience /shrug.
Source? Afaik this is incorrect.
Caching does help the situation, but you always at least pay the initial cache write. And prompts need to be structured carefully to be cacheable. It’s not a free lunch.
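As a rough illustration of "structured carefully to be cacheable" (a generic sketch; exact rules differ by provider, and some require explicit cache markers rather than doing this automatically): keep the large, stable material in an identical prefix and put anything that changes per request at the end, since caches typically only match exact prefixes.

    # Sketch: cache-friendly message ordering (assumed, provider-agnostic).
    # Caches generally match only an exact, unchanged prefix, so stable content goes first.

    LARGE_STATIC_CONTEXT = "...many thousands of tokens of reference material..."  # identical every request

    def build_messages(user_question: str) -> list[dict]:
        return [
            # Stable prefix: system prompt + big reference context, byte-for-byte identical
            {"role": "system", "content": "You answer questions about the attached docs."},
            {"role": "user", "content": LARGE_STATIC_CONTEXT},
            # Variable suffix: only this part changes, so only it misses the cache
            {"role": "user", "content": user_question},
        ]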
I've been working on a challenging problem all this week and all the AI copilot models are worthless at helping me. Mastery in coding is being alone when nobody else nor AI copilots can help you, and you have to dig deep into generalization, synthesis, and creativity.
(I thought to myself, at least it will be a little while longer before I'm replaced with AI coding agents.)
Probably the starkest example of this is build system stuff: it's really obvious which ones have seen a bunch of `nixpkgs`, and even the best ones seem to really struggle with Bazel and sometimes CMake!
The absolute prestige high-end ones, running flat out and burning 100+ dollars a day, are a lift on pre-SEO Google/SO, I think... but it's not like a blowout vs. a working search index. Back when all the source, all the docs, and all the troubleshooting for any topic on the whole Internet were all above the fold on Google? It was kinda like this: type a question in the magic box and working-ish code pops out. Same at a glory-days FAANG with the internal mega-grep.
I think there's a whole cohort or two who think that "type in the magic box and code comes out" is new. It's not new, we just didn't have it for 5-10 years.
It's somewhat easier to perceive the lack of creativity with Stable Diffusion. I'm not talking about the missing-limb or extra-finger glitches. With a bit of experience looking through generated images, our brain eventually perceives the absolute lack of creativity; an artist could probably spot it without prior experience with generative AI pieces. With LLMs it takes a bit longer.
Anecdotal and baseless, I guess. Papers have been published; some researchers in the sciences couldn't get the best LLMs to solve any unsolved problem. I recently came across a paper stating bluntly that all LLMs tested were unable to conceptualize or derive laws that generalize whatsoever, e.g. formulas.
We are being duped, it doesn't help selling $200 monthly subscriptions - soon for even more - if marketers admitted there is absolutely zero reasoning going on with these stochastic machines on steroids.
I deeply wish the circus ends soon, so that we can start focusing on what LLMs are excellent and well fitted to do, better and faster than humans.
Creative it is not.
Thus, AI is a great productivity tool for the overwhelming majority of problems out there, if you know how to use it. And it's a boost even for those who are not good at the craft.
This whole narrative of "okay, but it can't replace me in this or that situation" is honestly somewhere between an obvious touché (why would you think AI would replace rather than empower those who know their craft?) and stale Luddism.
Even IF that were true (and I'd argue that it is NOT, and it's people who believe that and act that way who produce the tangled messes of spiderweb code that are utterly opaque to public searches and AI analysis -- the supposed "1%"), if even as low as 1% of the code I interacted with was the kind of code that required really deep thought and analysis, it could easily balloon to take up as much time as the other "99%".
Oh, and Ned Ludd was right, by the way. Weavers WERE replaced by the powered loom. It is in the interest of capital to replace you if they are able to, not to complement you, and furthermore, the teeth of capital have gotten sharper over time, and its appetite more voracious.
In my experience Grok 4 and 4 Heavy have been crap. Who cares how many requests you get with it when the response is terrible. Worst LLM money I’ve spent this year and I’ve spent a lot.
OpenAI reasoning models (o1-pro, o3, o3-pro) have been the strongest, in my experience, at harder problems, like finding race conditions in intricate concurrency code, yet they still lag behind even the initial sonnet 3.5 release for writing basic usable code.
The OpenAI models are kind of like CS grads who can solve complex math problems but can't write a decent React component without yadda-yadda-ing half of it, while the Anthropic models will crank out many files of decent, reasonably usable code while frequently missing subtleties and forgetting the bigger picture.
The line where chatbots stop being sensible and start outputting garbage is moving, but more slowly than the average Joe would guess. You only notice it when you get an intuition of the answer before you see it, which requires a lot of experience across a range of complexity. Persistent newbies are the best spotters, because they ask obvious basic questions while also asking for stuff beyond what geniuses could solve, and only by getting garbage answers and enduring the process of realizing it's actually garbage do they build a wider picture of AI than even most power users, who tend to have more balanced queries.
But the same doesn’t happen with other tools. I’ll give the same exact prompt to all of the LLMs I have access to and look at the responses for the best one. Grok is consistently the worst. So if it’s garbage in, garbage out, why are the other ones so much better at dealing with my garbage?
I think the primary concern of this industry right now is how, relative to the current latest generation models, we simultaneously need intelligence to increase, cost to decrease, effective context windows to increase, and token bandwidths to increase. All four of these things are real bottlenecks to unlocking the "next level" of these tools for software engineering usage.
Google isn't going to make billions on solving advanced math exams.
I'll hazard to say that cost and context windows are the two key metrics to bridge that chasm with acceptable results... As for software engineering, that cohort will be demanding on all fronts for the foreseeable future, especially because there's a bit of a competitive element. Nobody wants to be the vibecoder using sub-par tools compared to everyone else showing off their GitHub results and making sexy blog posts about it on HN.
For example, a chat bot doing recipe work should have a RAG DB that, by default, returns entire recipes. A vector DB is actually not the solution here, any number of traditional DBs (relational or even a document store) would work fine. Sure do a vector search across the recipe texts, but then fetch the entire recipe from someplace else. Current RAG solutions can do this, but the majority of RAG deployments I have seen don't bother, they just abuse large context windows.
Which looks like it works, except what you actually have in your context window is 15 different recipes all stitched together. Or if you put an entire recipe book into the context (which is perfectly doable now days!), you'll end up with the chatbot mixing up ingredients and proportions between recipes because you just voluntarily polluted its context with irrelevant info.
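A minimal sketch of the "vector search to rank, then fetch the whole recipe" pattern described above. The embed() helper and the in-memory stores are placeholders; any real vector index and document store would do.

    # Sketch: use vector search only for ranking, then pull the full document
    # from a conventional store, so the context holds whole recipes, not fragments.
    # embed() is a hypothetical placeholder for whatever embedding model you use.

    import hashlib

    def embed(text: str) -> list[float]:
        """Hypothetical embedding call (placeholder: derive 4 floats from a hash)."""
        return [b / 255 for b in hashlib.sha256(text.encode()).digest()[:4]]

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb or 1.0)

    # Whole recipes live in a conventional store (relational DB, document store, ...).
    recipe_store = {
        "carbonara": "Full carbonara recipe: ingredients, steps, proportions...",
        "pad_thai": "Full pad thai recipe: ingredients, steps, proportions...",
    }

    # The vector index is used only to decide WHICH recipe is relevant.
    vector_index = {rid: embed(text) for rid, text in recipe_store.items()}

    def build_context(question: str, top_k: int = 1) -> str:
        q = embed(question)
        ranked = sorted(vector_index, key=lambda rid: cosine(q, vector_index[rid]), reverse=True)
        # Fetch the ENTIRE recipe for the top hits instead of stitching chunks together.
        return "\n\n".join(recipe_store[rid] for rid in ranked[:top_k])

    print(build_context("How do I make carbonara?"))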
Large context windows allow for sloppy practices that end up making for worse results. Kind of like when we decided web servers needed 16 cores and gigs of RAM to run IBM Websphere back in the early 2000s, to serve up mostly static pages. The availability of massive servers taught bad habits (huge complicated XML deployment and configuration files, oodles of processes communicating with each other to serve a single page, etc).
Meanwhile, in the modern world, I've run mission-critical, high-throughput services for giant companies on a K8s cluster consisting of 3 machines, each with 0.25 CPU and a couple hundred megs of RAM allocated.
Sometimes more is worse.
If you believe that an LLM is a digital brain, then it follows that their limitation in capabilities today are a result of their limited characteristics (namely: coherent context windows). If we increase context windows (and intelligence), we can simply pack more data into the context, ask specific questions, and let the LLM figure it out.
However, if you have a more grounded belief that, at best, LLMs are just one part of a more heterogeneous digital brain, it follows that maybe their limitations are actually a result of how we're feeding them data: we need to be smarter about context engineering, we need to do roundtrips with the LLM to narrow down what the context should be, and it needs targeted context to maximize the quality of its output.
The second situation feels so much harder, but more likely. IMO: This fundamental schism is the single reason why ASI won't be achieved on any timeframe worth making a prediction about. LLMs are just one part of the puzzle.
Specialized tools rock.
1. Embedded in the parameters
2. Within the context window
We all talk a lot about #2, but until we get a really good grip on #1, I think we as a field are going to hit a progress wall.
The problem is we have not been able to separate out knowledge embedded in parameters with model capability, famously even if you don't want a model to write code, throwing a bunch of code at a model makes it a better model. (Also famously, even if someone never grows up to work with math day to day, learning math makes them better at all sorts of related logical thinking tasks.)
Also there is plenty of research showing performance degrades as we stuff more and more into context. This is why even the best models have limits on tool call performance when naively throwing 15+ JSON schemas at it. (The technique to use RAG to determine which tool call schema to feed into the context window is super cool!)
If I'm asking Sonnet to agentically make this signin button green: does it really matter that it can also write haikus about the Japanese landscape? That links back to your point: we don't have a grip, nearly at all, on how much this crosstalk between problem domains matters. Maybe it actually does matter? But certainly most of it doesn't.
We're so far from the endgame on these technologies. A part of me really feels like we're wasting too much effort and money on training ASI ultra internet scale models. I'm never going to pay $200+/mo for even a much smarter Claude; what I need is a system that knows my company's code like the back of its hand, knows my company's patterns, technologies, and even business (Jira boards, Google docs, etc), and extrapolates from that. That would be worth thousands a month; but what I'm describing isn't going to be solved by a 195 IQ gigabrain, and it also doesn't feel like we're going to get there with context engineering.
I would imagine 95% of people never get anywhere near to hitting their CC usage. The people who are getting rate-limited have ten windows open, are auto-accepting edits, and YOLO'ing any kind of coherent code quality in their codebase.
It would be more interesting to know if it can handle problems that o3 can't do, or if it is 'correct' more often than o3 pro on these sort of problems.
i.e. if o3 is correct 90% of the time, but Deep Think is correct 91% of the time on challenging organisational problems, it will be worth paying $250 for an extra 1% certainty (assuming the problem is high-value / high-risk enough).
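As a made-up illustration of that threshold: if a wrong call on one of these decisions costs, say, $50,000, then a 1-percentage-point drop in error rate is worth about $500 in expectation per decision (0.01 × $50,000), so a $250 price clears the bar; if a bad answer only costs a few hundred dollars, it doesn't.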
Suppose it can't. How will you know? All the datapoints will be "not particularly interesting".
By finding and testing problems that o3 can't do on Deep Think, and also testing the reverse? Or by large benchmarks comparing a whole suite of questions with known answers.
Problems that both get correct will be easy to find and don't say much about comparative performance. That's why some of the benchmarks listed in the article (e.g. Humanity's Last Exam / AIME 2025) are potentially more insightful than one person's report on testing one question (which they don't provide) where both models replied with the same answer.
What happened to the simplicity of Steve Jobs' 2x2 (consumer vs.pro, laptop vs. desktop)?
Gemini is consistently the only model that can reason over long context in dynamic domains for me. Deep Think just did that reviewing an insane amount of Claude Code logs - for a meta analysis task of the underlying implementation. Laughable to think Grok could do that.
Is Google using this tool internally? One would expect them to give some examples of how it's helping internal teams accelerate or solve more challenging problems, if they were eating their own dogfood.
Bonus 1: Use any combination of models. Mix n match models from any lab.
Bonus 2: Serve your custom consortium on a local API from a single command using the llm-model-gateway plugin and use it in your apps and coding assistants.
https://x.com/karpathy/status/1870692546969735361
> uv tool install llm
llm install llm-consortium
llm consortium save gthink-n5 -m gemini-pro -n 5 --arbiter gemini-flash --confidence-threshold 99 --max-iterations 4
llm serve --host 0.0.0.0
curl http://0.0.0.0:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "gthink-n5",
"messages": [{"role": "user", "content": "find a polynomial algorithm for graph-isomorphism"}]
}'
You can also build a consortium of consortiums like so: llm consortium save gem-squared -m gthink-n5 -n 2 --arbiter gem-flash
Or even make the arbiter a consortium: llm consortium save gem-cubed -m gthink-n5 -n 2 --arbiter gthink-n5 --max-iteration 2
or go openweights only: llm consortium save open-council -m qwen3:2 -m kimi-k2:2 -m glm-4.5:2 -m mistral:2 --arbiter minimax-m1 --min-iterations 2 --confidence-threshold 95
https://GitHub.com/irthomasthomas/llm-consortium
2. The correlated errors thing is real, though I'd argue it's not always a dealbreaker. Sometimes you want similar models for consistency, sometimes you want diversity for coverage. The plugin lets you do either - mix Claude with kimi and Qwen if you want, or run 5 instances of the same model. The "right" approach probably depends on your use case.
Here's Gemini Deep Think when prompted with:
"Create a svg of a pelican riding on a bicycle"
https://www.svgviewer.dev/s/5R5iTexQ
Beat Simon Willison to it :)
The bike is an actual bike with a diamond frame.
Time for a leaderboard?
Pelican on a bike -> some dude (from these labs) creates it, and it becomes part of the training data.
Pelican:
https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....
Longer thread re gpt5:
https://old.reddit.com/r/OpenAI/comments/1mettre/gpt5_is_alr...
It's kind of fun to imagine that there is an intern in every AI company furiously trying to get nice looking svg pelicans on bicycles.
I highly doubt that in 10 years people will be waiting 30m for answers of this quality, whether due to the software side, the hardware side, and/or scaling.
It's possible in 10 years, the cost you pay is still comparable, but I doubt the time will be 30m.
It's also possible that there's still top-tier models like this that use absurd amounts of resources (by today's standards) and take 30m - but they'd likely be at a much higher quality than today's.
Pro is frustrating because it too often won't search to find current information, and just gives stale results from before its training cutoff. Flash doesn't do this much anymore.
For coding I use Pro in Gemini CLI. It is amazing at coding, but I'm actually using it more to write design docs, decomp multi-week assignments down to daily and hourly tasks, and then feed those docs back to Gemini CLI to have it work through each task sequentially.
With a little structure like this, it can basically write its own context.
Same, I think Pro also got worse...
Are they aggressively quantizing, or are our expectations silently increasing?
I've found that it hallucinates tool use for tools that aren't available and then gets very confident about the results.
Speaking of Sonnet, I feel like it's closing the gap to Opus. After the new quotas I started to try it before Opus and now it gets complex things right more often than not. This wasn't my experience just a couple of months ago.