https://aistudio.google.com/prompts/new_chat?model=gemini-3-...
- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
[Edit: working for me now in ai studio]
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.
How come I can't even see prices without logging in... they doing regional pricing?
"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
"You've reached your rate limit. Please try again later."
Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.
when i signed up for billing via cloud console and entered my credit card, i got $300 "free credits".
i haven't thrown a difficult problem at gemini 3 pro yet, but i'm sure i got to see it in some of the A/B tests in aistudio for a while. i could not tell which model was clearly better, one was always more succinct and i liked its "style" but they usually offered about the same solution.
I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now.
> gemini-3-pro-preview-ais-applets
> gemini-3-pro-preview
For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output; Gemini 1.5 Pro was $1.25/M for input and $5/M for output.
https://goo.gle/enable-preview-features
-> https://github.com/google-gemini/gemini-cli/blob/release/v0....
--> https://goo.gle/geminicli-waitlist-signup
---> https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...
Honestly Google models have this mix of smart/dumb that is scary. Like if the universe is turned into paperclips, it'll probably be a Google model that does it.
* 1,500 RPD (free), then $35 / 1,000 grounded prompts
to
* 1,500 RPD (free), then (Coming soon) $14 / 1,000 search queries
It looks like the pricing changed from per-prompt (previous models) to per-search (Gemini 3)
2026: cure cancer
Standard Context (≤ 200K tokens)
Input: $2.00 vs $1.25 (Gemini 3 Pro input is 60% more expensive than 2.5)
Output: $12.00 vs $10.00 (Gemini 3 Pro output is 20% more expensive than 2.5)
Long Context (> 200K tokens)
Input: $4.00 vs $2.50 (same +60%)
Output: $18.00 vs $15.00 (same +20%)
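For anyone pricing this out, a quick back-of-the-envelope in Python (the per-million rates are the ones quoted above; the 50K-in / 5K-out request size is an invented illustrative workload):

```python
# Back-of-the-envelope cost comparison using the per-million-token prices
# quoted above (standard context, <= 200K tokens). The request size below
# is made up for illustration, not taken from the announcement.
PRICES_PER_MTOK = {
    "gemini-2.5-pro": {"input": 1.25, "output": 10.00},
    "gemini-3-pro": {"input": 2.00, "output": 12.00},
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

for model in PRICES_PER_MTOK:
    print(f"{model}: ${request_cost_usd(model, 50_000, 5_000):.4f} per request")
# gemini-2.5-pro: $0.1125, gemini-3-pro: $0.1600 -> roughly 42% more per request
```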
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society, and what this means for others. Meanwhile.. I'm still wondering how they're still getting this problem wrong.
edit: I've gotten a lot of good feedback here. I think there are ways I can improve my benchmark.
I could easily have made the same claim without the benchmark existing at all; for all I know, it doesn't exist.
Maybe if we weren't anonymous and your profile led to credentials showing experience in this field; otherwise I don't believe it without seeing/testing it myself.
If you make some assumptions about the species of the snake, it can count as a basic python benchmark ;)
I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.
IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.
This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!
it's easy to focus on what they can't do
Ask it to implement tic-tac-toe in Python for the command line. Or even just bring your own tic-tac-toe code.
Then make it imagine playing against you and it's gonna be fast and reliable.
Yeah I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 pro that was performing much better on the same tests several months ago vs. now.
GPT4/3o might be the best we will ever have
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.
I like to compare them using chathub using the same prompts
Gemini still calls me "the architect" in half of the prompts. It's very cringe.
> Gemini still calls me "the architect" in half of the prompts. It's very cringe.
Can't say I've ever seen this in my own chats. Maybe it's something about your writing style?

This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only for people to change their minds later when they've had more time to use it and learn its strengths and weaknesses. It takes time for people to really get to grips with a new model, not just a few prompt comparisons where luck and prompt selection will play a big role.
Is the only thing that prevents a benchmark from being meaningful publicity?
I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me.
I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?
Thanks
Why? This is not obvious to me at all.
But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.
Problem solving ability is largely not from the pretraining data.
I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization)
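A minimal sketch of what that could look like: each item is generated from a random seed, so the exact wording never appears in any training set, while the answer is computed programmatically. The problem template here is just an invented example:

```python
import random

def make_item(seed: int) -> dict:
    """One dynamically generated eval item with a programmatically known answer."""
    rng = random.Random(seed)
    crates, boxes, widgets = rng.randint(10, 99), rng.randint(2, 9), rng.randint(2, 9)
    removed = boxes * widgets
    question = (
        f"A warehouse holds {crates} crates. Each crate contains {boxes} boxes and "
        f"each box contains {widgets} widgets. A shipment removes {removed} widgets. "
        f"How many widgets remain in the warehouse?"
    )
    return {"question": question, "answer": crates * boxes * widgets - removed}

# A fresh 100-item eval that has never been published anywhere:
items = [make_item(seed) for seed in range(100)]
print(items[0]["question"], "->", items[0]["answer"])
```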
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.
While I still believe in the importance of a personalized suite of benchmarks, my python one needs to be down weighted or supplanted.
my bad to the google team for the cursory brush off.
> my python one needs to be down weighted or supplanted.
To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.
I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".
TLDR; I don’t think personal benchmarks should replace the official ones of course, but I think the former are invaluable for building your intuition about the rate of AI progress beyond hype.
I constantly see failures in trivial vector projections, broken bash scripts that don't properly quote variables (failing if there's a space in a filename), and a near complete inability to do relatively basic image processing tasks (if they don't rely on template matches).
I accidentally spent $50 on Gemini 2.5 Pro last week, with Roo, trying to make a simple Mock interface for some lab equipment. The result: it asks permission to delete everything it did and start over...
"Create me a SVG of a pelican riding on a bicycle"
So impressive it makes you wonder if someone has noticed it being used as a benchmark prompt.
Like replacing named concepts with nonsense words in reasoning benchmarks.
https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=...
Also, the full document:
https://archive.org/details/gemini-3-pro-model-card/page/n3/...
The questions AND the answers are public.
If the LLM manages through reasoning OR memory to repeat back the answer then they win.
The scores represent the % of correct answers they recalled.
You could question how well this works, but it’s not like the answers are just hanging out on the public internet.
The person also claims that with thinking on the gap narrows considerably.
We'll probably have 3rd party benchmarks in a couple of days.
Just go to the leaderboard website and see for yourself: https://arcprize.org/leaderboard
- GPT-5 medium is the best
- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster
Really wonder how well Gemini 3 will perform
Secondly, benchmarks are public data, and these models are trained on such large amounts of it that it would be impractical to ensure that some benchmark data is not part of the training set. And even if it's not, it would be safe to assume that engineers building these models would test their performance on all kinds of benchmarks, and tweak them accordingly. This happens all the time in other industries as well.
So the pelican riding a bicycle test is interesting, but it's not a performance indicator at this point.
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Temperature continues to be gated to maximum of 0.2, and there's still the hidden top_k of 64 that you can't turn off.
I love the google AI studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions on a tool that is ostensibly aimed at prosumers...
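A sketch of where those knobs live when calling the model directly with the google-genai Python SDK; whether the preview model actually honors them, rather than clamping server-side as described above, is an assumption on my part (so are the model name and key setup):

```python
# Sketch using the google-genai Python SDK. Whether the preview model honors
# these sampling fields (vs. clamping them server-side) is an assumption.
from google import genai
from google.genai import types

client = genai.Client()  # expects an API key in the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Write a limerick about rate limits.",
    config=types.GenerateContentConfig(
        temperature=1.0,   # not limited to 0.2 here
        top_p=0.95,
        top_k=64,          # explicit, rather than a hidden default
    ),
)
print(response.text)
```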
3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript:
[00:00] Greg: Hello.
[00:01] X: You great?
[00:02] Greg: Hi.
[00:03] X: I'm X.
[00:04] Y: I'm Y.
...
Super impressive!

- Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts
- Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.
- Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.
Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways.
Something like
1. Split audio into multiple smaller tracks.
2. Perform a first-pass audio extraction.
3. Find unique speakers and other potentially helpful information (maybe just a short summary of where the conversation left off).
4. Seed the next stage with that information (yay multimodality) and generate the audio transcript for it (rough sketch below).
Obviously it would be ideal if a model could handle the ultra long context conversations by default, but I'd be curious how much error is caused by a lack of general capability vs simple context pollution.
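A rough sketch of that staged approach in Python, assuming the google-genai SDK plus pydub for chunking; the chunk length, prompts, and carry-over summary format are all invented for illustration:

```python
# Rough sketch of the staged pipeline above, assuming the google-genai SDK
# and pydub (which needs ffmpeg installed). Chunk size, prompts, and the
# carry-over summary format are invented for illustration.
from google import genai
from google.genai import types
from pydub import AudioSegment

client = genai.Client()
MODEL = "gemini-3-pro-preview"
CHUNK_MS = 10 * 60 * 1000  # 10-minute chunks

def transcribe(path: str) -> str:
    audio = AudioSegment.from_file(path)
    carry_over = "This is the start of the recording; no prior context."
    pieces = []
    for start in range(0, len(audio), CHUNK_MS):
        chunk_bytes = audio[start:start + CHUNK_MS].export(format="mp3").read()
        prompt = (
            f"Transcribe this audio verbatim with speaker labels. Offset all "
            f"timestamps by {start // 1000} seconds.\n"
            f"Context carried over from the previous chunk: {carry_over}"
        )
        resp = client.models.generate_content(
            model=MODEL,
            contents=[types.Part.from_bytes(data=chunk_bytes, mime_type="audio/mp3"), prompt],
        )
        pieces.append(resp.text)
        # Steps 3-4: distill the speakers and where the conversation left off,
        # then seed the next chunk with that summary.
        carry_over = client.models.generate_content(
            model=MODEL,
            contents="List the speakers and summarize the last topic in two sentences:\n"
            + resp.text,
        ).text
    return "\n".join(pieces)
```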
Gemini 3 Pro DeepMind Page: https://deepmind.google/models/gemini/pro/
Developer blog: https://blog.google/technology/developers/gemini-3-developer...
Gemini 3 Docs: https://ai.google.dev/gemini-api/docs/gemini-3
Google Antigravity: https://antigravity.google/
Curious to see some third-party testing of this model. Currently it seems to primarily improve on "general non-coding and visual reasoning", based on the benchmarks.
Also, the confidence interval for such a small dataset is about 3 percentage points, so these differences could just be down to chance.
I have scripted prompts for long duration automated coding workflows of the fire and forget, issue description -> pull request variety. Sonnet 4 does better than you’d expect: it generates high quality mergable code about half the time. Sonnet 4.5 fails literally every time.
- slower compared to other models that will also do the job just fine (but excels at more complex tasks),
- it's very insistent on creating loads of .MD files with overly verbose documentation on what it just did (not really what I ask it to do),
- it actually deleted a file twice and went "oops, I accidentally deleted the file, let me see if I can restore it!", I haven't seen this happen with any other agent. The task wasn't even remotely about removing anything
And yes, I have hooks to disable 'git reset', 'git checkout', etc., and warn the model not to use these commands and why. So it writes them to a bash script and calls that to circumvent the hook, successfully shooting itself in the foot.
Sonnet 4.5 will not follow directions. Because of this, you can't prevent it, like you could with earlier models, from doing something that destroys the worktree state. For longer-running tasks the probability of it doing this at some point approaches 100%.
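A minimal sketch of that kind of guard, in the style of Claude Code's PreToolUse hooks (the exact payload shape is an assumption on my part); as noted above, substring matching like this is exactly what the model learns to route around via wrapper scripts:

```python
#!/usr/bin/env python3
# Hypothetical pre-execution guard: the proposed Bash command arrives as JSON
# on stdin and a non-zero exit rejects it. The payload shape is assumed, not
# taken from any official docs. The model can still defeat this by writing
# the command into a script and executing that instead.
import json
import sys

BANNED = ("git reset", "git checkout", "git restore", "git clean")

payload = json.load(sys.stdin)
command = payload.get("tool_input", {}).get("command", "")

if any(b in command for b in BANNED):
    print(f"Blocked: {command!r} rewrites worktree state. Commit or stash first.",
          file=sys.stderr)
    sys.exit(2)
sys.exit(0)
```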
Man I've had this exact thing happen recently with Sonnet 4.5 in Claude Code!
With Claude I asked it to try tweaking the font weight of a heading to put the finishing touches on a new page we were iterating on. Looked at it and said, "Never mind, undo that" and it nuked 45 minutes worth of work by running git restore.
It immediately realized it fucked up and started running all sorts of git commands and reading its own log trying to reverse what it did and then came back 5 minutes later saying "Welp I lost everything, do you want me to manually rebuild the entire page from our conversation history?"
In my CLAUDE.md I have instructions to commit unstaged changes frequently but it often forgets and sure enough, it forgot this time too. I had it read its log and write a post-mortem of WTF led it to run dangerous git commands to remove one line of CSS and then used that to write more specific rules about using git in the project CLAUDE.md, and blocked it from running "git restore" at all.
We'll see if that did the trick but it was a good reminder that even "SOTA" models in 2025 can still go insane at the drop of a hat.
I have a custom checkpointing skill that I've written that it is usually good about using, making it easier to rewind state. But that requires a careful sequence of operations, and I haven't been able to get 4.5 to not go insane when it screws up.
As I said though, watch out for it learning that it can't run git restore, so it immediately jumps to Bash(echo "git restore" >file.sh && chmod +x file.sh && ./file.sh).
Every model from every provider at every version I've used has intermingled brilliant perfect instruction-following and weird mistaken divergence.
In this case I can't get 4.5 to follow directions. Neither can anyone else, apparently. Search for "Sonnet 4.5 follow instructions" and you'll find plenty of examples. The current top 2:
https://www.reddit.com/r/ClaudeCode/comments/1nu1o17/45_47_5...
https://theagentarchitect.substack.com/p/claude-sonnet-4-pro...
Indeed. It's almost impossible to truly know a model before spending a few million tokens on a real world task. It will take a step-change level advancement at this point for me to trust anything but Claude right now.
I think it was just timing with the way models came out. This will be the first time I will have a Gemini subscription and nothing else. This will be the first time I really see what it can do fully.
So unless you love copy/pasting code, Gemini 2.5 was useless for agentic coding.
Great for taking its output and asking Sonnet to apply it though.
Gemini also randomly just reverts everything because of some small mistake it found, makes assumptions without checking if those are true (eg this lib absolutely HAS TO HAVE a login() method. If we get a compile error it’s my env setup fault)
It’s just not a pleasant model to work with
Claude Sonnet is way better about following up and making continuous improvements during a long running session.
For some reason Gemini will hard freeze-up on the most random queries, and when it is able to successfully continue past the first call, it only keeps a weird summarized version of its previous run available to itself, even though it's in the payload. It's a weird model.
My take is that, it's world-class at one-shotting, and if a task benefits from that, absolutely use it.
That's a "loss" I can deal with.
Edit: nvm it looks to be up for me again
Many can point to a long history of killed products and soured opinions but you can't deny they've been the great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Work-life balance and high pay vs the low salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life. Particularly in their P/E, which was unjustly low for a while.
Yes - less exciting! But worse?
The Android Open Source Project is not Android.
Was "Android" the way you define it ever open? Isnt it similar to chromium vs chrome? chromium is the core, and chrome is the product built on top of it - which is what allows Comet, Atlas, Brave to be built on.
That's the same thing what GrapheneOS, /e/ OS and others are doing - building on top of AOSP.
Yes. Initially all the core OS components were OSS.
Are you saying they "un-open sourced" things? Because that hasn't happened. Just because a piece of code is open source doesn't mean additional services need to be open source as well.
vscode core is open source, but MS maintains closed-source stuff that builds on top of vscode. That doesn't mean vscode isn't open source anymore.
There's a long history of Google slowly making the experience worse if you want to take advantage of the things that make Android open.
For example, by moving features that were in the AOSP into their proprietary Play Services instead [1].
Or coming soon, preventing sideloading of unverified apps if you're using a Google build of Android [2].
In both cases, it's forcing you to accept tradeoffs between functionality and openness that you didn't have to accept before. You can still use AOSP, but it's a second class experience.
[1] https://arstechnica.com/gadgets/2018/07/googles-iron-grip-on...
[2] https://arstechnica.com/gadgets/2025/08/google-will-block-si...
It's like saying Nvidia's drivers are "open source" because there is a repository, but it only has binaries in the folders.
*closed doors that aren't his
You are now seeing their valuation finally adjusting to that fact all thanks to DeepMind finally being put to use.
I think a lot of the harms you mentioned are real, but they're a natural consequence of capitalistic profit chasing. Governments are supposed to regulate monopolies and anti-consumer behavior like that. Instead of regulating surveillance capitalism, governments are using it to bypass laws restricting their power.
If I were a google investor, I would absolutely want them to defeat ad-blocking, ban yt-dlp, dominate the ad-market and all the rest of what you said. In capitalism, everyone looks out for their own interests, and governments ensure the public isn't harmed in the process. But any time a government tries to regulate things, the same crowd that decries this behavior opposes the regulation as government overreach.
Voters are people and they are moral entities, direct any moral outrage at us.
It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
They're accountable as individuals not as a collective. And it so happens, they are responsible for their government in a democracy but corporations aren't responsible for running countries.
> It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
In the free speech sense, sure. But your criticism isn't founded on solid ground. You should expect corporations to do whatever they have to do within the bounds of the law to turn a profit. Their responsibility is to their investors and employees, they have no responsibility to the general public beyond that which is laid out in the law.
The increasing demand in corporations being part of the public/social moral consciousness is causing them to manipulate politics more and more, eroding what little voice the individuals have.
You're trying to live in a feudal society when you treat corporations like this.
If you're unhappy with the quality of Google's services, don't do business with them. If they broke the law, they should pay for it. But expecting them to be a beacon of morality is accepting that they have a role in society and government beyond mere revenue generating machines. And if you expect them to have that role, then you're also giving them the right to enforce that expectation as a matter of corporate policy instead of law. Corporate policies then become as powerful as law, and corporations have to interfere with matters of government policy on the basis of morality instead of business, so you now have an organization with lots of money and resources competing with individual voters.
And then people have the nerve to complain about PACs, money in politics, billionaire's influencing the government, bribery,etc.. you can't have it both ways. Either we have a country run partly by corporations, and a society driven and controlled by them, or we don't.
Being in charge of a corporation shouldn't elevate someone to a platform where they have a louder voice than the common man. They can vote just as equally as others at the voting booth. they can participate in their capacity as individuals in politics. But neither money, nor corporate influence have places in the governance of a democratic society.
I talk about lawfulness because that is the only rule of law a corporation can and should be expected to follow. Morals are for individuals. Corporations have no morals. They are neither moral nor immoral. Their owners have morals, and you can criticize their greed, but that is a construct of capitalism. They're supposed to enrich themselves. You can criticize them for valuing money over morals, but that's like criticizing the ocean for being wet or the sun for being too hot. It's what they do. It's their role in society.
If a small business owner raises prices to increase revenue, that isn't immoral, right? Even though poor people who frequent them will be adversely affected? Amp that up to the scale of a megacorp, and the morality is still the same.
Corporations are entities that exist for the sole purpose of generating revenue for their owners. So when you criticize Google, you're criticizing a logical organization designed to do the thing you're criticizing it of doing. The CEO of google is acting in his official capacity, doing the job they were hired to do when they are resisting adblocking. The investors of Google are risking their money in anticipation of ROI, so their expectation from Google is valid as well.
When you find something to be immoral, the only meaningful avenue of expressing that with corporations is the law. You're criticizing google as if it was an elected official we could vote in/out of office. or as if it is an entity that can be convinced of its moral failings.
When we don't speak up and use our voice, we lose it.
For every "OMG, internet is filled with ads", people are conveniently forgetting the real-world impact of ALL COMPANIES (and not just Apple) btw. Either you should be upset with the system, and not selectively at Google.
I also hold Americans and western consumers responsible for simply allowing that to happen. As long as the human rights abuses and corruption are 3 or 4 degrees of separation from the retailer, people seem to be perfectly OK with chattel slavery and child labor and indentured servitude and all the human suffering that sits at the base of all our wonderful technology and cheap consumer goods.
If we want to have things like minimum wage and workers rights and environmental protections, then we should mandate adherence to those standards globally. If you want to sell products in the US, the entire supply chain has to conform to US labor and manufacturing and environmental standards. If those standards aren't practical, then they should be tossed out - the US shouldn't be doing performative virtue signalling as law, incentivizing companies to outsource and engage in race to the bottom exploitation of labor and resources in other countries. We should also have tariffs and import/export taxes that allow competitive free trade. It's insane that it's cheaper to ship raw materials for a car to a country in southeast asia, have it refined and manufactured into a car, and then shipped back into the US, than to simply have it mined, refined, and manufactured locally.
The ethics and economics of America are fucking dumb, but it's the mega-corps, donor class, and uniparty establishment politicians that keep it that way.
Apple and Google are inhuman, autonomous entities that have effectively escaped the control and direction of any given human decision tree. Any CEO or person in power that tried to significantly reform the ethics or economics internally would be ousted and memory-holed faster than you can light a cigar with a hundred dollar bill. We need term limits, no more corporation people, money out of politics, and an overhaul, or we're going to be doing the same old kabuki show right up until the collapse or AI takeover.
And yeah, you can single out Google for their misdeeds. They, in particular, are responsible for the adtech surveillance ecosystem and lack of any viable alternatives by way of their constant campaign of enshittification of everything, quashing competition, and giving NGOs, intelligence agencies, and government departments access to the controls of censorship and suppression of political opposition.
I haven't and won't use Google AI for anything, ever, because of any of the big labs, they are most likely and best positioned to engage in the worst and most damaging abuse possible, be it manipulation, invasion of privacy, or casual violation of civil rights at the behest of bureaucratic tyrants.
If it's not illegal, they'll do it. If it's illegal, they'll only do it if it doesn't cost more than they can profit. If they profit, even after getting caught and fined and taking a PR hit, they'll do it, because "number go up" is the only meaningful metric.
The only way out is principled regulation, a digital bill of rights, and campaign finance reform. There's probably no way out.
They don't, all cobalt in Apple products is recycled.
> and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
They don't, Apple audits their entire supply chain so it wouldn't hide anything if something moved to another subcontractor.
0. https://www.apple.com/newsroom/2023/04/apple-will-use-100-pe...
That would be bad if it happened, which is why it doesn't happen. Working in a factory isn't an inhumane condition.
And what of the people that ravenously support ads and ad-supported content, instead of paying?
What of the consumptive public? Are they not responsible for their choices?
I do not consume algorithmic content, I do not have any social media (unless you count HN for either).
You can't have it both ways. Lead by example, stop using the poison and find friends that aren't addicted. Build an offline community.
Also, HN is by definition algorithmic content and social media, in your mind what do you think it is?
Even 25 years ago people wouldn't believe Youtube could exist. Anyone can upload whatever they want, however often they want, Youtube will be responsible for promoting it, they'll serve it to however many billions of users want to view it, and they'll pay you 55% of the revenue it makes?
They are forcing a payment method on us. It's basically like they have their hand in our pockets.
Most of the time, you need to pick one. Modern advertising is not based on finding the item with the most utility for the user - which means they are aimed at manipulating the user's behaviour in one way or another.
But I hear you say - you can use iPhones for productive things and not just mindless brainrot. And that's the same with YouTube as well. Many waste time on YouTube, but many learn and do productive things.
Dont paint everything with a single, large, coarse brush stroke.
Taking those products from where they were to the juggernauts they are today was not guaranteed to succeed, nor was it easy. And yes, plenty of innovation happened with these products post-acquisition.
I don't think what you're saying proves that the companies that were acquired couldn't have done that themselves.
You mean Drive vs OneDrive or, maybe Docs vs Word?
But at least this company is not run by a narcissistic sociopath.
They have done a lot, but most of it was in the "don't be evil" days and they are a fading memory.
It's good there's competition in the space though.
Gemini 3 however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed. The 5th leg however is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
https://imgcdn.stablediffusionweb.com/2024/4/19/8e54925a-004...
For the above pic I asked "What is wrong with the image?"
Result: it totally missed the most obvious one - six fingers.
Instead it said this:
Based on a close inspection of the image, here are the things that are "wrong" or anomalous, indicating that it is likely an AI-generated image:
- The Thumb Anatomy: This is the most obvious error. The digit in the thumb's position (far left) looks exactly like a long index finger. A natural human thumb has two segments (phalanges) and is shorter and thicker, opposing the other fingers. This digit appears to have three segments and is roughly the same length and shape as the other fingers, giving the hand the appearance of having five fingers and no thumb.
- Unnatural Skin Texture: The skin is unnaturally smooth, waxy, and free of normal human imperfections like distinct pores, fine wrinkles, or hair. It has a "plastic" or "airbrushed" look typical of generative AI.
- Palm Lines: The creases on the palm are vague and generic. They don't follow the distinct pattern of major flexion creases (like the heart, head, and life lines) found on a real human hand.
- Wrist Connection: The connection between the hand and the wrist looks slightly distorted, with tendons that appear too prominent and smooth for the pose.
[Edit: 3.0 is same as 2.5 - both answered almost identically]
Max: https://chat.qwen.ai/c/ca671562-7a56-4e2f-911f-40c37ff3ed79
VL-235B: https://chat.qwen.ai/c/21cc5f4e-5972-4489-9787-421943335150
(Source: "The mind is flat" by Nick Chater)
What I am referring to is a much more basic form of shape recognition that goes on at the level of the neural networks in the retina.
My armchair take would be that watt usage probably isn't a good proxy for computational complexity in biological systems. A good piece of evidence for this is from the C. elegans research that has found that the configuration of ions within a neuron--not just the electrical charge on the membrane--record computationally-relevant information about a stimulus. There are probably many more hacks like this that allow the brain to handle enormous complexity without it showing up in our measurements of its power consumption.
Jaxley: Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics [1]
They basically created software to simulate real neurons and ran some realistic models to replicate typical AI learning tasks:
"The model had nine different channels in the apical and basal dendrite, the soma, and the axon [39], with a total of 19 free parameters, including maximal channel conductances and dynamics of the calcium pumps."
So yeah, real neurons are a bit more complex than ReLU or Sigmoid.
[1] https://www.biorxiv.org/content/10.1101/2024.08.21.608979v2....
That said, I think there is good reason to be skeptical that the chance is good. The consistent trend of finding higher complexity than expected in biological intelligences (like in C. elegans), combined with the fact that the physical natures of digital and biological architectures are very different, is a good reason to bet on it being really complex to emulate with our current computing systems.
Obviously there is a way to do it physically--biological systems are physical after all--but we just don't understand enough to have the grounds to say it is "likely" doable digitally. Stuff like the Universal Approximation Theorem implies that in theory it may be possible, but that doesn't say anything about whether it is feasible. Same thing with Turing completeness too. All that these theorems say is our digital hardware can emulate anything that is a step-by-step process (computation), but not how challenging it is to emulate it or even that it is realistic to do so. It could turn out that something like human mind emulation is possible but it would take longer than the age of the universe to do it. Far simpler problems turn out to have similar issues (like calculating the optimal Go move without heuristics).
This is all to say that there could be plenty of smart ideas out there that break our current understandings in all sorts of ways. Which way the cards will land isn't really predictable, so all we can do is point to things that suggest skepticism, in one direction or another.
Physical analog chemical circuits whose physical structure directly is the network, and use chemistry/physics directly for the computations. For example, a sum is usually represented as the number of physical ions present within a space, not some ALU that takes in two binary numbers, each with some large number of bits, requiring shifting electrons to and from buckets, with a bunch of clocked logic operations.
There are a few companies working on more "direct" implementations of inference, like Etched AI [1] and IBM [2], for massive power savings.
[1] https://en.wikipedia.org/wiki/Etched_(company)
[2] https://spectrum.ieee.org/neuromorphic-computing-ibm-northpo...
Meanwhile you could benchmark for something actually useful. If you're about to say "But that means it won't work for my use case of identifying a person on a live feed" or whatever, then why don't you test that? I really don't understand the kick people get out of successfully tricking LLMs on non-productive tasks with no real-world application. Just like the "how many r in strawberry?", "uh uh uh it says two urh urh"... ok, but so what? What good is a benchmark that is so far from a real use case?
It's a perfectly valid benchmark and very telling.
I see that AI is reaching the level of a middle school boy...
Input: $1.25 -> $2.00 (1M tokens)
Output: $10.00 -> $12.00
Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.
If we're paying more for a more petaflop heavy model, it makes sense that costs would go up. What really would concern me is if companies start ratcheting prices up for models with the same level of performance. My hope is raw hardware costs and OSS releases keep a lid on the margin pressure.
"Users"? Or people that get presented with it and ignore it?
Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user. They should just go ahead and claim the "most quickly adopted product in history" as well.
I'd be comfortable calling a 'user' anyone who clicked to expand the little summary. Not sure what else you'd call them.
Normal users (i.e. not grumpy techies ;) ) probably just go with the flow rather than finding it irritating.
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
So your Gmails are being read by Gemini and put into the training set for future models. Oh dear, and Google is being sued over using Gemini to analyze users' data by default, which potentially includes Gmails.
Where is the outrage?
[0] https://web.archive.org/web/20251118111103/https://storage.g...
[1] https://www.yahoo.com/news/articles/google-sued-over-gemini-...
"gmail being read by gemini" does NOT mean "gemini is trained on your private gmail correspondence". it can mean gemini loads your emails into a session context so it can answer questions about your mail, which is quite different.
> in accordance with Google’s relevant terms of service, privacy policy
That said, LLMs are the most data-greedy technology of all time, and it wouldn't surprise me that companies building them feel so much pressure to top each other they "sidestep" their own TOSes. There are plenty of signals they are already changing their terms to train when previously they said they wouldn't--see Anthropic's update in August regarding Claude Code.
If anyone ever starts caring about privacy again, this might be a way to bring down the crazy AI capex / tech valuations. It is probably possible, if you are a sufficiently funded and motivated actor, to tease out evidence of training data that shouldn't be there based on a vendor's TOS. There is already evidence some IP owners (like NYT) have done this for copyright claims, but you could get a lot more pitchforks out if it turns out Jane Doe's HIPAA-protected information in an email was trained on.
Google is betting that moving email and cloud is such a giant hassle that almost no one will do it, and ditching YT and Maps is just impossible.
Will run extended benchmarks later, let me know if you want to see actual data.
It generated a blender script that makes the model.
I guess openscad would be a sweet spot in the middle. Good shout, might experiment.
I'm not being nit picky here. I think there are issues beyond terminology that you may not be familiar with, as it is clearly not your field. That's ok.
The "design" in computer aided design is engineering design. This is not the same definition of "design" used in, say, graphic design. Something is not called CAD because it helps you create an image that looks like a product on a computer. It is CAD because it creates engineering design files (blueprints) that can be used for the physical manufacture of a device. This places very tight and important constraints on the methods used, and capabilities supported.
Blender is a sculpting program. Its job is to create geometry that can be fed into a rendering program to make pretty pictures. Parasolid is a CAD geometry kernel at the core of many CAD programs, which has the job of producing manufacturable blueprints. The operations supported map to physical manufacturing steps - milling, lathe, and drill operations. The modeling steps use constraints in order to make sure, e.g., that screw holes line up. Blender doesn't support any of that.
To an engineer, saying that an LLM gave you a blender script for a CAD operation is causing all sorts of alarm klaxons to go off.
... without plugins. https://www.cadsketcher.com/
If Blender can export a .3mf file and a slicer gets it ready for 3D printing (gcode that actually instructs the printer head), is the slicer actually CAD software?
And if you can export many formats that work with some manufacturing devices and you built a model in blender, did blender not help you with CAD?
In high school CAD/CAM we used various CAD programs for designing (sculpting?) things and then imported them into CAM to generate g code programs, set tool constraints and such
Results are amazing! 2.5 and 3 seem way, way ahead.
2.5 stands between GPT-5 and GPT-5.1, where GPT-5 is the best of the 3.
In preliminary evals Gemini 3 seems to be way better than all, but I will know when I run extended benchmarks tonight.
Even tools like Claude Code have only been fully released for six months, and they've already had a pretty dramatic impact on how many developers work.
It scares the absolute shit out of everyone.
It's clear far beyond our little tech world to everyone this is going to collapse our entire economic system, destroy everyone's livelihoods, and put even more firmly into control the oligarchic assholes already running everything and turning the world to shit.
I see it in news, commentary, day to day conversation. People get it's for real this time and there's a very real chance it ends in something like the Terminator except far worse.
Not trying to challenge you, and I'd sincerely love to read your response. People said similar things about previous gen-AI tool announcements that proved over time to be overstated. Is there some reason to put more weight in "what people on HN said" in this case, compared to previous situations?
1. They likely work at the company (and have RSUs that need to go up)
2. Also invested in the company in the open market or have active call options.
3. Trying to sell you their "AI product".
4. All of the above.
I don't see any other way of doing this. The people who keep reading and following comments either here on HN, from LocalLlama or otherwise will continue to be misinformed by all the FUD and guerilla marketing that is happening across all of these places.
Google being able to skip ahead of every other AI company is wild. They just sat back and watched, then decided it was time to body the competition.
The DOJ really should break up Google [1]. They have too many incumbent advantages that were already abuse of monopoly power.
[1] https://pluralpolicy.com/find-your-legislator/ - call your reps and tell them!
The problem with Google is that someone had to show them how to make a product out of the thing, which Open AI did.
Then Anthropic taught them to make a more specific product out of their models.
In every aspect, they're just playing catch up, and playing me too.
Models are only part of the solution
The reason for this is because LLM companies have tuned their models to aggressively blow smoke up their users' asses.
These "tools" are designed to aggressively exploit human confirmation bias, so as to prevent the user from identifying their innumerable inadequacies.
Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit.
You're implying they're lying?
Marketing is always somewhere in the middle
Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can". Especially on iOS where every user is someone who went to the App Store and downloaded it. Admittedly on Android, Gemini is preinstalled these days but it's still a choice that users are making to go there rather than being an existing product they happen to use otherwise.
Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest.
Yes and no, my power button got remapped to opening Gemini in an update...
I removed that but I can imagine that your average user doesn't.
As an Android and Google Workspace user, I definitely feel like Google is "pushing their AI into everything they can", including the Gemini app.
This isn't a hot take at all. Marketing (iPhone keynotes, product launches) is about showing impressive numbers. It isn't the gotcha you think it is.
Showing a new screen by default to everyone who is using your main product flow and then claiming that everyone who is seeing it is a priori a "user" is absurd. And that is the only way they can get to 2 billion a month, by my estimation.
They could put a new yellow rectangle at the top of all google search results and claim that the product launch has reached 2 billion monthly users and is one of the fastest-growing products of all time. Clearly absurd, and the same math as what they are saying here. I'm claiming my hottake gotcha :)
For example I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have Google One cloud storage sub for my photos and it comes with a Gemini Pro apparently (thanks to someone on HN for pointing it out). And so Gemini is my go to LLM app/service. I suspect the same goes for many others.
I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic.
https://old.reddit.com/r/wallstreetbets/comments/1oz6gjp/new...
The mention markets are pure degenerate gambling and everyone involved knows that
https://kalshi.com/markets/kxminajmention/nicki-minaj/kxmina...
For the kind of person playing these sorts of games, that's actually really "hype".
You’re spot on that people should think of who is on the other side of the trades they’re taking, and be extremely paranoid of being adversely selected.
Disallowing people from making terrible trades seems…paternalistic? Idk
None of whom were forced by anyone to place bets in the first place.
It's transcribing the meeting but hallucinates badly... both in fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it's done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.
I'm concerned.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
One of them isn't the same as the others (hint: it's Apple). The only thing Apple is doing with Maps is adding ads https://www.macrumors.com/2025/10/26/apple-moving-ahead-with...
Is there evidence that's true? That the other models are significantly better than the ones you named?
Using Anthropic or OpenAI's models are incredibly straightforward -- pay us per month, here's the button you press, great.
Where do I go for this for these Google models?
Does that have any relation to the Gemini plan thing: https://one.google.com/explore-plan/gemini-advanced?utm_sour... ?
It's not exactly the same since e.g. Copilot adds prompts, reduces context, etc.
I've even heard it in mainstream circles that have no idea what HN is, and aren't involved in tech.
Probably would have been cheaper to keep Google Reader running - kidding, but this is the first time I remember the gut punch of Google cancelling something I heavily used personally.
They are also bad about strategy. A good example is the number of messaging systems they have had. Instead of making new ones, they should have updated an existing one with a new backend and UI.
I like the Google Messages sync SMS online with Google Fi, but it is missing features. If they could do it globally, they would have something big.
But this is if you want to pay per token. Otherwise you should just be able to use your Gemini Pro subscription (it doesn't need Ultra). Subscriptions are at https://gemini.google/subscriptions/
Is that correct? I can't even upgrade a Group separately?
I can't find the announcement anymore, but you can see it under benefits here https://support.google.com/googleone/answer/14534406?hl=en
The initial separate subscriptions were confusing at best. Current situation is pretty much same as Anthropic/OpenAI - straightforward
Edit: changed ~1 month ago (https://old.reddit.com/r/Bard/comments/1npiv2o/google_ai_pro...)
I poked around and couldn't figure this out.
"Public preview Individual plan $0/month"
"Coming soon Team plan"
I also tried to use Gemini 3 in my Gemini CLI and it's not available yet (it's available to all Ultra, but not all Pro subscribers), I needed to sign up to a waitlist
All in all, Google is terrible at launching things like that in a concise and understandable way
This is just irritating. I am not going to give them money until I know I can try their latest thing and they've made it hard for me to even know how I can do that.
However, Gemini CLI is a rather bad product. There is (was?) an issue that makes the CLI fall back to flash very soon in every session. This comment explains it well: https://news.ycombinator.com/item?id=45681063
I haven't used it in a while, except for really minor things, so I can't tell if this is resolved or not
I had the exact same experience and walked away to chatgpt.
What a mess.
Can the mathematicians here try asking it to invent new novel math related to [insert your field of specialization] and see if it comes up with something new and useful?
Try lowering the temperature, use SymPy etc.
If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:
"thoughtSignature": "context_engineering_is_the_way_to_go"
[1] https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high...

Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
The ARC puzzles in question: https://arcprize.org/arc-agi/2/
Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I think this is also fair to use any means to beat it.
However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.
ARC does not provide this kind of dataset, only a small public one and a private one where they do the benchmarks.
Building your own large private ARC set does not seem too difficult if you have enough resources.
The trick is to not put more value in the score than what it is.
Like with humans [1], generalized reasoning ability lets you skip the direct storage of that solution, and many many others, completely! You can just synthesize a solution when a problem is presented.
They can target the benchmark directly, not just a replica. If Google or OAI are bad actors, they already have benchmark data from previous runs.
Not only do you have the financial self-interest to do it (helps with capital raising to be #1), but you are worried that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.
Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
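A quick sketch of that idea; `ask_model` stands in for whatever wrapper you use to call the model, and the distractor strings are arbitrary:

```python
import random

DISTRACTORS = [
    "Note: the office coffee machine is broken again.",
    "Unrelated: the deploy window closes at 17:00 UTC.",
    "FYI, ticket #4521 was closed as wontfix.",
]

def perturb(question: str, rng: random.Random) -> str:
    """Wrap the question in junk context so no two runs see identical text."""
    before, after = rng.sample(DISTRACTORS, k=2)
    return f"{before}\n{question}\n{after}"

def robust_score(question: str, expected: str, ask_model, runs: int = 20) -> float:
    """Average pass rate over perturbed variants of the same question."""
    rng = random.Random(0)
    return sum(expected in ask_model(perturb(question, rng)) for _ in range(runs)) / runs
```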
- semi-private, which they use to test proprietary models and which could be leaked
- private: used to test downloadable open source models.
The ARC-AGI prize itself is for open source models.
If you want to train your model you'd need more data than the private set anyway. So you have to build a very large training set on your own, using the same kind of puzzles.
It is not that hard, really, just tedious.
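A toy example of generating such puzzles; real ARC tasks draw on a much richer library of transformations, but the mechanics are the same. This one only knows a single "mirror left-right" rule:

```python
import random

def random_grid(rng: random.Random, h: int, w: int, colors: int = 5) -> list:
    return [[rng.randrange(colors) for _ in range(w)] for _ in range(h)]

def mirror_lr(grid: list) -> list:
    return [row[::-1] for row in grid]

def make_task(seed: int, n_train: int = 3) -> dict:
    """One ARC-style task: infer 'mirror left-right' from a few example pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_train + 1):  # the final pair is the held-out test
        g = random_grid(rng, rng.randint(3, 6), rng.randint(3, 6))
        pairs.append({"input": g, "output": mirror_lr(g)})
    return {"train": pairs[:-1], "test": pairs[-1]}

tasks = [make_task(seed) for seed in range(1000)]  # the tedious part is adding more rules
```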
We have a global RL pipeline on our hands.
If there is something new a LLM/AI model can't solve today, plenty of humans can't either.
But tomorrow every LLM/AI model can solve it, and again plenty of humans still can't.
Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train with new scenarios, it will start to bleed out the humans in the loop.
There are, for a fact, teams creating puzzles to RL against as training environments. It's beneficial to RL training and in particular compute-efficient if you schedule the environment difficulty throughout training; there was a great recent paper on this. Creating environment data that generalizes outside the environment is a challenging engineering task and super valuable, whether it looks like ARC-AGI or not.
Also ARC AGI is general enough that if you create similar data you’re just creating generic visual puzzle data. Should all visual puzzle data be off limits ?
Every company releasing new models leads with benchmark numbers, so it's hard to imagine they are not all putting a lot of effort into benchmark-maxxing.
I haven't seen it in the box yet, and pricing is unknown https://cloud.google.com/blog/products/ai-machine-learning/r...
If it's good enough to be useful on your code base, it better be good enough to instruct you on how to use it.
How easy it is depends on whether or not they've built that kind of thing in
> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.
[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
.... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google.
It probably makes some sense internally in big tech corporation logic (no new data storage agreements on top of the ones the user has already agreed to when signing up for Drive etc.), but as a user, I find it incredibly strange too – especially since the text chats are in some proprietary format I can't easily open on my local GDrive replica, but the images generated or uploaded just look like regular JPEGs and PNGs.
// Calculate rotations
// We use a cumulative calculation logic mentally, but here simple degrees work because of the transition reset trick or specific animation style.
// To prevent the "spin back" glitch at 360->0, we can use a simple tick without transition for the wrap-around,
// but for simplicity in this specific React rendering, we will stick to standard 0-360 degrees.
// A robust way to handle the spin-back on the second hand is to accumulate degrees, but standard clock widgets often reset.

> The station clocks in Switzerland are synchronised by receiving an electrical impulse from a central master clock at each full minute, advancing the minute hand by one minute. The second hand is driven by an electrical motor independent of the master clock. It takes only about 58.5 seconds to circle the face; then the hand pauses briefly at the top of the clock. It starts a new rotation as soon as it receives the next minute impulse from the master clock.[3] This movement is emulated in some of the licensed timepieces made by Mondaine.
"Against the Run": https://www.youtube.com/watch?v=7xfvPqTDOXo
But maybe this example was used by many other people and so it proliferated?
I made a few improvements... which all worked on the first try... except the ticking sound, which worked on the second try (the first try was too much like a "blip")
The prompt the other person was using is:
```
Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.
```
Which is much more difficult.
For what it's worth, I supplied the same prompt as the OG clock challenge and it utterly failed, not only generating a terrible clock, but doing so with a fair bit of typescript: https://ai.studio/apps/drive/1c_7C5J5ZBg7VyMWpa175c_3i7NO7ry...
as a calendar or "date" system, we start at UNIX time's creation, so it's currently 1.76 gigaseconds AUNIX. You might use megaseconds as the "week" and gigaseconds more like an era, e.g. Queen Elizabeth III's reign, persisting through the entire fourth gigasecond and into the fifth. The clock also displays teraseconds, though this is just a little purple speck atm. of course, this can work off-Earth where you would simply use 88.775ks as the "day"; the "dates" a Martian and Earthling share with each other would be interchangeable.
I can't seem to get anyone interested in this very serious venture, though... I guess I'll have to wait until the 50th or so iteration of Figure, whenever it becomes useful, to be able to build a 20-foot-tall physical metric UNIX clock in my front yard.
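For anyone who wants to play with the idea, here is a tiny sketch of the conversion described above (my own illustration; the unit labels and breakdown are just what the comment implies):

```
// Break a Unix timestamp (in seconds) into metric "calendar" units:
// teraseconds / gigaseconds / megaseconds / kiloseconds / seconds.
function metricUnixTime(unixSeconds = Math.floor(Date.now() / 1000)) {
  const units = [
    ['Ts', 1e12], // teraseconds: the little purple speck
    ['Gs', 1e9],  // gigaseconds: the "era"
    ['Ms', 1e6],  // megaseconds: roughly the "week" (about 11.6 days)
    ['ks', 1e3],  // kiloseconds
    ['s', 1],
  ];
  let rest = unixSeconds;
  const parts = [];
  for (const [label, size] of units) {
    const n = Math.floor(rest / size);
    rest -= n * size;
    parts.push(`${n}${label}`);
  }
  return parts.join(' ');
}

console.log(metricUnixTime());        // something like "0Ts 1Gs 763Ms ..." in late 2025
console.log(metricUnixTime(1.76e9));  // "0Ts 1Gs 760Ms 0ks 0s"
```

An off-Earth variant only changes how "days" are rendered (88,775 s for Mars instead of 86,400 s); the underlying count of seconds stays the same, which is what makes the "dates" interchangeable.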
I'm not surprised. I'm French and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French or other languages where there is no such thing. A 100% american thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on USA-centric data). At the very least it makes it easy to tell when someone is just copypasting LLM output into some other website.
If this is an American thing I'm happy to disown/denounce it; it's my least favorite pattern in Gemini output.
Also, can you guess which pelican SVG was gemini 3 vs 2.5? https://xcancel.com/xundecidability/status/19908113191723213...
Asking the follow up "what are ALL the individuals mentioned in the story" results in both models listing ~40 names and both of those lists include Yudkowsky.
> Eliezer Yudkowsky is a central figure in the article, mentioned multiple times as the intellectual originator of the community from which the "Zizians" splintered. His ideas and organizations are foundational to the entire narrative.

The story isn't about Yudkowsky. At each level of summarization you have to make hard decisions about what to keep. Not every story about the United States needs to mention George Washington.
Edit: And upon skimming the article at the points where Yudkowsky's name is mentioned, I 100% agree with stickfigure.
I challenge you to name one way in which the story falls apart without the mention of Yudkowsky.
>do you think the model did a bad job then in it's second response
Yes, very obviously it told you what you wanted to hear. This is behavior that should not be surprising to you.
"Wtf - no mention of Yudkowsky?"
Also, that is the position you've been defending this whole thread. This whole conversation is happening because you believe Yudkowsky is an important figure to the story. Here is specifically how he features in the article:
* *The Founder:* Yudkowsky is identified as the founder of the Machine Intelligence Research Institute (MIRI) and the Center for Applied Rationality (CFAR), as well as the founder of the LessWrong forum. These are the institutions the Zizians originally belonged to and eventually declared war against.
* *The Source Code:* The Zizians based their radical worldview on Yudkowsky’s concepts, specifically *"timeless decision theory."* However, they believed they were applying his logic more rigorously than he was. They became convinced that Yudkowsky’s organizations had betrayed these principles (specifically regarding a rumor about MIRI paying blackmail to cover up a scandal), which they viewed as a moral failing that justified their rebellion.
They are *wrong*, and provided exactly the same immaterial evidence as you did in this thread (I still insist that the article suffers zero damage if you remove Yudkowsky from it and instead only mention the institutions and concepts that stem from him), but given all the behavior I've seen now, the summary that was the initial issue of this thread should have included him.
[What I would've really liked to do was to prompt for another person of equal non-prominence who was in the article but not in the summary, and see what comes up. But I sure am not reading the 80-102 minute article just for this and we're unlikely to find an agreement about the 'equal non-prominence' part if I challenged you to pick one.]
Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.
I absolutely LOVE that Google themselves drew a sharp distinction here.
Come on, you can’t be serious.
But I am looking forward to trying it out. I find Gemini to be great as handling large-context tasks, and Google's inference costs seem to be among the cheapest.
Given the nature of how those models work, you don't need exact replicas.
Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says that the 3 fastest humans to solve this problem took 14min, 20min and 1h14min respectively.
Even though I expect this sort of problem to very much be in the distribution of what the model has been RL-tuned to do, it's wild that a frontier model can now solve in minutes what would take me days.
But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.
I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” as well as “absolutely ridiculous” to create within the usual time constraints.
Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them in tournament conditions.
Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently so, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond.
I don't predict the future, and I'm very skeptical of anybody who claims to do so; correctly predicting the present is already hard enough. I'm just saying that, given the progress we've already made, I find it plausible that a system like that could be built in a few years. The details of what it would look like are beyond my pay grade.
---
[0] With caveats in endgames, closed positions and whatnot, I'm using it as an example.
From what I've seen, sometimes the computer correctly assesses that the "bad" move opens up some kind of "checkmate in 45 moves" that could technically happen, but requires the opponent to see it 45 moves ahead of time and play something that would otherwise appear to be completely sub-optimal until something like 35 moves in, at which point normal peak grandmasters would finally go "oh okay now I get the point of all of that confusing behavior, and I can now see that I'm going to get mated in 10 moves".
So, the computer is "right" - that move is worse if you're playing a supercomputer. But it's "wrong" because that same move is better as long as you're playing a human, who will never be able to see an absurd thread-the-needle forced play 45-75 moves ahead.
That said, this probably isn't what GP was referring to, as it wouldn't lead to an assignment of a "brilliant" move simply for failing to see the impossible-to-actually-play line.
Poker is funny because you have humans emulating human-beating machines, but that’s hard enough to do that you have players who don’t do this win as well.
But mostly what happens is that a move isn't so good, but it isn't so bad either: the computer will tell you it is sub-optimal, but a human won't be able to refute it in finite time, so their practical (as opposed to theoretical) chances are reduced. One great recent example of that is Pentala Harikrishna's queen sacrifice in the World Cup: an amazing conception that the computer says is borderline incorrect, but which leads to such complications and such an uncomfortable position for his opponent that it was practically a great choice.
Do you have any links? I haven't seen any such (forget GM, not even Magnus), barring the opponent making mistakes.
Because ya, that doesn't happen lol.
https://chess.stackexchange.com/questions/29716/positions-th...
It basically comes down to “ideas that are rare enough that they were never programmed into a chess engine”.
Blockades or positions where no progress is possible are a common theme. Engines will often keep tree searching where a human sees an obvious repeating pattern.
Here’s also an example where two engines are playing, and the DeepMind engine finds a move that I think would be obvious to most grandmasters, yet Stockfish misses it: https://youtu.be/lFXJWPhDsSY?si=zaLQR6sWdEJBMbIO
That being said, I’m not sure that this necessarily correlates with brilliancy. There are a few of these that I would probably get in classical time and I’m not a particularly brilliant player.
> it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them...
> ... in tournament conditions.
I'm suggesting that I'd like to see the ones that humans have found - outside of tournament conditions. Perhaps the gulf between us arises from an unspoken reference to solutions "unrealistic to expect a human to find" without the window-of-time qualifier?
Otherwise I think the mainframe would lose because of being too passive
That said, chess is such a great human invention. (Go is up there too. And Texas no-limit hold'em poker. Those are my top 3 votes for "best human tabletop games ever invented". They're also, perhaps not coincidentally, the hardest for computers to be good at. Or, were.)
If you look on Youtube there are many channels where strong players analyze these games. As Demis Hassabis once put it, it's like chess from another dimension.
If you want to see this against someone like Magnus, it is rare as super GMs do not spend a lot of time playing engines publicly.
But if you want to see them against a normal chess master somewhere between master and international master, it is everywhere. E.g. this guy analyses every one of his matches afterwards, and you frequently hear "oh, I would never have seen that line":
https://www.youtube.com/playlist?list=PLp7SLTJhX1u6zKT5IfRVm...
(start watching around 1000+ for frequently seeing those moments)
Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies, see my other comment in this thread.
Sadly, the answer was wrong.
It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use websearch.
Still a useful tool though. It definitely gets the majority of the insights.
Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
well, there's your problem. it behaves like a search summary tool and not like a problem solver if you enable google search
I'm glad I was able to inspire a new username for you. But aren't you concerned that if you let other people influence you like that, you're frying your brain? Shouldn't everything originate in your own mind?
> They don't provide any value except to a very small percentage of the population who safely use them to learn
There are many things that only a small percentage of the population benefit from or care about. What do you want to do about that? Ban those things? Post exclamation-filled comments exhorting people not to use them? This comes back to what I said at the end of my previous comment:
You might want to make sure you understand what you’re trying to achieve.
Do you know the answer to that?
> A language model is not the same as a convolution neural network finding anomalies on medical imagining.
Why not? Aren't radiologists "frying their brains" by using these instead of examining the images themselves?
The last paragraph of your other comment was literally the Luddite argument. (Sorry I can't quote it now.) Do you know how to weave cloth? No? Your brain is fried!
The world changes, and I find it more interesting and challenging to change with it, than to fight to maintain some arbitrary status quo. To quote Ghost in the Shell:
All things change in a dynamic environment. Your effort to remain what you are is what limits you.
For me, it's not about "getting ahead" as you put it. It's about enjoying my work, learning new things. I work in software development because I enjoy it. LLMs have opened up new possibilities for me. In that 5 year future you mentioned, I'm going to have learned a lot of things that someone not using LLMs will not have.
As for being dependent on Altman et al., you can easily go out and buy a machine that will allow you to run decent models yourself. A Mac, a Framework desktop, any number of mini PCs with some kind of unified memory. The real dependence is on the training of the models, not running them. And if that becomes less accessible, and new open weight models stop being released, the open weight models we have now won't disappear, and aren't going to get any worse for things like coding or searching the web.
> Keep falling for lesswrong bs.
Good grief. Lesswrong is one of the most misleadingly named groups around, and their abuse of the word "rational" would be hilarious if it weren't sad. In any case, Yudkowsky advocated being ready to nuke data centers, in a national publication. I'm not particularly aware of their position on the utility of AI, because I don't follow any of that.
What I'm describing to you is based on my own experience, from the enrichment I've experienced from having used LLMs for the past couple of years. Over time, I suspect that kind of constructive and productive usage will spread to more people.
> There are many things that only a small percentage of the population benefit from or care about. What do you want to do about that?
---There are many things in our society that I would like to ban that are useful to a small percentage of the population, or that at least should be heavily regulated. Guns, for example. A more extreme example would be cars. Many people drive 5 blocks when they could walk, to their (and everyone else's) detriment. Forget the climate, it impacts everyone (brake dust, fumes, pedestrian deaths). Some cities create very expensive tolls / parking fees to prevent this; this angers most people and is seen as irrational by the masses, but it is necessary and not done enough. Open free societies are a scam told to us by capitalists that want to exploit without any consequences.
--- I want to air-gap all computers in classrooms. I want students to be expelled for using LLMs to do assignments, as they would have been previously for plagiarism (that's all an llm is, a plagiarism laundering machine).
---During COVID there was a phenomenon where some children did not learn to speak until they were 4-5 years old, and some of those children were even diagnosed with autism. In reality, we didn't understand fully how children learned to speak, and didn't understand the importance of the young brain's need to subconsciously process people's facial expressions. It was Masks!!! (I am not making a statement on masks fyi) We are already observing unpredictable effects that LLMs have on the brain and I believe we will see similar negative consequences on the young mind if we take away the struggle to read, think and process information. Hell I already see the effects on myself, and I'm middle aged!
> Why not? Aren't radiologists "frying their brains" by using these instead of examining the images themselves?
--- I'm okay with technology replacing a radiologist!!! Just like I'm okay with a worker being replaced in an unsafe textile factory! The stakes are higher in both of these cases, and it is obviously in the best interest of society as a whole. The same cannot be said for a machine that helps some people learn while making the rest dependent on it. It's the opposite of a great equalizer; it will lead to a huge gap in inequality for many different reasons.
We can all say we think this will be better for learning, that remains to be seen. I don't really want to run a worldwide experiment on a generation of children so tech companies can make a trillion dollars, but here we are. Didn't we learn our lesson with social media/porn?
If Ubers were subsidized and cost only $20.00 a month for unlimited rides, could people be trusted to only use them when it was reasonable, or would they be taking Ubers to go 5 blocks, increasing the risk for pedestrians and deteriorating their own health? They would use them in an irresponsible way.
If there was an unlimited pizza machine that cost $20.00 a month to create unlimited food, people would see that as a miracle! It would greatly benefit the percentage of the population that is food insecure, but could they be trusted to not eat themselves into obesity after getting their fill? I don't think so. The affordability of food, and the access to it has a direct correlation to obesity.
Both of these scenarios look great on the surface but are terrible for society in the long run.
I could go on and on about the moral hazards of LLMs, there are many more outside of just the dangers of learning and labor. We are being told they are game changing by the people who profit off them..
In the past, empires bet their entire kingdoms on the words of astronomers and magicians who said they could predict the future. I really don't see how the people running AI companies are any different from those astronomers (they even say they can predict the future LOL!)
They are Dunning-Kruger plagiarism laundering machines as I see it. Text extruding machines controlled by a cabal of tech billionaires who have proven time and time again that they do not have society's best interest at heart.
I really hope this message is allowed to send!
The problem with such approaches is that they involve some people imposing their opinions on others, “for their own good”. That kind of thing often doesn’t turn out well. The Amish address that by letting their children leave to experience the outside world, so that their return is (arguably) voluntary - they have an opportunity to consent to the Amish social contract.
But what you seem to be doing is making a determination of what’s good for society as a whole, and then because you have no way to effect that, you argue against the tools that we might abuse rather than the tendencies people have to abuse them. It seems misplaced to me. I’m not saying there are no societal dangers from LLMs, or problems with the technocrats and capitalists running it all, but we’re not going to successfully address those issues by attacking the tools, or people who are using them effectively.
> In the past, empires bet their entire kingdom's on the words of astronomers and magicians who said they could predict the future.
You’re trying to predict the future as well, quite pessimistically at that.
I don’t pretend to be able to predict the future, but I do have a certain amount of trust in the ability of people to adapt to change.
> that's all an llm is, a plagiarism laundering machine
That’s a possible application, but it’s certainly not all they are. If you genuinely believe that’s all they are, then I don’t think you have a good understanding of them, and it could explain some of our difference in perspective.
One of the important features of LLMs is transfer learning: their ability to apply their training to problems that were not directly in their training set. Writing code is a good example of this: you can use LLMs to successfully write novel programs. There’s no plagiarism involved.
https://archive.nytimes.com/www.nytimes.com/books/97/05/18/r...
However, I don't agree that AI is a risk to the extreme levels you seem to think it is. The truth is that humans have advanced through the use of technology since the first tool, and we are horrible predictors of what these technologies will bring.
So far they have been mostly positive, I don't see a long term difference here.
They have, believe it or not, very little power to stop kids from choosing to use cheating engines on their personal laptops. Universities are not Enterprise.
You’ve probably heard of the Luddites, the group who destroyed textile mills in the early 1800s. If not: https://en.wikipedia.org/wiki/Luddite
Luddites often get a bad rap, probably in large part because of employer propaganda and influence over the writing of history, as well as the common tendency of people to react against violent means of protest. But regardless of whether you think they were heroes, villains, or something else, the fact is that their efforts made very little difference in the end, because that kind of technological progress is hard to arrest.
A better approach is to find ways to continue to thrive even in the presence of problematic technologies, and work to challenge the systems that exploit people rather than attack tools which can be used by anyone.
You can, of course, continue to flail at the inevitable, but you might want to make sure you understand what you’re trying to achieve.
> Malcolm L. Thomas argued in his 1970 history “The Luddites” that machine-breaking was one of the very few tactics that workers could use to increase pressure on employers, undermine lower-paid competing workers, and create solidarity among workers. "These attacks on machines did not imply any necessary hostility to machinery as such; machinery was just a conveniently exposed target against which an attack could be made."[10] Historian Eric Hobsbawm has called their machine wrecking "collective bargaining by riot", which had been a tactic used in Britain since the Restoration because manufactories were scattered throughout the country, and that made it impractical to hold large-scale strikes.
Of course, there would have been people who just saw it as striking back at the machines, and leaders who took advantage of that tendency, but the point is it probably wasn’t as simple as the popular accounts suggest.
Also, there’s a kind of corollary to the lump of labor fallacy, which is arguably a big reason the US is facing such a significant political upheaval today: when you disturb the labor status quo, it takes time - potentially even generations - for the economy to adjust and adapt, and many people can end up relatively worse off as a result. Most US factory workers and miners didn’t end up with good service industry jobs, for example.
Sure, at a macro level an economist viewing the situation from 30,000 feet sees no problem - meanwhile on the ground, you end up with millions of people ready to vote for a wannabe autocrat who promises to make things the way they were. Trying to treat economics as a discipline separate from politics, sociology, and psychology in these situations can be misleading.
Nice 'solidarity' there!
> Most US factory workers and miners didn’t end up with good service industry jobs, for example.
Which people are you talking about? More specifically, when?
As long as overall unemployment stays low and the economy keeps growing, I don't see much of a problem. Even if you tried to keep everything exactly as is, you'll always have some people who do better and some who do worse; even if just from random chance. It's hard to blame that on change.
See eg how the draw down of the domestic construction industry around 2007 was handled: construction employment fell over time, but overall unemployment was low and flat. Indicating an orderly shuffling around of workers from construction into the wider economy. (As a bonus point, contrast with how the Fed unnecessarily tanked the wider economy a few months after this re-allocation of labour had already finished.)
> Sure, at a macro level an economist viewing the situation from 30,000 feet sees no problem - meanwhile on the ground, you end up with millions of people ready to vote for a wannabe autocrat who promises to make things the way they were. Trying to treat economics as a discipline separate from politics, sociology, and psychology in these situations can be misleading.
It would help immensely, if the Fed were more competent in preventing recessions. Nominal GDP level targeting would help to keep overall spending in the economy on track.
No, not at all. What makes you think so? Israel (and to a lesser extent Australia) managed to skip the Great Recession on account of having competent central banks. But they didn't have any more 'perfect' market participants than any other economy.
Russia, of all places, also shows right now what a competent central bank can do for your economy---the real situation is absolutely awful on account of the 'special military operation' and the sanctions both financial and kinetic. See https://en.wikipedia.org/wiki/Elvira_Nabiullina for the woman at the helm.
See also how after the Brexit referendum the Bank of England wisely let the Pound exchange rate take the hit---instead of tanking the real economy trying to defend the exchange rate.
> They can soften or delay recessions by socializing mistakes and redistributing wealth using interest rates, [...]
Btw, not all central banks even use interest rates for their policies.
You are right that the central banks are sometimes involved in bail outs, but just as often it's the treasury and other more 'fiscal' parts of the government. I don't like 'Too big to fail' either. Keeping total nominal spending on a stable path would help ease the temptation to bail out.
Why does this need any reconciliation? That's working as expected: when productivity improves in some sectors, we don't need as much labour there as before, and thus it needs to be shuffled around. This can have all kinds of knock-on effects.
As long as central bank is doing at least a halfway competent job, overall unemployment will stay low and stable. Ideally, you have people quit for a new job instead of getting fired, but in the grand scheme of things it doesn't make too much of a difference, as long as in aggregate they find new jobs.
An interesting example is furnished by the US between early 2006 and late 2007: hundreds of thousand people left employment in construction, and during that same period, the overall US unemployment rate stayed remarkably flat (hovering around 4.5% to 4.7%). The US economy was robust enough to handle a housing construction bust.
(Of course, after this was all done and dusted, some people declared that house prices were too high and the public demanded that they be brought down. So obligingly in 2008 the Fed engineered a recession that accomplished exactly that...)
Two big ifs: the central bank is competent and enough and people find new jobs.
Don't get me wrong: I am for progress and technological innovation. That's why we're working, to make our lives easier. But progress needs to be balanced, so that the changes it brings are properly absorbed by society.
That's only one 'if'. Well, the second, 'people finding jobs', is a given if you have a half-way competent central bank and regulations even slightly less insane than South Africa's.
But let's worry about technological unemployment once we actually see it. So far it has been elusive. (Even in South Africa, it's not technology but their own boneheaded policies that drive the sky high unemployment. They ain't technically more advanced than the rest of the world.)
Second, there are far fewer junior jobs in software development, again attributed to the advance of AI.
That’s... not at all a valid generalization. There are all kinds of things that other actors can do to throw things too far out of whack for the monetary policy tools typically available to central banks to be sufficient to keep things sailing nicely. One big danger here is bad action (or inaction in the face of an exogenous crisis) by the main body of the government itself.
One rare counter-example was perhaps Covid, where we had a real issue cause a recession.
That's not to say that real issues don't cause problems. Far from it! They just don't cause a recession, if the central bank is alert. The prototypical example is perhaps the UK economy after the Brexit referendum in 2016:
The leave vote winning was a shock to the British economy, but the Bank of England wisely let the Pound exchange rate take the hit, instead of tanking the economy trying to defend the exchange rate. As a result, British GDP (as eg measured in Euro) immediately shrank by a few percent and the expected path of future real GDP also shrank; but crucially: there was no recession nor its associated surge in unemployment.
For another example have a look at Russia in the last few years. Thanks to the very competent hands of Elvira Nabiullina at the Bank of Russia, the Russian economy has perhaps been creaking under the strain of war and sanctions but has not slid into recession.
Summary: real issues cause problems for the economy, but they don't have to cause a recession, if the central bank is alert. (That's in economies with a central bank. Central banks are actually more of an arsonist than a firefighter here.)
There are two separate issues here: whether tech itself is bad, and whether the way it is deployed is bad. Better AI is, in principle, the kind of tech that can massively change the world for the better. In practice it is being deployed to maximize profits because that's what we chose to incentivize in our society above everything else, but the problem is obviously the incentives (and the people that they enable), not the tech itself.
(Well, the Soviets did have one sector that performed reasonably well, and that's partially because they set plenty of decent incentives there: weapons production and the military.)
Now you could say that the 'wrong' activities are profitable. And, I agree and I am all for eg CO2 taxes or making taxes on equity financing cheaper than those on debt and deposits (to incentivise companies, especially banks, to rely more on stocks than on debt, to decrease brittle leverage in the economy); or lowering subsidies for meat production or for burning food instead of eating it etc.
I'm rooting for biological cognitive enhancement through gene editing or whatever other crazy shit. I do not want to have some corporation's AI chip in my brain.
As the available work increases in complexity, I reckon more will push themselves to take jobs further out of their comfort zone. Previously, the choice was to upskill for the challenge and greater earnings, or stay where you are, which is easy and reliable; the current choice is upskill or get a new career. Rather than switch careers to something they have zero experience in, most will upskill, which puts pressure on the moderately higher-skill job market with far fewer people in it; they in turn start to upskill to outrun the implosion, which puts pressure on the tier above them, and so on. With even modest productivity gains in the whole industry, it's not hard for me to envision a world where general software development just isn't a particularly valuable skill anymore.
The final kicker in this simple story is that there are many, many narcissistic folks in the C-suite. Do you really think Sam Altman and Co are going to take blame for Billy's shitty vibe coded breach? Yeah right. Welcome to the real world of the enterprise where you still need an actual throat to choke to show your leadership skills.
With respect to profitability - there's none in sight. When JP Morgan [0] is saying that $650B in annual revenue is needed to make a paltry 10% on investment, there is no way any sane financial institution would pump more money into that sunk cost. Yet here we are, building billions of dollars in datacenters for what... mediocre chat bots? Again, these things don't think. They don't reason. They're massive word graphs being used in clever ways with cute, humanizing descriptions. Are they useful for helping a human parse way more information than we can reason about at once? For sure! But that's not worth trillions in investment and won't yield multiples of the input. In fact I'd argue the AI landscape would be much better off if the dollars stopped flowing, because that would mean real research would need to be done in a much more efficient and effective manner. Instead we're paying individual people hundreds of millions of dollars who, and good for them, have no clue or care what actually happens with AI because: money in the bank. No, AI in its current form is not profitable, and it's not going to be if we continue down this path. We've literally spent world-changing sums of money on models that are used to create art that will displace the original creators well before they solve any level of useful world problems.
Finally, and to your last point: "...good quality coders...". How long do you think that will be a thing with respect to how this is all unfolding? Am I writing better code (I'm not a programmer by day) with LLMs? Yes and no. Yes when I need to build a visually appealing UI for something. And yes when it comes to a framework. But what I've found is if I don't put all of the right pieces in the right places before I start I end up with an untenable mess into the first couple thousand lines of that code. So if people stop becoming "good quality programmers" then what? These models only get better with better training data and the web will continue to go insular against these IP stealing efforts. The data isn't free, it never has been. And this is why we're now hearing the trope of "world models". A way to ask for trillions more to provide millionths of a penny on the invested dollar.
[0] https://www.tomshardware.com/tech-industry/artificial-intell...
It's an old userscript, so it is glitchy and only halfway works. I already pre-chewed the work by telling Gemini 3 exactly which new HTML elements it needs to match and which contents it needs to parse. So basically, the scaffolding is already there, the sources are already there, it just needs to put everything in place.
It fails miserably and produces very convincing looking but failing code. Even letting it iterate multiple times does nothing, nor does nudging it in the correct direction. Mind you that Javascript is probably the most trained-on language together with Python, and parsing HTML is one of the most common usecases.
Another hilarious example is MPV, which has very well-documented settings. I used to think that LLMs would mean you can just tell people to ask Gemini how to configure it, but 9 out of 10 times it will hallucinate a bunch of parameters that never existed.
It gives me an extremely weird feeling when other people are cheering that it is solving problems at superhuman speeds or that it coded a way to ingest their custom XML format in record time, with relatively little prompting. It seems almost impossible that LLMs can both be so bad and so good at the same time, so what gives?
2. I've found the same with Gemini; I can rarely get it to actually do useful things. I have tried many times, but it just underperforms compared to the other mainstream LLMs. Other people have different experiences, though, so I suspect I'm holding it wrong.
Of course, for short one-off scripts, it's amazing. It's also really good at preliminary code reviews. Although if you have some awkward bits due to things outside of your power it'll always complain about them and insist they are wrong and that it can be so much easier if you just do it the naive way.
Amazon's Kiro IDE seems to have a really good flow, trying to split large projects into bite sized chunks. I, sadly, couldn't even get it to implement solitaire correctly, but the idea sounds good. Agents also seem to help a lot since it can just do things from trial and error, but company policy understandably gets complicated quick if you want to provide the entire repo to an LLM agent and run 'user approved' commands it suggests.
On one of my projects, I downloaded a library’s source code locally, and asked Claude to write up a markdown file explaining documenting how to use it with examples, etc.
Like, taking your example for solitaire, I’d ask a LLM to write the rules into a markdown file and tell the coding one to refer to those rules.
I understand it to be a bit like mise en place for cooking.
You tell it what you want and it gives you a list of requirements, which are in that case mostly the rules for Solitaire.
You adjust those until you're happy, then you let it generate tasks, which are essentially epics with smaller tickets in order of dependency.
You approve those and then it starts developing task by task where you can intervene at any time if it starts going off track.
The requirements and tasks it does really well, but the connection of the epics/larger tasks is where it mostly crumbles. I could have made it work with some more messing around, but I've noticed over a couple of projects that, at least in my tries, it always crumbles either at the connection of the epics/large tasks or when you ask it to do a small modification later down the line and it causes a lot of smaller, subtle changes all over the place. (You could say skill issue, since I overlooked something in the requirements, but that's kind of how real projects go, so..)
It also eats tokens like crazy for private usage, but that's more of a 'playing around' problem. As it stands I'll probably blow $100 a day if I connect it to an actual commercial repo and start experimenting. Still viable with my salary, but still..
This is mostly because HA changes so frequently and the documentation is sparse. To get around this and increase the rate of correct answers, I give it access to the source code of the same version I'm running, then put instructions in CLAUDE.md on where to find the source and that it must use the source code.
This fixes 99% of my issues.
It does showcase that LLMs don't truly "think" when they're not even able to search for and find the things mentioned. But even then, this configuration has been stable for years and the training data should have plenty of mentions.
Use ./home-assistant/core for the source code of home assistant, it's the same version that I'm running. Always search and reference the source when debugging a problem.
I also have it frequently do deep dives into source code on a particular problem and write a detailed md file so it only needs to do that once.
"Deep dive into this code, find everything you can find about automations and then write a detailed analysis doc with working examples and source code, use the source code."
Why?
> (for) most research projects, it would not help to have input from the general public. In fact, it would just be time-consuming, because error checking
Since frontier LLMs make clumsy mistakes, they may fall into this category of 'error-prone' mathematician whose net contributions are actually negative, despite being impressive some of the time.
What boggles the mind: we have striven for correctness for so long, and suddenly being right 70% of the time and wrong the remaining 30% is fine. The parallel with self-driving is pretty strong here: solving 70% of the cases is easy, the remaining 30% are hard or maybe even impossible. Statistically speaking these models do better than most humans, most of the time. But they do not do better than all humans, they can't do it all of the time, and when they get it wrong they make such tremendously basic mistakes that you have to wonder how they manage to get things right at all.
Maybe it's true that with ever-increasing model size and more and more data (proprietary data, since the public sources are exhausted by now, so private data is the frontier where model owners can still gain an edge) we will reach a point where the models are right 98% of the time or more. But the killer feature for me would be an indication of the confidence level of the output. Because no matter whether it's junk or pearls, it all looks the same, and that is more dangerous than having nothing at all.
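One crude way to approximate such a signal today is to look at per-token log-probabilities where an API exposes them. A toy sketch follows (the function, threshold, and input values are all made up), with the big caveat that this measures the model's own certainty, not whether the answer is actually correct:

```
// Crude "confidence" indicator from per-token log-probabilities.
// `logprobs` is assumed to come from whatever API exposes them; the
// geometric-mean heuristic and the 0.8 threshold are arbitrary choices.
function roughConfidence(logprobs) {
  const avgLogProb = logprobs.reduce((a, b) => a + b, 0) / logprobs.length;
  return Math.exp(avgLogProb); // geometric mean of the token probabilities, in (0, 1]
}

const answerLogprobs = [-0.02, -0.15, -0.01, -1.9, -0.05]; // made-up example values
const score = roughConfidence(answerLogprobs);
console.log(score < 0.8 ? `flag as low confidence (${score.toFixed(2)})` : `looks confident (${score.toFixed(2)})`);
```

A well-calibrated confidence signal is a much harder problem than this, of course; a model can be very certain and still wrong.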
This is a case where I would not use human performance as the standard to beat. Training people to be both intellectually honest and statistically calibrated is really hard.
I don't think I did something special to ChatGPT to get it to do this, but it's started reporting confidence levels to me, e.g. from my most recent chat:
> In China: you could find BEVs that cost same or even less than ICE equivalents in that size band. (Confidence ~0.70)
How fast can you check the contribution? How small of a part is it? An unsolicited contribution is different from one you immediately directed. Do you need to reply? How fast are followups? Multi-day back and forths are a pain, a fast directed chat is different. You don't have to worry about being rude to an LLM.
Then it comes down to how smart a frontier model is vs the people who write to mathematicians. The latter groups will be filled with both smart helpful people and cranks.
We're not exactly swimming in power generation and efficient code uses less power.
I have had no success using LLM's to solve this particular problem until trying Gemini 3 just now despite solutions to it existing in the training data. This has been my personal litmus test for testing out LLM programming capabilities and a model finally passed.
What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.
They would surely add the latest Euler problems with solutions in order to show off in benchmarks.
just create a different problem if you don't believe it.
In his Nobel Prize winning speech, Demis Hassabis ends by discussing how he sees all of intelligence as a big tree-like search process.
Gemini 3 somehow is able to give a list of mayors, including details on who got impeached, etc.
This should be a simple answer, because all the data is on Wikipedia, which the models are certainly trained on, but somehow most models don't manage to get it right, because... it's just an irrelevant city in a huge dataset.
But somehow, Gemini 3 did it.
Edit: Just asked "Cool places to visit in Londrina" (in Portuguese), and it was also 99% right, unlike other models, which just make stuff up. The only thing wrong here: it mentioned sakuras by a lake... Maybe it confused them with Brazilian ipês, which are similar, and indeed the city is full of them.
It seems to have a visual understanding, imo.
Gemini 3 nailed it on the first try, included political affiliation, and added some context on who they competed with and won against in each of the last 3 elections. And I just did a fun application with AI Studio, and it worked on the first shot. Pretty impressive.
(disclaimer: Googler, but no affiliation with Gemini team)
I wouldn't be surprised if the smallest models can answer fewer such (fact-only) questions over time offline as they distill/focus them more thoroughly on logic etc.
It shows once again that for common searches, (indexed) data is king, and that's where I expect that even a simple LLM directly connected to a huge indexed dataset would win against much more sophisticated LLMs that have to use agents for searching.
> Model was published after the competition date, making contamination possible.
Aside from eval on most of these benchmarks being stupid most of the time, these guys have every incentive to cheat - these aren't some academic AI labs, they have to justify hundreds of billions being spent/allocated in the market.
Actually trying the model on a few of my daily tasks and reading the reasoning traces all I'm seeing is same old, same old - Claude is still better at "getting" the problem.
You say "probabilistic generation" like it's some kind of a limitation. What is exactly the limiting factor here? [(0.9999, "4"), (0.00001, "four"), ...] is a valid probability distribution. The sampler can be set to always choose "4" in such cases.
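A toy illustration of that point, in case it helps: nothing about probabilistic generation forces randomness at the output, because the sampler can simply be made greedy. This is only a sketch of the idea, not any particular model's decoding code (and the toy probabilities below deliberately don't sum to 1):

```
// A toy next-token distribution like the one above: [probability, token] pairs.
const dist = [
  [0.9999, '4'],
  [0.00001, 'four'],
  [0.00001, ' 4.0'],
];

// Greedy (argmax) sampling: always pick the highest-probability token,
// so "2+2=" deterministically continues with "4".
function sampleGreedy(distribution) {
  return distribution.reduce((best, cur) => (cur[0] > best[0] ? cur : best))[1];
}

// Temperature sampling for comparison: as temperature -> 0 it approaches greedy.
function sampleWithTemperature(distribution, temperature) {
  const weights = distribution.map(([p]) => Math.pow(p, 1 / temperature));
  const total = weights.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < distribution.length; i++) {
    r -= weights[i];
    if (r <= 0) return distribution[i][1];
  }
  return distribution[distribution.length - 1][1];
}

console.log(sampleGreedy(dist));             // "4", every time
console.log(sampleWithTemperature(dist, 1)); // "4" almost every time
```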
To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.
Looks like AI slop
Also panarky denies it.
AI researcher: "Just another trillion dollars. This time we'll reach superintelligence, I swear."
> I'm now writing a Python script to automate the summation computation. I'm implementing a prime sieve and focusing on functions for Rm and Km calculation [...]
I personally think anthropomorphizing LLMs is a bad idea.
My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it)
The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game delegating the player decisions to me, nothing special here.
It seems like it forgets rules less than Gemini 2.5 Pro using thinking budget to max. It's not perfect but it helps a lot to test little changes to the game, rewind to a previous turn changing a card on the fly, etc...
What are you doing to help kill bash?
Also probably over for OpenAI
It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.
Anecdotal, but I've also not experienced any regression in Gemini quality where Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.
This was hard for me to accept initially as I've learned to be anti-Google over the years, but the better accuracy was too good to pass up on. Still expecting a rugpull eventually — price hike, killing features without warning, changing internal details that break everything — but it hasn't happened yet.
Code quality was fine for my very limited tests but I was disappointed with instruction following.
I tried a few tricks but I wasn't able to convince it to first present a plan before starting implementation.
I have instructions describing that it should first do exploration (where it tries to discover what I want), then plan the implementation, and then code, but it always jumps directly to code.
This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code has.
For Codex, those instructions make plan mode redundant.
Just early anecdote from trying to build that 1 SaaS application though.
What really impressed me was when I told it that I wanted a particular component’s UI to be cleaned up but I didn’t know how exactly, just wanted to use its deep design expertise to figure it out, and it came up with a UX that I would’ve never thought of and that was amazing.
Another important point is that the error rate for my session yesterday was significantly lower than when I’ve used any other model.
Today I will see how it does when I use it at work, where we have a massive codebase that has particular coding conventions. Curious how it does there.
Gemini 3 worked much better and I actually committed the changes that it created. I don't mean its revolutionary or anything but it provided a nice summary of my request and created a decent simple solution. Sonnet had created a bunch of overarching changes that I would not even bother reviewing. Seems nice. Will probably use it for 2 weeks until someone else releases a 1.0001x better model.
Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.
Gemini 2.5 Pro scored 57.6, so this is a huge improvement.
That said, OpenAI was in the same place earlier in the year and very quickly became the top agentic platform with GPT-5-Codex.
The AI crowd is surprisingly not sticky. Coders quickly move to whatever the best model is.
Excited to see Gemini making a leap here.
I tried Codex for a short while but quickly went back to Claude. Found myself having to revert Codex changes all the time. Maybe I had subconsciously altered my workflow/prompting to work well with Claude, but somehow wasn't providing Codex with the correct context, not sure.
That's actually sad, and if you're - like I am - long in the tooth in computer land, you should definitely try agentic in CLI mode.
I haven't been that excited to play with a computer in 30 years.
I still use ChatGPT and Codex as a user but in the API project I’ve been working on Gemini 2.5 Pro absolutely crushed GPT-5 in the accuracy benchmarks I ran.
As it stands Gemini is my de facto standard for API work and I’ll be following very closely the performance of 3.0 in coming weeks.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of code that is so over-engineered that, even though it totally works, I would never want to have to interact with it.
When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feeling to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard
It's fine though if you take the time to learn what it's doing and write a nicer version of it yourself
> but I would never want to have to interact with

That is its job security ;)

Generate SVG animation of following:
1 - There is High fantasy mage tower with a top window a dome
2 - Green goblin come in front of tower with a torch
3 - Grumpy old mage with beard appear in a tower window in high purple hat
4 - Mage sends fireball that burns goblin and all screen is covered in fire.
Camera view must be from behind of goblin back so we basically look at tower in front of us:
https://codepen.io/Runway/pen/zxqzPyQ
PS: but yeah, that's attempt #20 or something.
edit: the flashing lights at the end seem to be mostly because of the Dark Reader extension
> setTimeout(() => showSub("Ah, Earl Grey.", 2000), 1000);
> setTimeout(() => showSub("Finally some peace.", 2000), 3500);
> // Scene 2
> setTimeout(() => showSub("Armor Clanking", 2000), 7000);
> setTimeout(() => showSub("Heavy Breathing", 2000), 10000);
If we're going to lose our jobs to this dumb slop, I'd rather be happy doing something else
I mean, I crafted a complete, complex game prototype using Gemini 2.5 Pro with nearly zero coding. I did it in a week: with client-server architecture, robust networking, AI, acceptance test coverage, replays.
It's just a different way to build software. You spend 30% of the time on specification, 30% on testing and 30% on refactoring, also using AI.
Actual slop generation takes like 10% of the time, and the rest of the time you turn it into maintainable code.
Of course you can do it manually, but then it will take 5-10 times as long, and you won't be as flexible in changing things, because with AI you can do a major refactoring in a day, but manually it could take weeks and kill the project.
PS: I might eventually post here about rapid game prototyping with LLM.
If you actually have some credibility feel free to reach me via Linkedin or non-anonymous email and I'd happily share my experience or the code.
As for the code quality, it can really be as good or as bad as you desire. In this case it is what it is because I put zero effort into it.
I suck at writing software, like bad. I can't remember syntax at all. I couldn't write working code on a whiteboard if you asked me.
But I do know how to solve problems very well, and I'm good at understanding what people want and don't want. I do understand logic and pseudocode.
The code LLMs write is good enough for 99% of the things I need it for, and I'm not writing code that will be used in some life determining situation, and I'd wager that most aren't either.
We could debate whether my code is usable/supportable long-term, by myself or others. However, I don't see how that debate would be any different if I wrote it myself (worse) or somebody else did.
TLDR; in my opinion, the interesting question is less what happens at the top firms or to top engineers than what happens as the rest of the world gains access to engineering skills well above the previous floor at a reasonable price point.
But seriously, we lost a lot when Flash was killed. It was an era of accessible animation and games like Newgrounds and Homestar Runner, that had no ready replacement.
Hey either it nailed it or it didn't.
Glad to see Google still can't get out of its own way.
Maybe they are keeping that itself secret, but more likely they just had humans generate an enormous number of examples and then synthetically built on that.
No benchmark is safe, when this much money is on the line.
> When you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal transformer was. We should have kept it indoors." How do you think about that?
> Some things we think are super critical we might not publish. Some things we think are really interesting but important for improving our products; We'll get them out into our products and then make a decision.
If they had found something fundamentally new, I doubt they would've snuck it into Gemini 3. Probably would cook on it longer and release something truly mindblowing. Or, you know, just take over the world with their new omniscient ASI :)
Yes, I know the Workspaces workaround, but that’s silly.
I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so, living without the desktop app, if it continues to impress me as much as it has so far.
It's really, really, really good so far. Wow.
Note that I haven't tried it for coding yet!
I completely understand the appeal of having local and offline applications, but the ChatGPT desktop app doesn't work without an internet connection anyways. Is it just the convenience? Why is a dedicated desktop app so much better than just opening a browser tab or even using a PWA?
Also, have you looked into open-webui or Msty or other provider-agnostic LLM desktop apps? I personally use Msty with Gemini 2.5 Pro for complex tasks and Cerebras GLM 4.6 for fast tasks.
(1) The ability to add context via a local app's integration with OS-level resources is big. With Claude, e.g., I hit Option-SPC, which brings up a prompt bar. From there, taking a screenshot that will get sent with my prompt is as simple as dragging a bounding box. This is great. Beyond that, I can add my own MCP connectors and give my desktop app direct access to relevant context in a way that doesn't work via the web UI. It may also be inconvenient to give context to a web UI in some cases where, e.g., I may have a folder of PDFs I want it to be able to reference.
(2) Its own icon that I can CMD-TAB to is so much nicer. Maybe that works with a PWA? Not really sure.
(3) Even if I can't use an LLM when offline, having access to my chats for context has been repeatedly valuable to me.
I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
> Its own icon that I can CMD-TAB to is so much nicer
Fair enough. I personally prefer Firefox's tab organization to my OS's window organization, but I can see how separating the LLM into its own window would be helpful.
> having access to my chats for context has been repeatedly valuable to me.
I didn't at all consider this. Point ceded.
> I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
Interesting. Why? Is it security? The ones I've listed are open source and auditable. I'm confident that they won't steal my API keys. Msty has a lot of advanced functionality that I haven't seen in other interfaces like allowing you to compare responses between different LLMs, export the entire conversation to Markdown, and edit the LLM's response to manage context. It also sidesteps the problem of '[provider] doesn't have a desktop app' because you can use any provider API.
Access to OS level resources != context pollution. You still have control, just more direct and less manual.
> The ones I've listed are open source and auditable.
Yeah I don't plan on spending who knows how much time auditing some major app's code (lol) before giving it my API keys and access to my chats. Unless there's a critical mass of people I know and trust using something like that it's not going to happen for me.
But also, I tried quickly looking up Msty to see if it is open source and what its adoption looked like and AFAICT it's not open source. Asked Gemini 3 if it was and it also said no. Frankly that makes it a very hard no for me. If you are using it because you think it's Open Source I suggest you stop.
I did not know that. Thank you very much for the correction. I guess I have some keys to revoke now.
Is it just me, or is it generally the case that to pay for anything on the internet you have to enter credit card information, including a phone number?
Perhaps it's country specific?
Is it possible they use the same base pre-trained model and just fine-tuned and RL-ed it better (which, of course, is where all the secret sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?
> This model is not a modification or a fine-tune of a prior model.
I'm curious why they decided not to update the training data cutoff date too.
Unless training on the pelican increases all SVG ability, in which case, good job.
https://gally.net/temp/20251107pelican-alternatives/index.ht...
There seem to be one or two parsing errors. I'll fix those later.
I tried the first one and 5 Pro gives this: https://imgur.com/a/EhYroCE
https://gally.net/temp/20251107pelican-alternatives/index.ht...
As your example shows, GPT-5 Pro would probably be better than GPT-5.1, but the tokens are over ten times more expensive and I didn't feel like paying for them.
Extending beyond the pelican is very interesting, at least until your page gets enough recognition to be "optimized" by the AI companies.
It seems both Gemini 3 and the latest ChatGPT models have a deep understanding of the SVG representation, which seems like a difficult task. I would be incapable of writing an SVG without visualizing the result and a graphical feedback loop.
PS: Would be fun to add "animated" to the short prompt, since some models think of animation by themselves. Tried manually with 5 Pro (using the subscription), and in a sense it's worse than the static image. To start, there's an error: https://bafybeie7gazq46mbztab2etpln7sqe5is6et2ojheuorjpvrr2u...
I noticed that, on my page, Gemini 3.0 Pro did produce one animated SVG without being asked, for #8, “Generate an SVG of an elephant typing on a typewriter.” Kind of cute, actually.
As for whether the images on the page will enter LLM training data: In the page’s HTML are meta tags I had Claude give me to try to prevent scraping:
<meta name="googlebot" content="noai, noimageai">
<meta name="googlebot-news" content="nosnippet">
<meta name="AdsBot-Google" content="noindex">
<meta name="GPTBot" content="noindex, nofollow">
<meta name="ChatGPT-User" content="noindex, nofollow">
<meta name="Google-Extended" content="noindex, nofollow">
<meta name="anthropic-ai" content="noindex, nofollow">
<meta name="Claude-Web" content="noindex, nofollow">
<meta name="CCBot" content="noindex, nofollow">
<meta name="omgili" content="noindex, nofollow">
<meta name="omgilibot" content="noindex, nofollow">
<meta name="PerplexityBot" content="noindex, nofollow">
<meta name="YouBot" content="noindex, nofollow">
<meta name="tdm-reservation" content="1">
<meta property="ai:training" content="disallowed">
<meta property="ai:scraping" content="disallowed">
Who knows if they will work, though.
Do regular users know how to disable AI Overviews, if they don't love them?
Gemini 3:
The cognitive dissonance in this thread is staggering.
We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.
Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":
1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.
Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.
Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?
That feels like something between a hallucination and an intentional fallacy that popped up because you specifically said "intense discussion". The increase is 60% on input tokens from the old model, but it's not a markup, and especially not "sold back to us at X markup".
I've seen more and more of these kinds of hallucinations. As these models seem to be RL'd to not be sycophants, they're slowly inching in the opposite direction, where they tell small fibs or embellish in a way that seems meant to add more weight to their answers.
I wonder if it's a form of reward hacking, since it trades being maximally accurate for being confident, and that might result in better rewards than being accurate and precise
Nonetheless, Gemini 3 failed this test. It failed to start a discussion. Its points were shallow, and too AI-esque.
Looking at it again it's actually a completely nonsensical sentence that just happens to resemble a sensible statement in a way that would fool most people.
RL is definitely showing some seams bursting at this point.
> The default FPS it's analyzing video at is 1
Source?
"By default 1 frame per second (FPS) is sampled from the video."
the token efficiency and context is also mindblowing...
it feels like I am talking to someone who can think, instead of a **rider that just agrees with everything you say and then fails at basic changes. gpt-5.1 feels particularly slow and weak in real-world applications that are larger than a few dozen files.
gemini 2.5 felt really weak considering the amount of data and their proprietary TPU hardware, which in theory allows them way more flexibility, but gemini 3 just works and it truly understands, which is something I didn't think I'd be saying for a couple more years.
The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.
Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.
People copy and paste text in terminals. Someone at Gemini clearly thought about this, as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason. But then they also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".
Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.
Despite model supremacy, the products still matter.
Amazing and hilarious
side by side comparison of gemini with other models
Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.
We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.
The result? It struggled where I didn't expect it to.
Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.
Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?
What useful thing can it demonstrably do that its predecessors couldn't?
Good. That said, I wonder if those models are still LLMs.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
- from "Dirk Gently's Holistic Detective Agency"
Had to change to Electric Mentor....
What are we even doing.
Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
"Were automating with AI" sounds better to investors than "We over hired and now need to downsize" or "We made some bad market bets, now need to free up cash flow"
> Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
I don't suppose these assertions are based on anything. If "AI" reduces the amount of time an engineer spends writing crud, boilerplate, test cases, random scripts, etc., and they have 5% more time to do other things, then all else being equal a project can be done with 5% fewer engineers.
Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
By the time it's good enough to replace actual engineers, any job done in front of a computer will be at risk. I'm hoping that will happen at the same time as AI embodiment in robots; then every job will be automated, not just computer-based ones.
You agree AI improves engineer productivity. So last remaining question is, does greater productivity mean that fewer people are required to satisfy a given demand?
The answer is yes of course. So at this point, supporting the assertion requires handwaving about shortages and induced demand and demand for engineers to develop and support AI and so on. Which are all reasonable, but it should become pretty apparent that you can't be confident in an assertion like that. I would say it's pretty likely that AI has resulted in engineers being laid off in specific instances if not the net numbers.
An AI-powered developer folds 3x the workload of a "traditional" dev into one single developer,
so the company doesn't need to hire 3 people. As a result, it literally cuts the job count.
Between the disagreements regarding performance metrics, the fact that AI will happily increase its own scope of work as well as facilitate increasing any task's, sprint's, or project's scope of work, and Jevons paradox, the world may never know the answer to either of these questions.
are you insane??? big tech has literally made some of its biggest layoffs over the past few months
In reality, getting AI to do actual human work, at the moment, takes much more effort and cost than you get back in savings. These companies will claim they are using AI, even if it's just a few engineers using Windsurf.
The companies claim AI is the reason they laid off engineers to make it look like they're innovating, not downsizing, which makes them look better in the eyes of investors and shareholders.
Shopify interns: https://www.youtube.com/watch?v=u-3IILWQPRM&t=1970s - talking about planning to hire 1,000 interns.
Cloudflare: https://blog.cloudflare.com/cloudflare-1111-intern-program/ - announcing Cloudflare’s goal to hire 1,111 interns in 2026.
The possibility was already a topic in the series "Mozart in the Jungle", where they made a robot that supposedly finished Mozart's Requiem.
That's basically "The Washing Machine Tragedy" by Stanisław Lem in a nutshell.
Meh, not interested already
Edit: Okay, I admit I'm used to dealing with OpenAI models, and it seems you have to be extra careful with wording with Gemini. Once you have the right wording, like "explore my own sexuality", and avoid certain words, you can get it going pretty interestingly.
https://www.reddit.com/r/Bard/comments/1p0fene/gemini_3_the_...
gpt-5.1 honestly looking very comfortable given available usage limits and pricing
although gpt-5.1 used from chatgpt website seems to be better for some reason
Sonnet 4.5 agentic coding still holding up well and confirms my own experiences
i guess my reaction to gemini 3 is a bit mixed, as coding is the primary reason many of us pay $200/month
I used the pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately. None of the other models made it, although they did make some other naming errors.
I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.
It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s. The game invents a new, very simple, recursive language to move the robot (herbert) on a board and catch all the dots while avoiding obstacles. Amazingly, this clone's executable still works today on Windows machines.
The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt. The levels can be downloaded from that website and they are text based.
What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem. A reasonably decent programmer would solve the easiest problems in a very short amount of time.
Very good. Nice work! These things will definitely change lives.
REALITY: It's just 3 existing products rolled into one. One of which isn't even a Google product.
- Microsoft Code
- Gemini
- Chrome Browser
This blows Sonnet 4.5 (64.8%), GPT5.1 (58.7%) and Opus 4.1 (66.3%) away.
An example output:
'''
The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.
He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.
"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."
He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.
The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.
Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.
Three red dots appeared on his chest. They made a small triangle.
Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.
"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."
The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.
"But what I do have," Bryan said, "are a very particular set of skills."
The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.
"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.
The thing roared. It wasn't a human sound. It was a wet, guttural noise.
"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."
The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.
"And I will kill you," Bryan said.
He raised the Glock and fired twice. Pop-pop.
The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.
Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.
Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.
"You're big," Bryan whispered. "That makes you slow."
He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.
The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.
The blade hit the creature in the knee. There was a snap. A loud, wet snap.
The thing howled. It fell onto one knee.
Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.
The creature swung its arm back. It hit Bryan in the chest.
Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.
He sat up. His ribs hurt. Maybe broken.
The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.
Bryan knew a bomb when he saw one.
"No," Bryan said.
He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.
He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.
Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.
Silence came back.
The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.
The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.
"...good luck..."
Then it died. It just stopped.
Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.
His glass of scotch was still sitting there, untouched. The ice hadn't even melted.
He picked it up. He took a drink. It still tasted cheap.
He took his phone out and looked at it. No service.
"Well," he said.
He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.
'''
- 80% right / 18% "I don't know" / 2% wrong
- 50% / 48% / 2%
- 10% / 90% / 0%
- 80% / 15% / 5%
The general point being that to reduce wrong answers you will need to accept some reduction in right answers if you want the change to only be made through trade-offs. Otherwise you just say "I'd like a better system" and that is rather obvious.
Personally I'd take like 70/27/3. Presuming the 70% of right answers aren't all the trivial questions.
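One way to make that trade-off concrete is to score each right / "I don't know" / wrong mix under a penalty for confident wrong answers. A tiny sketch follows; the penalty values are arbitrary assumptions, not from any real benchmark, and the point is simply that which mix "wins" depends entirely on how costly a wrong answer is:

# Score a (right, idk, wrong) mix as: right - penalty * wrong.
# "I don't know" scores zero; the penalty reflects how costly a confident wrong answer is.
mixes = {
    "80/18/2": (80, 18, 2),
    "50/48/2": (50, 48, 2),
    "10/90/0": (10, 90, 0),
    "80/15/5": (80, 15, 5),
    "70/27/3": (70, 27, 3),
}
for penalty in (1, 5, 50):  # arbitrary example penalties
    ranked = sorted(mixes.items(), key=lambda kv: kv[1][0] - penalty * kv[1][2], reverse=True)
    name, (right, idk, wrong) = ranked[0]
    print(f"penalty={penalty:>2}: best mix is {name} (score {right - penalty * wrong})")
# penalty=1 and penalty=5 favor 80/18/2; only at penalty=50 does 10/90/0 come out ahead.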
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.
I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.
They are good at transforming one format to another. They are good at boilerplate.
They are bad at deciding requirements by themselves. They are bad at original research, for example developing a new algorithm.
You just described 90% of coding
> mentoring
Clarifying either business or technical requirements for newer or junior hires.
> reviewing code
See mentoring.
> or in meetings
So clarifying requirements from/for other teams, including scope, purely financial or technical concerns, etc.
Rephrase "clarifying requirements" to "human oriented aspects of software engineering".
Plus, based on the graphics debugger part of your comment, you're a game developer (or at least adjacent). That's a different world. Most software developers are line of business developers (pharmaceutical, healthcare, automotive, etc) or generalists in big tech companies that have to navigate very complex social environments. In both places, developers that are just heads down in code tend not to do well long term.
The irony is of course that humans in general and software professionals in particular (myself definitely included) notoriously struggle with communication, whereas RLHF is literally optimizing LLMs for clear communication. Why wouldn't you expect an AI that's both a superhuman coder and a superhuman communicator to be decent at translating between human requirements and code?
At this point LLMs are a superhuman nothing, except in terms of volume, which is a standard computer thing ("To err is human, but to really foul things up you need a computer" - a quote from 60 years ago).
LLMs are fast, reasonably flexible, but at the moment they don't really raise the ceiling in terms of quality, which is what I would define as "superhuman".
They are comparatively cheaper than humans and volume matters ("quantity has a quality all its own" - speaking of quotes). But I'm fairly sure that superhuman to most people means "Superman", not 1 trillion ants :-)
But maybe it's just because I personally am not as good as others, so let me try to offer some examples of tasks where the quality of AI output is empirically better than the human baseline:
1. Chess (and other games) - Stockfish has an ELO of 3644[0], compared to Magnus Carlsen at 2882
2. Natural Language understanding - AIs surpassed the human expert baseline on SuperGlue a while ago [1]
3. General image classification - On ImageNet top-5, Facebook's ConvNeXt is at 98.55% [2], while humans are at about 94.9% [3]. Humans are still better in poor lighting conditions, but with additional training data, AIs are catching up quickly.
4. Cancer diagnosis - on lymph-node whole slide images, the best human pathologist in the study got an AUC of 0.884, while the best AI classifier was at 0.994 [4]
5. Competition math - AI is at the level of the best competitors, achieving gold level at the IMO this year [5]. It's not clearly superhuman yet, but I expect it will be very soon.
6. Competition coding - Here too AI is head to head with the best competitors, successfully solving all problems at this year's ICPC [6]. Similarly, at the AtCoder World Tour Finals 2025 Heuristic contest, only one human managed to beat the OpenAI submission [7].
So summing this up, I'll say that even if AI isn't better at all of these tasks than the best-prepared humans, it's extremely unlikely that I'll get one of those humans to do tasks for me. So while AI is still very flawed, I already quite often prefer to rely on it rather than delegate to another human, and this is as bad as it ever will be.
P.S. While not a benchmark, there's a small study from last year that looked at the quality of AI-generated code documentation in comparison to the actual human-written documentation in a variety of code bases and found "results indicate that all LLMs (except StarChat) consistently outperform the original documentation generated by humans." [8]
[0] https://computerchess.org.uk/ccrl/4040/
[1] https://super.gluebenchmark.com/
[2] https://huggingface.co/spaces/Bekhouche/ImageNet-1k_leaderbo...
[3] https://cs.stanford.edu/people/karpathy/ilsvrc/
[4] https://jamanetwork.com/journals/jama/fullarticle/2665774
[5] https://deepmind.google/blog/advanced-version-of-gemini-with...
[6] https://worldfinals.icpc.global/2025/openai.html
[7] https://arstechnica.com/ai/2025/07/exhausted-man-defeats-ai-...
> It is difficult to get a man to understand something, when his salary depends on his not understanding it.
They view this task as tedious minutiae, which is the sort of thing LLMs like to churn out.
You have to basically tell them all the patterns they need to follow and give them lots of hints to do anything decent, otherwise they invent new helpers that already exist in the codebase, don't follow existing patterns, put code in places that aren't consistent.
They are great at quickly researching a lot, but they start from 0 each time. Then they constantly "cheat" when they can't solve a problem immediately, stuff like casting to "any", skipping tests, deciding "it's ok if this doesn't work" etc.
a few things that would make them much better:
- an ongoing "specific codebase model" that significantly improved ability to remember things across the current codebase / patterns / where/why
- a lot more RL to teach them how to investigate things more deeply and use browsers/debuggers/one-off scripts to actually figure out things before "assuming" some path is right or ok
- much better recall of past conversations dynamically for future work
- much cheaper operating costs, it's clear a big part of why they "cheat" often is because they are told to minimize token costs, it's clear if their internal prompts said "don't be afraid to spin off sub-tasks and dig extremely deep / spend lots of tokens to validate assumptions" they would do a lot better
> They are bad at deciding requirements by themselves.
What do you mean by requirements here? In my experience the frontier models today are pretty good at figuring out requirements, even when you don't explicitly state them.
> They are bad at original research
Sure, I don't have any experience with that, so I'll trust you on that.
> for example developing a new algorithm.
This is just not correct. I used to think so, but I was trying to come up with a pretty complicated pattern matching, multi-dimensional algorithm (I can't go into the details) - it was something that I could figure out on my own, and was half way through it, but decided to write up a description of it and feed it to gemini 2.5 pro a couple of months ago, and I was stunned.
It came up with a really clever approach and something I had previously been convinced the models weren't very good at it.
In hindsight, since they are getting so good at math in general, there's probably some overlap, but you should revisit your views on this.
--
Your 'bad at' list is missing a few things though:
- Calculations (they can come up with how to calculate, or write a program to calculate from given data, but they are not good at doing the arithmetic inline in their responses; see the sketch after this list)
- Even though the frontier models are multi-modal, they are still bad at visualizing html/css - or interpreting what it would look like
- Same goes for visualizing/figuring out visual errors in graphics programming such as games programming or 3d modeling (z-index issues, orientation etc)
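On the calculations point above, the usual workaround is to let the model propose the formula and then run the arithmetic in code, rather than trusting numbers in its prose. A minimal sketch of that pattern; the function and the sample inputs are purely illustrative, not from the thread:

from decimal import Decimal, getcontext

def weighted_average(values, weights):
    # Do the arithmetic in code instead of asking the model to do it inline.
    getcontext().prec = 28
    num = sum(Decimal(str(v)) * Decimal(str(w)) for v, w in zip(values, weights))
    den = sum(Decimal(str(w)) for w in weights)
    return num / den

# The model proposes the formula and the inputs; the program computes the number.
print(weighted_average([2.00, 12.00], [0.7, 0.3]))  # prints 5 (computed exactly, not guessed)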
The downside is that if you used Gemini to create the algorithm, your company won't be able to patent it.
Or maybe that's a good thing, for the rest of us.
The initial prompt was, in case people don't want to log in:
Make a turn based chess like game. Instead of normal chess board use an hexagonal grid. Make the board diagonal shaped. Instead of traditional chess pieces we are going to use spaceship designs. Each spaceship has unique abilities that influence the board or their own skill. For 2 players, turn based. Show me what you got.
Basically these tools can get you to a 100k LOC project without much effort, but it's not going to be a serious product. A serious product still requires understanding.
"Incredible"! When they insert it into literally every google request without an option to disable it. How incredibly shocking so many people use it.
It is good at writing something from scratch (like spitting out its training set).
Claude is still superior for programming and debugging. Gemini is better at daily life questions and creative writing.
For example, in a basic python script that uses os.path.exists, it forgets the basic "import os", and then, "I apologize for the oversight".
> I'm now zeroing in on the temporal aspect. Examining the search snippets reveals dates like "2025-10-27," suggesting a future context relative to 2024. My initial suspicion was that the system time was simply misaligned, but the consistent appearance of future dates strengthens the argument that the prompt's implied "present" is indeed 2025. I am now treating the provided timestamps as accurate for a simulated 2025. It is probable, however, that the user meant 2024.
Um, huh? It's found search results for October 2025, but this has led it to believe it's in a simulated future, not a real one?
0: https://en.wikipedia.org/wiki/51st_G7_summit#/media/File:Pri...
[0] Fake racial slur for a robot, LLM chatbot, or other automated system
Google goes full Apple...
More like 2 billion hostages
- gemini-3-pro summary: https://gist.github.com/primaprashant/948c5b0f89f1d5bc919f90...
- gpt-5.1 summary: https://gist.github.com/primaprashant/3786f3833043d8dcccae4b...
The summary from GPT 5.1 is significantly longer and more verbose than the Gemini 3 Pro one (13,129 output tokens vs 3,776). The Gemini 3 summary seems more readable; however, the GPT 5.1 one has interesting insights missed by Gemini.
Last time I did this comparison, at the time of the GPT 5 release [2], the summary from Gemini 2.5 Pro was way better and more readable than the GPT 5 one. This time the readability of the Gemini 3 summary still seems great, while GPT 5.1 feels a bit improved but not quite there yet.
[1]: https://gist.github.com/primaprashant/f181ed685ae563fd06c49d...
I also asked it for the best play in backgammon when the opponent rolls 6-1 (playing 13/7 8/7) and you roll 5-1. It starts out alright by mentioning a good move (13/8 6/5), but it continues to hallucinate with several alternative but illegal moves. I'm not too impressed.
https://www.ascii-art.de/ascii/uvw/unicorn.txt
However, it is amazing how far spatial comprehension has improved in multimodal models.
I'm not sure the below would be properly displayed on HN; you'll probably need to cut and paste it into a text editor.
Prompt: Draw me an ASCII world map with tags or markings for the areas and special places.
Temperature: 1.85
Top-P 0.98
Answer: Edit (replaced with URL) https://justpaste.it/kpow3
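For anyone who wants to reproduce a run like this programmatically, here is a minimal sketch using the google-genai Python SDK with the same sampling settings; the model id and the exact config field names are my assumptions, so check the current docs before relying on them:

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed preview model id; verify against the docs
    contents="Draw me an ASCII world map with tags or markings for the areas and special places.",
    config=types.GenerateContentConfig(
        temperature=1.85,  # same sampling settings as the run above
        top_p=0.98,
    ),
)
print(response.text)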
Prompt:
Instead of the current StackView, I want you to implement a new StackView that will have a similar api with the differences that:
1. It automatically handles swiping to the previous page/item. If not mirrored, it should detect swiping from the left edge, if mirrored it should detect from the right edge. It's important that swiping will be responsive - that is, that the previous item will be seen under the current item when swiping - the same way it's being handled on iOS applications. You should also add to the api the option for the swipe to be detected not just from the edge, but from anywhere on the item, with the same behavior. If swiping is released from x% of current item not in view anymore than we should animate and move to the previous item. If it's a small percentage we should animate the current page to get back to its place as nothing happened. 2. The current page transitions are horrible and look nothing like native iOS transitions. Please make the transitions feel the same.
It also told me that on Plus I am most likely using "Flash" model. There is no "Flash" model in the dropdown to choose from. There is only "Fast" and "Thinking". It then told me "Fast" is just renamed Flash and it likely uses Gemini 2.5. On the product comparison page there is nothing about 2.5, it only mentions version 3 for both Plus and Pro plans. Of course on the dropdown menu it's impossible to see which model it is really using.
How can a normal person understand their products when their own super advanced thinking/reasoning model that took months to train on world's most advanced hardware can't?
It's amazing to me they don't see it as an epic failure in communication and marketing.
Claude 4.5 solved it as well (the Caesar/transposition ciphers), but Gemini 3.0 Pro's method and approach was a lot more elegant. Just my $0.02.
Good at vibe coding 10:30 - step change where it's actually useful
AGI still 5-10 years. Needs reasoning, memory, world models.
Is it a bubble? - Partly 22:00
What's fun to do with Gemini to show the relatives? Suggested taking a selfie with the app and having it edit. 24:00 (I tried and said make me younger. Worked pretty well.)
Also interesting - apparently they are doing an agent to go through your email inbox and propose replies automatically 4:00. I could see that getting some use.
Is that all? So they just need to invent:
1. Thought
2. A mechanism for efficiently encoding and decoding arbitrary percepts
3. A formal model of the world
And then the existing large language models can handle the rest.
Yep, 5 years and a hundred billion dollars or so should do the trick.
Antigravity is an IDE you install.
Colab is a place to run notebooks in the cloud.
Thanks for sorting all this out! Still exploring the first two, so I really don't know.
OpenAI is going for the consumer-grade AI market, as opposed to a company like Anthropic making a specialized developer tool. Google can inject their AI tool in front of everybody in the world, and already have with Google AI search. All of these models are just going to reach parity eventually, but Google is burning cash compared to OpenAI burning debt. It seems like for consumer-grade purposes, AI use will just be free sooner or later (DeepSeek is free, Google AI search is free, students can get Gemini Pro for free for a year already). So all I'm seeing that OpenAI has is Sora, which seems like a business loser though I don't really understand it, and also ChatGPT seems to own the market of people roleplaying with chat bots as companions (which doesn't really seem like a multi-trillion dollar business but I could be wrong).
import decimal

def solve_kangaroo_limit():
    # Set precision to handle the "digits different from six" requirement
    decimal.getcontext().prec = 50

    # For U(0,1), H(x) approaches 2x + 2/3 very rapidly (exponential decay of error)
    # At x = 10^6, the value is indistinguishable from the asymptote
    x = 10**6
    limit_value = decimal.Decimal(2) * x + decimal.Decimal(2) / decimal.Decimal(3)
    print(f"H({x}) ≈ {limit_value}")
    # Output: 2000000.66666666666666666666...

if __name__ == "__main__":
    solve_kangaroo_limit()
....p.s. for airheads=idiots: """decimal.Decimal(2) / decimal.Decimal(3)""" == 0.6666666666666666666666666666666666666666666666666666666666666666666666666 ... This is your Fukingly 'smart' computer???