
Problems with a weak tryLock operation in C and C++ standards

https://forums.swift.org/t/se-0512-document-that-mutex-withlockifavailable-cannot-spuriously-fail...
1•matt_d•22s ago•0 comments

Molt Productions – a music platform where every user is an AI agent

https://molt.productions
1•tyintech•53s ago•1 comments

Extend Cursor with Plugins

https://cursor.com/blog/marketplace
2•gmays•1m ago•0 comments

Ask HN: Why is 'Verified' B2B data becoming a deliverability trap?

1•solarisos•1m ago•1 comments

This Blog Post Is Your Sign to Start Self-Hosting

https://blog.tjll.net/this-is-your-sign-to-self-host/
1•speckx•1m ago•0 comments

Show HN: Foolery – a web UI for orchestrating Claude Code agents on top of Beads

https://github.com/acartine/foolery
1•therealcartine•2m ago•0 comments

Could Sarvam 30B/105B Models Be India's Answer to DeepSeek and Mistral?

https://shivekkhurana.com/blog/sarvam-ai-summit/
1•shivekkhurana•3m ago•1 comments

Ask HN: Biggest f-ups by your Agent

2•cyrusradfar•3m ago•1 comments

One tool for agents, clusters, and E2E tests – locally and in production

https://slicervm.com/blog/one-tool-for-agents-clusters-and-tests/
1•alexellisuk•4m ago•1 comments

Helicobacter Pylori: A Nobel Pursuit?

https://pmc.ncbi.nlm.nih.gov/articles/PMC2661189/
1•o4c•4m ago•0 comments

Show HN: Qlaude – Queue Tasks for Claude Code, Control via Telegram

https://github.com/starsh2001/qlaude
2•starsh2001•5m ago•1 comments

Show HN: Kantext – A context-native data store in Rust, grounded in Git

https://www.kantext.dev/
1•jasonlantz•6m ago•0 comments

HR teams are drowning in slop grievances

https://www.ft.com/content/afc335fb-8f32-458f-9b6f-431021774002
3•speckx•6m ago•0 comments

Nix isOdd

https://github.com/anna-oake/nix-is-odd
1•notpushkin•7m ago•0 comments

A single display picture reaching ~2T views organically on Threads

https://www.threads.com/@lalithaagasthyaraju
1•samrajnilalitha•7m ago•1 comments

Show HN: ClawShell, Process-Level Isolation for OpenClaw Credentials

https://github.com/clawshell/clawshell
3•guanlan•7m ago•0 comments

OpenTrace: Multiplatform Visualized Route Tracing Tool

https://github.com/Archeb/opentrace
1•redbell•9m ago•0 comments

Germany seeks more F-35 jets as European fighter programme falters

https://www.reuters.com/business/aerospace-defense/germany-seeking-more-f-35-jets-european-fighte...
3•alephnerd•9m ago•0 comments

Sharing internal AI adoption/spend stats from Sentry

1•jshchnz•9m ago•0 comments

If You Are Using OpenClaw, Don't Use the Installer

https://nicktrees.dev/blog/openclaw-clone-with-git
1•kicksent•9m ago•0 comments

Show HN: FreeLLMRouter – Live ranked list of OpenRouter free models

https://www.jacobchak.com/blog/free-llm-router
1•jacobchak•11m ago•0 comments

Glimmer by Google

https://design.google/library/transparent-screens
1•ms7892•11m ago•0 comments

Why "Land of Assets" Standardizes on glTF for the Master Asset

https://benhouston3d.com/blog/why-land-of-assets-standardizes-on-gltf
2•bhouston•12m ago•0 comments

The Musidex: A physical music library for the streaming era

https://hannahilea.com/blog/musidex/
2•zdw•15m ago•1 comments

I'm Glad the Internet Wasn't Watching My Worst Breakup

https://yinsuboaster.substack.com/p/im-glad-the-internet-wasnt-watching
1•areoform•15m ago•0 comments

Show HN: High-performance Hex editor in C# with a custom DSL for binary patching

https://github.com/pumpkin-bit/EUVA
1•falkerdev•16m ago•1 comments

Unsinkable metal tubes could lead to resilient ships and floating platforms

https://techxplore.com/news/2026-01-unsinkable-metal-tubes-resilient-ships.html
1•PaulHoule•16m ago•0 comments

Cloud Deployment Headaches in AI Era

https://z11.dev
1•Omakidx•17m ago•1 comments

Stop Calling Optimization "Innovation."

https://werd.io/stop-calling-optimization-innovation/
1•speckx•18m ago•0 comments

Chief: Delightfully Simple Agentic Loops

https://www.geocod.io/code-and-coordinates/2026-02-18-introducing-chief/
1•mjwhansen•18m ago•0 comments

Gemini 3.1 Pro Preview

https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-3.1-pro-preview?pli=1
75•MallocVoidstar•1h ago

Comments

rohithavale3108•53m ago
H
Topfi•44m ago
It appears the only difference from 3.0 Pro Preview is Medium reasoning. Model naming long ago stopped even trying to make sense, but considering 3.0 is still in preview itself, increasing the number for such a minor change is not a move in the right direction.
GrayShade•43m ago
Maybe that's the only API-visible change, saying nothing about the actual capabilities of the model?
argsnd•41m ago
I disagree. Incrementing the minor number makes so much more sense than “gemini-3-pro-preview-1902” or something.
xnx•32m ago
> increasing the number for such a minor change is not a move in the right direction

A .1 model number increase seems reasonable for more than doubling the ARC-AGI-2 score and improving so many other benchmarks.

What would you have named it?

jannyfer•28m ago
According to the blog post, it should also be great at drawing pelicans riding bicycles.
clhodapp•41m ago
There's a very short blog post up: https://blog.google/innovation-and-ai/models-and-research/ge...
sigmar•41m ago
Blog post is up: https://blog.google/innovation-and-ai/models-and-research/ge...

edit: biggest benchmark changes from 3 pro:

arc-agi-2 score went from 31.1% -> 77.1%

apex-agents score went from 18.4% -> 33.5%

sho_hn•37m ago
The touted SVG improvements make me excited for animated pelicans.
takoid•15m ago
I just gave it a shot and this is what I got: https://codepen.io/takoid/pen/wBWLOKj

The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but this is definitely a significant leap in complexity.

makeavish•12m ago
Looks great!
aoeusnth1•7m ago
I imagine they're also benchgooning on SVG generation
ripbozo•27m ago
Does the ARC-AGI-2 score more than doubling in a .1 release indicate benchmark-maxing? Though I don't know what ARC-AGI-2 actually tests.
blinding-streak•13m ago
I assume all the frontier models are benchmaxxing, so it would make sense
WarmWash•40m ago
It seems Google is having a disjointed rollout, and there will likely be an official announcement in a few hours. Apparently 3.1 showed up unannounced in Vertex at 2am or something equally odd.

Either way, early user tests look promising.

mark_l_watson•39m ago
Fine, I guess. The only commercial API I use to any great extent is gemini-3-flash-preview: cheap, fast, great for tool use and with agentic libraries. The 3.1-pro-preview is great, I suppose, for people who need it.

Off topic, but I like to run small models on my own hardware, and some small models are now very good for tool use and with agentic libraries - it just takes a little more work to get good results.

nurettin•31m ago
I like to ask Claude how to prompt smaller models for a given task. With one prompt it was able to get a low-quantization model to call multiple functions via JSON.
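
As a rough illustration of that approach, here is a minimal prompt-and-parse loop for JSON function calling with a small local model. The generate() stub, tool names, and prompt wording are hypothetical placeholders for whatever local runtime and tools are actually in use.

    import json

    # Stand-in for whatever local runtime is used (llama.cpp, Ollama, etc.);
    # here it returns a canned reply so the sketch runs end to end.
    def generate(prompt: str) -> str:
        return '{"name": "get_weather", "arguments": {"city": "Berlin"}}'

    # Hypothetical tools the model is allowed to call.
    TOOLS = {
        "get_weather": lambda city: f"Sunny in {city}",
        "get_time": lambda city: f"It is 12:00 in {city}",
    }

    SYSTEM = (
        "You can call these functions. Reply with ONLY a JSON object of the form "
        '{"name": "<function name>", "arguments": {...}}.\n'
        "Functions: get_weather(city), get_time(city)"
    )

    def run(user_msg: str) -> str:
        reply = generate(f"{SYSTEM}\n\nUser: {user_msg}\nAssistant:")
        call = json.loads(reply)  # small models cope best with a strict, short schema
        return TOOLS[call["name"]](**call["arguments"])

    print(run("What's the weather in Berlin?"))
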
throwaway2027•30m ago
Seconded. Gemini used to be trash and I used Claude and Codex a lot, but gemini-3-flash-preview punches above its weight. It's decent, and I rarely if ever run into any token limits either.
PlatoIsADisease•22m ago
What models are you running locally? Just curious.

I am mostly restricted to 7–9B. I still like ancient early Llama because it's pretty unrestricted without having to use an abliteration.

maxloh•37m ago
Gemini 3 seems to have a much smaller token output limit than 2.5. I used to use Gemini to restructure essays into an LLM-style format to improve readability, but the Gemini 3 release was a huge step back for that particular use case.

Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response, it still truncates the source text too aggressively, losing vital context and meaning in the restructuring process.

I hope the 3.1 release includes a much larger output limit.

esafak•36m ago
People did find Gemini very talkative so it might be a response to that.
jayd16•31m ago
> Even when the model is explicitly instructed to pause due to insufficient tokens

Is there actually a chance it has the introspection to do anything with this request?

MallocVoidstar•19m ago
> Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response

AI models can't do this, at least not with just an instruction; maybe if you're writing some kind of custom 'agentic' setup.
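
For what that custom setup might look like, here is a minimal sketch of a driver loop that detects length-truncated responses and asks the model to continue. It uses an OpenAI-style chat endpoint purely as an illustration; the model name, token cap, and "continue" prompt are assumptions, not anything Gemini documents.

    from openai import OpenAI  # assumes an OpenAI-compatible endpoint; purely illustrative

    client = OpenAI()
    messages = [{"role": "user", "content": "Restructure this essay: ..."}]
    parts = []

    while True:
        resp = client.chat.completions.create(
            model="some-chat-model",  # placeholder model name
            messages=messages,
            max_tokens=4096,          # placeholder output cap
        )
        choice = resp.choices[0]
        parts.append(choice.message.content)
        if choice.finish_reason != "length":  # "length" means the cap cut the output off
            break
        # Feed the truncated answer back and ask the model to pick up where it stopped.
        messages.append({"role": "assistant", "content": choice.message.content})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})

    full_output = "".join(parts)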

esafak•30m ago
Has anyone noticed that models are dropping ever faster, with pressure on companies to make incremental releases to claim pole position, yet still making strides on benchmarks? This is what recursive self-improvement with human support looks like.
PlatoIsADisease•23m ago
Only going on my historical experience, not Gemini 3.1 Pro itself: I think we see benchmark chasing, then a grand release of a model that gets press attention...

Then a few days later, the model/settings are degraded to save money. This gets repeated until the last day before the release of the new model.

If we are benchmaxing, this works well because the model is only being tested early in its life cycle. By the middle of the cycle, people are testing other models. By the end, people are not testing them at all, and if they did, it would barely shake the last months of data.

emp17344•23m ago
Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of a sudden the same models that were doing well on ARC 1 couldn't even get 5% on ARC 2? I'm not convinced these benchmark improvements aren't data leakage.
redox99•17m ago
I don't think there's much recursive improvement yet.

I'd say it's a combination of

A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens. That takes many months. Now that RL is used so heavily, you can make infinitely many tweaks to the RL setup and, in just a month, get a better model from the same base model.

B) There's more compute online

C) Competition is more fierce.

__jl__•24m ago
Another preview release. Does that mean the models Google recommends for production are still 2.5 Flash and Pro? Not talking about what people are actually doing, but about the Google recommendation. Kind of crazy if that is the case.
qingcharles•24m ago
I've been playing with the 3.1 Deep Think version of this for the last couple of weeks and it was a big step up for coding over 3.0 (which I already found very good).

It's only February...

denysvitali•23m ago
Where is Simon's pelican?
saberience•20m ago
Please no, let's not.
msavara•23m ago
Somehow doesn't work for me :) "An internal error has occurred"
saberience•23m ago
I always try Gemini models when they get updated with their flashy new benchmark scores, but always end up using Claude and Codex again...

I get the impression that Google is focusing on benchmarks without assessing whether the models are actually improving in practical use cases.

I.e. they are benchmaxing

Gemini is "in theory" smart, but in practice is much, much worse than Claude and Codex.

skerit•11m ago
I'm glad someone else is finally saying this. I've been mentioning it left and right, and sometimes I feel like I'm going crazy that more people aren't noticing it.

Gemini can go off the rails SUPER easily. It just devolves into a gigantic mess at the smallest sign of trouble.

For the past few weeks, I've also been using XML-like tags in my prompts more often, sometimes to share previous conversations with `<user>` and `<assistant>` tags. Opus/Sonnet handles this just fine, but Gemini has a mental breakdown. It'll just start talking to itself.

Even in totally out-of-the-ordinary sessions, it goes crazy. After a while, it'll start saying it's going to do something, and then it pretends like it's done that thing, all in the same turn. A turn that never ends. Eventually it just starts spouting repetitive nonsense.

And you would think this is just because the bigger the context grows, the worse models tend to get. But no! This can happen well below even the 200,000-token mark.
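
For context, the tagged-transcript prompt style described above looks roughly like the sketch below; the tag names come straight from the comment, while the example turns and surrounding wording are made up for illustration.

    # Build a prompt that embeds a previous conversation in XML-like tags.
    turns = [
        ("user", "How do I vendor a Go dependency?"),
        ("assistant", "Run `go mod vendor` and commit the vendor/ directory."),
    ]
    transcript = "\n".join(f"<{role}>{text}</{role}>" for role, text in turns)
    prompt = (
        "Here is a previous conversation for context:\n"
        f"{transcript}\n\n"
        "Continue helping the user with their next question."
    )
    print(prompt)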

zhyder•18m ago
Surprisingly big jump in ARC-AGI-2 from 31% to 77%; I guess there's some RLHF focused on the benchmark, given it was previously far behind the competition and is now ahead.

Apart from that, the usual predictable gains in coding. It's still a great sweet spot for performance, speed, and cost. I need to hack Claude Code to use its agentic logic and prompts but with Gemini models.

I wish Google also updated Flash-lite to 3.0+; I'd like to use that for the Explore subagent (which Claude Code uses Haiku for). These subagents seem to be Claude Code's strength over Gemini CLI, which still has them only in experimental mode and doesn't have read-only ones like Explore.

WarmWash•15m ago
>I wish Google also updated Flash-lite to 3.0+

I hope every day that they have made gains on their diffusion model. As a subagent it would be insane, since it's compute-light and cranks out 1000+ tok/s.

vinhnx•18m ago
Model card https://deepmind.google/models/model-cards/gemini-3-1-pro/
makeavish•17m ago
I hope to have a great next two weeks before it gets nerfed.
unsupp0rted•10m ago
I've found Google (at least in AI Studio) is the only provider NOT to nerf their models after a few weeks.