Leanstral: Open-Source foundation for trustworthy vibe-coding

https://mistral.ai/news/leanstral
193•Poudlardo•2h ago

Comments

blurbleblurble•1h ago
Truly exciting
andai•1h ago
Trustworthy vibe coding. Much better than the other kind!

Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something?

On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.

DrewADesign•1h ago
It’s really not hard — just explicitly ask for trustworthy outputs only in your prompt, and Bob’s your uncle.
miacycle•13m ago
Assuming that what you're dealing with is assertable. I guess what I mean to say is that in some situations it is difficult to articulate what is correct and what isn't, depending upon the situation in which the software executes.
flowerbreeze•1h ago
They haven't made the chart very clear, but it seems the number of passes is configurable: at 2 passes it's better than Haiku and Sonnet, and at 16 passes it starts closing in on Opus, although it's not quite there, while consistently being less expensive than Sonnet.
andai•1h ago
Oh my bad. I'm not sure how that works in practice. Do you just keep running it until the tests pass? I guess with formal verification you can run it as many times as you need, right?
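
The "keep running it until it checks" idea can be sketched as a bounded retry loop. This is only an illustration: `generate` and `verify` are hypothetical stand-ins for a model call and a proof checker, not Leanstral's or Lean's actual API.

```python
from typing import Callable, Optional, Tuple


def first_verified(
    generate: Callable[[int], str],
    verify: Callable[[str], bool],
    max_attempts: int,
) -> Optional[Tuple[int, str]]:
    """Return (attempt_number, candidate) for the first candidate the
    verifier accepts, or None if every attempt is rejected."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)  # hypothetical model call
        if verify(candidate):          # hypothetical proof checker
            return attempt, candidate
    return None


# Toy stand-ins: the "model" only gets it right on its third try.
def fake_generate(attempt: int) -> str:
    return "valid proof" if attempt >= 3 else "broken proof"


def fake_verify(candidate: str) -> bool:
    return candidate == "valid proof"


result = first_verified(fake_generate, fake_verify, max_attempts=16)
```

Because the checker, not the model, decides what counts as a success, extra attempts cost money but should not let a wrong answer through, assuming the verifier itself is sound.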
lefrenchy•1h ago
Does Mistral come close to Opus 4.6 with any of their models?
DarkNova6•1h ago
Not at the moment, but a release of Mistral 4 seems close, which will likely bridge the gap.
re-thc•1h ago
Mistral Small 4 is already announced.
chucky_z•1h ago
I use mistral-medium-3.1 for a lot of random daily tasks, along with the vibe cli. In my personal opinion, mistral is my preferred 'model vendor' by far at this point. They're extremely consistent between releases while each of them just feels better. I also have a strong personal preference for the output.

I actively use gemini-3.1-pro-preview, claude-4.6-opus-high, and gpt-5.3-codex as well. I prefer them all for different reasons, however I usually _start_ with mistral if it's an option.

sa-code•59m ago
Why not Large 3? It's larger and cheaper
tjwebbnorfolk•42m ago
Mistral hasn't been in the running for SOTA for quite a while now
patall•1h ago
Maybe a naive question: given that they see better performance with more passes but the effect hits a limit after a few passes, would performance increase if they used different models per pass, i.e., leanstral, kimi, qwen and leanstral again instead of 4x leanstral?
andai•1h ago
This is called an "LLM alloy". You can even do it in agentic workflows, where you simply swap the model on each LLM invocation.

It does actually significantly boost performance. There was an article on here about it recently, I'll see if I can find it.

Edit: https://news.ycombinator.com/item?id=44630724

They found the more different the models were (the less overlap in correctly solved problems), the more it boosted the score.
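
A minimal sketch of swapping the model on each invocation. It assumes a plain round-robin order (the linked article may pick models differently), and the model callables here are hypothetical placeholders rather than real API clients.

```python
from itertools import cycle
from typing import Callable, Dict, List, Tuple

# A "model" is just a function from the running context to a reply.
Model = Callable[[str], str]


def run_alloy(models: Dict[str, Model], prompt: str, steps: int) -> List[Tuple[str, str]]:
    """Run `steps` invocations over one shared context, rotating through
    the given models so consecutive steps come from different models."""
    order = cycle(models.items())
    context = prompt
    transcript: List[Tuple[str, str]] = []
    for _ in range(steps):
        name, model = next(order)
        reply = model(context)
        transcript.append((name, reply))
        # Every model sees the full history, including other models' turns.
        context = context + "\n" + reply
    return transcript


# Hypothetical stand-ins that just report how many lines of context they saw.
fake_models: Dict[str, Model] = {
    "model_a": lambda ctx: f"a saw {len(ctx.splitlines())} lines",
    "model_b": lambda ctx: f"b saw {len(ctx.splitlines())} lines",
}

transcript = run_alloy(fake_models, "prove the lemma", steps=4)
```

The point of sharing one context is that each model builds on the others' partial work instead of starting from scratch.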

patall•1h ago
That sounds quite interesting. Makes me wonder if sooner or later they will have to train multiple independent models that cover those different niches. But maybe we will see that sooner or later. Thanks for the link.
cyanydeez•57m ago
One would think that, with LoRAs being so successful in Stable Diffusion, more people would be focused on constructing framework-based LoRAs; but the economics of all this probably preclude going niche in any direction, so everyone just keeps building the do-all models.
jasonjmcghee•1h ago
Curious if anyone else had the same reaction as me

This model is specifically trained on this task and significantly[1] underperforms opus.

Opus costs about 6x more.

Which seems... totally worth it based on the task at hand.

[1]: based on the total spread of tested models

DarkNova6•1h ago
I'm never sure how much faith one can put into such benchmarks but in any case the optics seem to shift once you have pass@2 and pass@3.

Still, the more interesting comparison would be against something such as Codex.

beernet•48m ago
Agreed. The idea is nice and honorable. At the same time, if AI has been proving one thing, it's that quality usually reigns over control and trust (except for some sensitive sectors and applications). Of course it's less capital-intensive, so it makes sense for a comparatively small EU startup to focus on that niche. It likely won't move the top-line needle much, though, for the reasons stated.
miohtama•36m ago
Alignment tax eats directly into model quality, by double-digit percentages.
hermanzegerman•15m ago
The EU could help them a lot if it started enforcing its laws, so that no US company can process European data, since the Americans are unwilling to budge on the CLOUD Act.

That would also help to reduce our dependency on American hyperscalers, which is much needed given how untrustworthy the US is right now. (And also hostile towards Europe, as their new security strategy lays out.)

kittikitti•1h ago
This is great, congratulations to the Mistral team! I'm looking forward to the code arena benchmark results. Thanks for sharing.
Havoc•1h ago
What are these "passes" they reference here? Haven't seen that before in LLM evals

Could definitely be interesting for having another model run over the codebase when looking for improvements

rockinghigh•1h ago
It's the number of attempts at answering the question.
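
For reference, a sketch of the pass@k estimator commonly used in code-generation evals: given `n` recorded attempts of which `c` were correct, it estimates the chance that at least one of `k` randomly drawn attempts passes. Whether the Leanstral chart is computed exactly this way is an assumption; the post does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled attempts is
    correct, given that c of n recorded attempts were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the ways to draw k attempts that all failed;
    # it is 0 when fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 correct answers out of 16 recorded attempts.
single = pass_at_k(16, 4, 1)           # 0.25
double = round(pass_at_k(16, 4, 2), 2)  # 0.45
```

This is why a model that looks weak at one pass can look much better at 2 or 16: the score only needs a single attempt to succeed.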
lsb•54m ago
The real world success they report reminds me of Simon Willison’s Red Green TDD: https://simonwillison.net/guides/agentic-engineering-pattern...

> Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.
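
A toy Lean 4 illustration of the behaviour the quote describes. It is not the case from the Mistral post, and `double` is a made-up definition. Since `rw` looks for the pattern `a + a` syntactically, a bare `rw [h]` on the initial goal should fail to find it behind the `def`; unfolding first exposes it.

```lean
def double (n : Nat) : Nat := n + n

example (a b : Nat) (h : a + a = b) : double a = b := by
  -- The goal is `double a = b`; the `a + a` is hidden inside `double`.
  unfold double
  -- Now the goal is `a + a = b`, so the rewrite can match.
  rw [h]
```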

skanga•12m ago
TDD == Prompt Engineering, for Agentic coding tasks.
flakiness•48m ago
FYI The Lean 4 paper: https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37
elAhmo•28m ago
I don’t know a single person using Mistral models.
pelagicAustral•23m ago
Me neither, they're not ready for prime time imo. I have a yearly sub and the product is just orders of magnitude behind Anthropic's offering. I use Code for real world stuff and I am happy with the result, Mistral is just not something I can trust right now.
consumer451•7m ago
Isn't their latest speech-to-text model SOTA? When I tested it on jargon, it was amazing.

https://news.ycombinator.com/item?id=46886735

glinksss•17m ago
Oh, is this a new AI model?
miacycle•14m ago
The TDD foundation! We might need one of those. :)
JoshTriplett•14m ago
Pleasant surprise: someone saying "open source" and actually meaning Open Source. It looks like the weights are Apache-2.0 licensed.
