
MAI-Code-1-Flash

https://microsoft.ai/news/introducingmai-code-1-flash/
137•EvanZhouDev•1h ago
https://microsoft.ai/models/mai-code-1-flash/

https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF

Launching seven new MAI models: https://microsoft.ai/news/building-a-hillclimbing-machine-la...

Comments

OsrsNeedsf2P•59m ago
So it's trained on the SWE Bench Pro evalset
lemonish97•57m ago
What is your evidence for this claim?
fooker•55m ago
They say hill climbing

https://microsoft.ai/news/building-a-hillclimbing-machine-la...

Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.

AntiRush•57m ago
The introductory blog post has a lot more information

https://microsoft.ai/news/introducingmai-code-1-flash/

and the model card

https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF

The broader announcement of 7 MAI models seems to be where the 5B active in the title comes from

https://microsoft.ai/news/building-a-hillclimbing-machine-la...

dang•13m ago
Thanks! I've changed the top link to the blog post and put the other links in the toptext.
onlyrealcuzzo•49m ago
Gemma 4 26B-A4B scored exceptionally well with 20% fewer params, so this isn't unprecedented.
hootz•42m ago
I'd love to see a tokens per second metric. I always prioritize speed over raw intelligence for flash models.
throwaw12•19m ago
> I always prioritize speed over raw intelligence for flash models.

This model might have a perfect speed:

    import random
    words = ["fast", "cheap", "wrong"]  # placeholder word list
    for i in range(100):
        print(random.choice(words))
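For the tokens-per-second question upthread, here is a minimal sketch of how one might measure it, assuming the model's output is available as an iterable of tokens. `tokens_per_second` and its `clock` parameter are names made up for this sketch, not part of any vendor SDK.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Consume a token stream and return the observed throughput.

    `stream` is a hypothetical stand-in for whatever a real client yields.
    `clock` is injectable so the function can be checked without real waiting.
    """
    start = clock()
    count = sum(1 for _ in stream)  # drain the stream, counting tokens
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Fake stream for illustration only; a real one would come from a model.
    fake_stream = (f"tok{i}" for i in range(1000))
    print(f"{tokens_per_second(fake_stream):.0f} tokens/s")
```

This counts wall-clock time for the whole stream, so it folds time-to-first-token into the average; separating the two would need a timestamp per token.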
capten•41m ago
It's so weird to me that the benchmarks remain so low, but the models are marketed as revolutionary. And if you say that low coding capabilities aren't a problem, say that to the token price hike and 'general use' model setup.

Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?

redrove•24m ago
It’s about bang for buck. That high a score for 5B params is pretty good, nigh unbelievable a short while ago.

It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.

Yet another reason the current buildout will feel like the railroads.

Flere-Imsaho•16m ago
Yeah the future is probably a number of highly specialised small models you can run on your own hardware rather than massive frontier models in the cloud.

That's what I'm betting on anyway.

thewebguyd•11m ago
That seems to be what Microsoft is betting on too, based on what was shown at the BUILD keynote today, plus the new Surface Ultra and the Surface mini PC with the new Nvidia chip. Nadella really played up local AI as the main use case they have in mind.
search_facility•4m ago
MoE basically works that way already: Qwen etc. with low active params (the A-number in the name) let you run inference on big models locally (only the active params have to fit into memory)
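A rough back-of-the-envelope for the A-number, using the two sizes quoted in this thread (137B-A5B and 35B-A3B). The 0.5 bytes per parameter is an assumption (roughly 4-bit quantization), and `moe_footprint` is a made-up helper. The sketch also assumes that only the active parameters are touched per token, while the full set of weights still has to be stored somewhere (VRAM, RAM, or disk).

```python
def moe_footprint(total_b: float, active_b: float, bytes_per_param: float = 0.5) -> dict:
    """Estimate storage and per-token working set for a mixture-of-experts model.

    total_b / active_b are parameter counts in billions, so multiplying by
    bytes-per-parameter gives gigabytes (1e9 bytes) directly.
    bytes_per_param=0.5 is an assumed 4-bit quantization, not a published figure.
    """
    return {
        "active_fraction": active_b / total_b,
        "weights_gb": total_b * bytes_per_param,          # everything that must be stored
        "active_gb_per_token": active_b * bytes_per_param,  # what one token actually reads
    }


mai = moe_footprint(137, 5)   # MAI-Code-1-Flash, sizes as quoted in the thread
qwen = moe_footprint(35, 3)   # Qwen3.6-35B-A3B, sizes as quoted in the thread

for name, fp in (("137B-A5B", mai), ("35B-A3B", qwen)):
    print(
        f"{name}: {fp['active_fraction']:.1%} active, "
        f"{fp['weights_gb']:.1f} GB stored, "
        f"{fp['active_gb_per_token']:.1f} GB read per token"
    )
```

Under these assumptions the A-number mostly governs per-token compute and bandwidth, while the total count governs how much storage you need.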
ajyoon•41m ago
Scroll wheel hijacked on this entire domain
matchbok3•36m ago
Yeah this website is horrendous to use. What were they thinking?
BadBadJellyBean•32m ago
You mean "what was the LLM thinking?"
grav•3m ago
Fix:

  (() => {
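  // Event types that scroll-hijacking libraries typically listen for.
  const KILL = ['wheel', 'mousewheel', 'DOMMouseScroll', 'touchmove'];
  // Stop the event from reaching later listeners; the browser's default scroll still happens.
  const block = e => e.stopImmediatePropagation();
  for (const t of KILL) {
    // Capture phase on window/document runs before listeners further down the tree.
    window.addEventListener(t, block, { capture: true, passive: true });
    document.addEventListener(t, block, { capture: true, passive: true });
  }
  // Drop the Lenis state classes from <html>.
  document.documentElement.classList.remove('lenis','lenis-smooth','lenis-scrolling','lenis-stopped');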
  console.log('Scroll hijack disabled — native scrolling restored.');
  })();
tosh•37m ago
not open weight or at least I did not find anything indicating open weight
freediddy•33m ago
Is 51% good enough to reliably use? There's no world in which I use an AI agent where it gets even 15% of the code wrong; that's as bad as Tesla FSD, where you need to pay attention to the road while engaging FSD. What's the point? My attention is what I'm trying to relieve, not mostly correct functionality. The only thing that matters is whether you can one-shot code like Claude or Codex; I'm not interested in a small but mostly-okay-but-annoyingly-buggy-every-now-and-then AI.
VygmraMGVl•28m ago
Claude Opus 4.6 scores 51.9% on the same benchmark. Microsoft's result is quite good.
bguberfain•32m ago
It is good to see big companies like Microsoft launching LLMs. They have a large amount of compute power and good scientists to create useful models.
ComputerGuru•23m ago
Microsoft has been releasing LLMs for years.
ipsum2•22m ago
Sort of. Phi models were just trained on GPT outputs though.
jwitthuhn•19m ago
And occasionally un-releasing them like with WizardLM.
lemonish97•8m ago
They were mostly distilled or fine-tuned OAI models.
mattlondon•32m ago
Comparing against Claude 4.5? Aren't we up to 4.8? Bit disingenuous?
0vermorrow•31m ago
The latest Haiku (the smallest Anthropic model) is version 4.5; they haven't released a newer version, hence the comparison to that.
klardotsh•28m ago
They're comparing to Haiku, not Opus. Haiku is currently at 4.5.

Even if it were Opus, comparing to a version number makes for an interesting snapshot-in-time comparison: if you knew how a model performed at whatever time it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA" - which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be such an open-weight model, following the western trend of locking models up.)

mentos•28m ago
Shouldn’t the next model focus be not on code but on system design?

Seems like the work from a good system design to code is practically solved.

Now it’s a matter of the design of the system. Or is that represented in these evals?

dist-epoch•6m ago
Have you tried system design with LLMs? I find them pretty good at suggesting 5 architectures for a problem and then iterating on the solutions.

Even if I had no idea, going with the default suggestion would not be a terrible mistake, assuming you did describe your requirements relatively well.

LoganDark•21m ago
"Clean data" is impossible. Language models have polluted the landscape to such a degree it's impossible to filter them out now. OpenAI has no doubt discarded or muddled their dataset that was used to train the original ChatGPT, so there may be no dataset in existence now that isn't contaminated.
hmokiguess•21m ago
Does anyone actually use these smaller models for coding? If so, how? I usually Opus everything. Is the play to plan/design/architect with a heavier model, then delegate structured tasks to these smaller ones? Would appreciate hearing from someone who has done and tested both paths.
ojr•17m ago
I use Gemini 3 Flash. I've seen the Claude Code setups, and I'm bullish on Anthropic since people are driving up token usage, but I am able to produce outcomes with a fraction of the money.
hmokiguess•14m ago
Do you mind sharing your workflow? What do you mean by a fraction of the money? In my case, I've yet to reach a session limit on the subscription plan. I'm not "tokenmaxxing", as they say, so it's hard to see a scenario in which the plan is expensive for the value I get.
dist-epoch•11m ago
If you don't hit a limit running Opus, it means you are very much in the loop.

For example you probably don't have days where you ask Opus to review your whole code base and look for code duplication/technical debt/robustness issues, and then to fix some of the found issues, and do this 3-5 times until no big issues are found anymore.

newusertoday•14m ago
Plan using Opus, execute using local.
killermouse0
gslepak•20m ago
Would be cool if this were an open model.
striking•19m ago
To be clear about the size of the model: MAI-Code-1-Flash is 137B A5B.
camelmel•14m ago
Huh, according to that model card this is a 137B total parameter model.

Performance doesn't seem that good:

- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro

- Qwen3.6-35B-A3B = 49.5% on SWE-bench pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

They benchmark against Claude Haiku but Haiku is not good, it's worse than tiny open models you can run locally or via API at 10% the cost.

giancarlostoro•4m ago
The takeaway is that this is a smaller model that competes with Haiku. I would hope they come out with a "Sonnet"-competing model, then Opus. I have been wondering why Microsoft is kind of "sleeping" on offering models they themselves have made on Copilot; maybe it was part of their deal with OpenAI? Not sure.
efields•11m ago
Please test your websites in Safari. Almost all of your iOS users use it by default, and the desktop experience is pretty close to the mobile experience, so testing is easy.

That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).

jMyles•10m ago
I'd really like to get back to an autocomplete flow, ideally with some shared, optimized context between it and my larger agent models.

But it seems like, by and large, even the faster models are now aimed at longer-running agentic flows and not sub-1s autocomplete. Or am I wrong about that?

zb3•8m ago
So it's not an open model while not being much better? Meh.
mmaunder•5m ago
You lost me at forced scrolling. Ugh!
kylehotchkiss•5m ago
"superintellegence team"

Why not assign them to make windows good :D

dist-epoch•4m ago
The SOTA models will not shrink, because the problems will get bigger, from "write me a C compiler" to "clone Stripe business and run it".
killermouse0•13m ago
I was wondering the same. I guess it makes sense to use a heavyweight model to make the entire design and split the work so that smaller models (possibly local ones?) would then do the coding... But how would I even do that? I'm using Claude Code. Would I need support for this within the harness?
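The "plan with a heavy model, execute with a small one" split discussed in this subthread could look roughly like the sketch below. Everything here is hypothetical: `plan_then_delegate`, `fake_planner`, and `fake_worker` are stand-ins, and the sketch says nothing about whether any particular harness supports this. In practice the two callables would wrap whatever large and small models you have access to.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    description: str
    result: str = ""


def plan_then_delegate(
    goal: str,
    planner: Callable[[str], str],
    worker: Callable[[str], str],
) -> List[Task]:
    """Call the (expensive) planner once, then run each planned step on the (cheap) worker.

    The planner is assumed to return one task per line; blank lines are skipped.
    """
    tasks = [Task(line.strip()) for line in planner(goal).splitlines() if line.strip()]
    for task in tasks:
        task.result = worker(task.description)
    return tasks


# Stand-in callables so the sketch runs without any model or API key.
def fake_planner(goal: str) -> str:
    return f"write tests for {goal}\n\nimplement {goal}\n"


def fake_worker(task: str) -> str:
    return f"done: {task}"


tasks = plan_then_delegate("a URL parser", fake_planner, fake_worker)
for t in tasks:
    print(f"{t.description} -> {t.result}")
```

The design choice is that the planner is called exactly once per goal, so the expensive model's cost stays fixed while the number of cheap worker calls scales with the plan.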
linuxhansl•2m ago
I am using Opus 4.x at work, and these "smaller" (20-80bn, 3-4bn active) models at home. Unfortunately there is no comparison, yet (IMHO anyway).

With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.

The smaller models seem to "try". They work for smaller tasks, but for more complex tasks it's often more work than doing it myself.

I wish it were different, and maybe in a year or two it will be.
