https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF
Launching seven new MAI models: https://microsoft.ai/news/building-a-hillclimbing-machine-la...
https://microsoft.ai/news/introducingmai-code-1-flash/
and the model card
https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF
The broader announcement of 7 MAI models seems to be where the 5B active in the title comes from
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
This model might have a perfect speed:

  import random
  # assumes `words` is a list of strings defined elsewhere
  for i in range(100):
      print(random.choices(words))

Why not sell it as a math agent? Why do I have to set up 4 agents to check each other's work?
It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.
Yet another reason the current buildout will feel like the railroads.
That's what I'm betting on anyway.
(() => {
const KILL = ['wheel', 'mousewheel', 'DOMMouseScroll', 'touchmove'];
const block = e => e.stopImmediatePropagation();
for (const t of KILL) {
window.addEventListener(t, block, { capture: true, passive: true });
document.addEventListener(t, block, { capture: true, passive: true });
}
document.documentElement.classList.remove('lenis','lenis-smooth','lenis-scrolling','lenis-stopped');
console.log('Scroll hijack disabled — native scrolling restored.');
})();
Even if it were Opus, comparing against a specific version number makes for an interesting point-in-time comparison: if you know how that model performed when it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA", which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be an open-weight model, following the Western trend of locking models up.)
Seems like the work from a good system design to code is practically solved.
Now it’s a matter of the design of the system. Or is that represented in these evals?
Even if I had no idea, going with the default suggestion would not be a terrible mistake, assuming you did describe your requirements relatively well.
For example you probably don't have days where you ask Opus to review your whole code base and look for code duplication/technical debt/robustness issues, and then to fix some of the found issues, and do this 3-5 times until no big issues are found anymore.
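The review-then-fix routine described above could be sketched roughly like this. It is only an illustration: `review` and `fix` are hypothetical callables standing in for whatever model calls you actually use, and the cap of 5 rounds just mirrors the "3-5 times" mentioned above.

```python
from typing import Callable, List


def review_until_clean(
    review: Callable[[], List[str]],
    fix: Callable[[List[str]], None],
    max_rounds: int = 5,
) -> int:
    """Alternate review and fix until review finds nothing.

    `review` and `fix` are hypothetical hooks (for example, thin wrappers
    around a model call); they are not part of any real API.
    Returns the number of fix rounds that were actually performed.
    """
    for round_number in range(1, max_rounds + 1):
        issues = review()
        if not issues:
            # Clean review: every earlier round included a fix.
            return round_number - 1
        fix(issues)
    # Round cap reached while issues were still being reported.
    return max_rounds
```

Passing the two steps in as callables keeps the loop independent of any particular model or tool, so the same skeleton works whether the reviewer is Opus or a smaller model.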
Performance doesn't seem that good:
- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro
- Qwen3.6-35B-A3B = 49.5% on SWE-bench pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
They benchmark against Claude Haiku, but Haiku is not good: it's worse than tiny open models you can run locally or via API at 10% of the cost.
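To put the two SWE-bench Pro figures quoted above side by side, here is a small arithmetic sketch. It assumes "137B-A5B" and "35B-A3B" mean total and active parameters in billions, and it uses only the numbers from this comment; none of them are independently verified here.

```python
# Figures exactly as quoted in the comment above (not independently verified).
# Assumption: "137B-A5B" means 137B total / 5B active parameters, and
# "35B-A3B" means 35B total / 3B active parameters.
models = {
    "MAI-Code-1-Flash": {"total_b": 137, "active_b": 5, "swe_bench_pro": 51.0},
    "Qwen3.6-35B-A3B": {"total_b": 35, "active_b": 3, "swe_bench_pro": 49.5},
}

mai = models["MAI-Code-1-Flash"]
qwen = models["Qwen3.6-35B-A3B"]

score_gap = mai["swe_bench_pro"] - qwen["swe_bench_pro"]  # percentage points
total_ratio = mai["total_b"] / qwen["total_b"]            # total size ratio
active_ratio = mai["active_b"] / qwen["active_b"]         # active size ratio

print(f"score gap: {score_gap:.1f} points")
print(f"total parameters: {total_ratio:.1f}x")
print(f"active parameters: {active_ratio:.1f}x")
```

On those quoted numbers the gap is 1.5 points, while the larger model has roughly 3.9x the total and 1.7x the active parameters.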
That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).
But it seems like, by and large, even the faster models are now aimed at longer-running agentic flows and not sub-1s autocomplete. Or am I wrong about that?
Why not assign them to make Windows good :D
With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.
The smaller models seem to "try". They work for smaller tasks, but for more complex tasks it's often more work than doing it myself.
I wish it were different, and maybe in a year or two it will be.
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.