https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF
Launching seven new MAI models: https://microsoft.ai/news/building-a-hillclimbing-machine-la...
https://microsoft.ai/news/introducingmai-code-1-flash/
and the model card
https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF
The broader announcement of 7 MAI models seems to be where the 5B active in the title comes from
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
This model might have a perfect speed:

    import random
    words = ["lorem", "ipsum", "dolor"]  # placeholder vocabulary
    for i in range(100):
        print(random.choices(words))

Why not sell it as a math agent? Why do I have to set up 4 agents to check each other's work?
It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.
Yet another reason the current buildout will feel like the railroads.
That's what I'm betting on anyway.
  (() => {
    const KILL = ['wheel', 'mousewheel', 'DOMMouseScroll', 'touchmove'];
    const block = e => e.stopImmediatePropagation();
    for (const t of KILL) {
      window.addEventListener(t, block, { capture: true, passive: true });
      document.addEventListener(t, block, { capture: true, passive: true });
    }
    document.documentElement.classList.remove('lenis','lenis-smooth','lenis-scrolling','lenis-stopped');
    console.log('Scroll hijack disabled: native scrolling restored.');
  })();

These things can be useful if you can accurately predict which tasks they will reliably do, and which they will usually fail on. Then you can get much more reliable work from them.
They also did some more interesting work like showing very small models can be coherent as long as you have very simple children's book style training data (TinyStories is pretty famous).
Lots of these ideas are still used. Learning facts at scale with active reading is an ICLR 2026 paper that does a lot of similar work.
Even if it were Opus, comparing to a version number makes for an interesting snapshot-in-time comparison: if you know how a model performed when it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA", which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be an open-weight model, following the Western trend of locking models up.)
Seems like getting from a good system design to code is practically solved.
Now it’s a matter of the design of the system. Or is that represented in these evals?
Even if I had no idea, going with the default suggestion would not be a terrible mistake, assuming you did describe your requirements relatively well.
For example you probably don't have days where you ask Opus to review your whole code base and look for code duplication/technical debt/robustness issues, and then to fix some of the found issues, and do this 3-5 times until no big issues are found anymore.
Performance doesn't seem that good:
- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro
- Qwen3.6-35B-A3B = 49.5% on SWE-bench pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
They benchmark against Claude Haiku, but Haiku is not good: it's worse than tiny open models you can run locally or via API at 10% of the cost.
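To put the two numbers in the list above side by side, here is a quick back-of-the-envelope calculation. It only uses the figures quoted in this comment (137B total / 5B active at 51%, 35B total / 3B active at 49.5%); the dictionary layout and variable names are made up for illustration.

```python
# Figures are the ones quoted in the comment above; nothing here is measured.
models = {
    "MAI-Code-1-Flash": {"total_b": 137, "active_b": 5, "swe_bench_pro": 51.0},
    "Qwen3.6-35B-A3B": {"total_b": 35, "active_b": 3, "swe_bench_pro": 49.5},
}

mai = models["MAI-Code-1-Flash"]
qwen = models["Qwen3.6-35B-A3B"]

# Difference in benchmark score, in percentage points.
score_gap = mai["swe_bench_pro"] - qwen["swe_bench_pro"]
# How much larger the MAI model is, by total and by active parameters.
total_ratio = mai["total_b"] / qwen["total_b"]
active_ratio = mai["active_b"] / qwen["active_b"]

print(f"score gap: {score_gap:.1f} points")         # 1.5 points
print(f"total params: {total_ratio:.1f}x larger")   # 3.9x
print(f"active params: {active_ratio:.1f}x larger") # 1.7x
```

So on these numbers the larger model buys about 1.5 points for roughly 3.9x the total parameters and 1.7x the active ones, which is the gist of the complaint.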
Yeah, not a 5B param model as the earlier title implied!
That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).
But it seems like, by and large, even the faster models are now aimed at longer-running agentic flows and not sub-1s autocomplete. Or am I wrong about that?
Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
Why not assign them to make Windows good :D
Seriously tho, wtf is going on over at Meta? Anyone working there currently want to describe the vibe of the org when it comes to being a frontier company?
And this certainly won't bring me back to GitHub Copilot, which I cancelled yesterday.
GitHub Copilot had competitive pricing until yesterday, when they changed from per-request to one of the most expensive per-token quotas.
I have since changed to DeepSeek Flash on high, which is Sonnet+ level for almost free.
If I feel I still need smarter models I might sign up for $20/mo Codex to use GPT 5.5 which, in my opinion, is the best I can access right now.
I use quite plain prompts, nothing fancy:
> go over the tests and do a code review, focusing on how well they test inventory management, planner and controller. maybe some tests need to be deleted, maybe other tests need to be added. the end goal should be good coverage of the core features.
> do a code review, focusing on robustness/correctness issues. validate that the code correctly implements specification.md. focus on the async client.
> there was a big refactor. please do a code review, focusing on eliminating tech debt. look for unused, obsolete or duplicate code that can be removed, look for mismatched interfaces, inconsistent function/argument/variable names. do not output what is correct, just the issues you found. for each issue output instructions for a coding agent on how to fix it. do not nitpick.
Perform a thorough analysis of the <project_name> project (the code and the documentation).
- Explore the project, go over all important files one by one and look for any mistakes or possible bugs.
- Look for refactoring opportunities and ways to improve code quality and organization.
- Identify any potential cruft/bloat, to ensure our code is clean and logically laid out. Keep in mind that efficient and good quality code needs to avoid over-engineered constructs and needless complexity. Avoid complicated logic where simple solutions would be more elegant.
- Pay attention to comments: There should be enough of them to document the intent and provide high-level overview of the code logic, but not too much; avoid/remove excessive comments that simply restate the code logic or do not provide any useful information.
- Every important function should have a top-level docstring comment that clearly explains its purpose, high-level logic overview, arguments, and return values.
- Analyze the names of constants/variables/functions/classes and other code elements: could some of them be renamed to make their purpose more clear?
- Analyze the documentation, uncover any potential inaccuracies/omissions and ensure the docs reflect the code.
- Brainstorm ideas for improvements of the code and docs.
After you finish the analysis, save an analysis report into "<project_name>_analysis_report.md" in the project root folder.

With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.
The smaller models seem to "try". They work for smaller tasks, but for more complex tasks it's often more work than doing it myself.
I wish it were different, and maybe in a year or two it will be.
...but I spend so much more time correcting it, or building pipelines to try, retry, and converge, that it's rarely worthwhile for me in either time or $ spent vs Opus.
As we build a better and better harness and better feedback/verifiers, we're switching more to 3.5 flash. I think Chinese models would work too, but we can't use those atm.
Generally there's a coordinator running Opus and an ever-growing set of skills and subagents that take actions using weaker models and output feedback to the coordinator Opus.
I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do and that will trend down as we better build harnesses for our own codebases.
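For anyone wondering what that coordinator/subagent setup might look like in code, here is a rough sketch. Everything in it is an assumption for illustration: the "models" are plain Python functions rather than real API calls, the names (`run_subagent`, `coordinate`, `Feedback`) are invented, and the "escalate to the strong model on failure" step is just one plausible way to use the verifier feedback, not necessarily what the commenter's team does.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins: a "model" is any function from a prompt to text,
# and a verifier is any function that accepts or rejects an output.
Model = Callable[[str], str]
Verifier = Callable[[str, str], bool]


@dataclass
class Feedback:
    task: str
    output: str
    passed: bool
    escalated: bool = False


def run_subagent(task: str, worker: Model, verifier: Verifier,
                 max_attempts: int = 3) -> Feedback:
    """Let the cheap model retry a task until the verifier accepts it."""
    output = ""
    for _ in range(max_attempts):
        output = worker(task)
        if verifier(task, output):
            return Feedback(task, output, passed=True)
    return Feedback(task, output, passed=False)


def coordinate(tasks: List[str], worker: Model, strong: Model,
               verifier: Verifier) -> List[Feedback]:
    """Send each task to the weak model first; escalate only on failure."""
    results = []
    for task in tasks:
        fb = run_subagent(task, worker, verifier)
        if not fb.passed:
            output = strong(task)
            fb = Feedback(task, output, verifier(task, output), escalated=True)
        results.append(fb)
    return results


# Toy demo with fake models; no real API is called.
def weak_model(task: str) -> str:
    return task.upper() if len(task) < 10 else "???"


def strong_model(task: str) -> str:
    return task.upper()


def uppercase_verifier(task: str, output: str) -> bool:
    return output == task.upper()


if __name__ == "__main__":
    for fb in coordinate(["fix typo", "refactor the async client"],
                         weak_model, strong_model, uppercase_verifier):
        print(fb)
```

In the toy run the short task is handled by the weak model and only the longer one is escalated. The point is that the verifier, not the model, decides when the expensive model gets involved.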
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.
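As a rough illustration of what "completely separate" would have to mean at minimum, here is a sketch of the crudest possible check: exact overlap between training and evaluation task IDs. The function name and the example IDs are hypothetical, and nothing is known here about how Microsoft actually splits its data.

```python
from typing import Iterable, Set, Tuple


def benchmark_overlap(train_ids: Iterable[str],
                      eval_ids: Iterable[str]) -> Tuple[Set[str], float]:
    """Return the shared task IDs and the fraction of the eval set they cover."""
    train, evals = set(train_ids), set(eval_ids)
    shared = train & evals
    fraction = len(shared) / len(evals) if evals else 0.0
    return shared, fraction


# Hypothetical task IDs, purely for illustration.
shared, fraction = benchmark_overlap(
    ["repo-a/1", "repo-a/2", "repo-b/7"],
    ["repo-b/7", "repo-c/3"],
)
print(shared, fraction)
```

Exact ID matching only catches the most blatant leakage; near-duplicates or tasks drawn from the same repositories would slip through, which is why an explicit statement from the authors matters.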