
MAI-Code-1-Flash

https://microsoft.ai/news/introducingmai-code-1-flash/
173•EvanZhouDev•1h ago
https://microsoft.ai/models/mai-code-1-flash/

https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF

Launching seven new MAI models: https://microsoft.ai/news/building-a-hillclimbing-machine-la...

Comments

OsrsNeedsf2P•1h ago
So it's trained on the SWE Bench Pro evalset
lemonish97•1h ago
What is your evidence for this claim?
fooker•1h ago
They say hill climbing

https://microsoft.ai/news/building-a-hillclimbing-machine-la...

Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.

AntiRush•1h ago
The introductory blog post has a lot more information

https://microsoft.ai/news/introducingmai-code-1-flash/

and the model card

https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF

The broader announcement of 7 MAI models seems to be where the 5B active in the title comes from

https://microsoft.ai/news/building-a-hillclimbing-machine-la...

dang•39m ago
Thanks! I've changed the top link to the blog post and put the other links in the toptext.
onlyrealcuzzo•1h ago
Gemma 4 26B-A4B scored exceptionally well with 20% less params, so this isn't unprecedented.
hootz•1h ago
I'd love to see a tokens per second metric. I always prioritize speed over raw intelligence for flash models.
throwaw12•45m ago
> I always prioritize speed over raw intelligence for flash models.

This model might have a perfect speed:

    import random
    words = ["to", "be", "or", "not"]  # placeholder vocabulary
    for i in range(100):
        print(random.choices(words))
OsrsNeedsf2P•4m ago
Leave it long enough, and it'll print the works of Shakespeare!
capten•1h ago
It's so weird to me that the benchmarks remain so low, but the models are marketed as revolutionary. And if you say that low coding capabilities aren't a problem, say that to the token price hike and 'general use' model setup.

Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?

redrove•50m ago
It’s about bang for buck. That high a score for 5B params is pretty good, nigh unbelievable a short while ago.

It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.

Yet another reason the current buildout will feel like the railroads.

Flere-Imsaho•42m ago
Yeah the future is probably a number of highly specialised small models you can run on your own hardware rather than massive frontier models in the cloud.

That's what I'm betting on anyway.

thewebguyd•37m ago
That seems to be what Microsoft is betting on also based on what was shown at the BUILD keynote today + that new surface ultra and the surface mini PC with the new Nvidia chip. Nadella really played up local AI as the main use case they have in mind.
search_facility•30m ago
MoE basically works that way already: Qwen etc. with low active params (the A-number in the name) allow running inference on big models locally (only active params have to fit into memory)
ajyoon•1h ago
Scroll wheel hijacked on this entire domain
matchbok3•1h ago
Yeah this website is horrendous to use. What were they thinking?
BadBadJellyBean•58m ago
You mean "what was the LLM thinking?"
grav•29m ago
Fix:

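  (() => {
    // Event types that scroll-hijacking scripts commonly listen for.
    const KILL = ['wheel', 'mousewheel', 'DOMMouseScroll', 'touchmove'];
    // Stop any listeners that would run after this one from seeing the event.
    const block = e => e.stopImmediatePropagation();
    for (const t of KILL) {
      // Capture phase: runs before handlers registered further down the tree.
      // passive is fine here because preventDefault() is never called.
      window.addEventListener(t, block, { capture: true, passive: true });
      document.addEventListener(t, block, { capture: true, passive: true });
    }
    // Drop the Lenis state classes from the <html> element.
    document.documentElement.classList.remove('lenis', 'lenis-smooth', 'lenis-scrolling', 'lenis-stopped');
    console.log('Scroll hijack disabled; native scrolling restored.');
  })();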
tosh•1h ago
not open weight or at least I did not find anything indicating open weight
freediddy•59m ago
Is 51% good enough to use reliably? There's no world in which I use an AI agent that gets even 15% of the code wrong; that's as bad as Tesla FSD, where you need to pay attention to the road while FSD is engaged. What's the point? My attention is what I'm trying to relieve, not mostly correct functionality. The only thing that matters is whether you can one-shot code like Claude or Codex. I'm not interested in a small, mostly-okay-but-annoyingly-buggy-every-now-and-then AI.
VygmraMGVl•54m ago
Claude opus 4.6 scores 51.9% on the same benchmark. Microsoft's result is quite good.
IanCal•11m ago
51% does not mean it randomly gets things wrong half the time.

These things can be useful if you can accurately predict which tasks they will reliably do, and which they will usually fail on. Then you can get much more reliable work from them.
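To make the point concrete, here is a minimal sketch with invented numbers. It assumes a hypothetical benchmark split evenly between one category the model almost always solves and one it almost never solves; the categories and solve rates are made up for illustration and are not taken from SWE-bench Pro.

```python
# Hypothetical per-category solve rates; invented for illustration only.
categories = {
    "small, well-specified bug fixes": {"tasks": 50, "solve_rate": 0.96},
    "large cross-module refactors": {"tasks": 50, "solve_rate": 0.06},
}

def overall_pass_rate(cats):
    """Task-weighted average solve rate across all categories."""
    total = sum(c["tasks"] for c in cats.values())
    solved = sum(c["tasks"] * c["solve_rate"] for c in cats.values())
    return solved / total

def routed_pass_rate(cats, threshold):
    """Pass rate if only categories at or above `threshold` are sent to the model."""
    kept = {k: c for k, c in cats.items() if c["solve_rate"] >= threshold}
    return overall_pass_rate(kept)

print(f"overall: {overall_pass_rate(categories):.0%}")
print(f"routed:  {routed_pass_rate(categories, 0.9):.0%}")
```

With these made-up figures the aggregate is (48 + 3) / 100 = 51%, yet the work you would actually hand to the model succeeds far more often than that.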

bguberfain•58m ago
It is good to see big companies like Microsoft launching LLMs. They have a large amount of compute power and good scientists to create useful models.
ComputerGuru•49m ago
Microsoft has been releasing LLMs for years.
ipsum2•48m ago
Sort of. Phi models were just trained on GPT outputs though.
not_a_bot_4sho•9m ago
By design. The whole point of Phi is the "textbooks are all you need" theory on curated training data, as opposed to kitchen sinks.
kingstnap•6m ago
For those who don't know about this: Phi was announced with a paper called "Textbooks Are All You Need". What they did was use GPT-3.5 to create synthetic textbook chapters and exercises.

They also did some more interesting work like showing very small models can be coherent as long as you have very simple children's book style training data (TinyStories is pretty famous).

Lots of these ideas are still used. Learning facts at scale with active reading is an ICLR 2026 paper that does a lot of similar work.

jwitthuhn•45m ago
And occasionally un-releasing them like with WizardLM.
mattlondon•58m ago
Comparing against Claude 4.5? Aren't we up to 4.8? Bit disingenuous?
0vermorrow•57m ago
Latest Haiku (smallest Anthropic Model) is version 4.5, they haven't released a new version, hence the comparison to that.
klardotsh•54m ago
They're comparing to Haiku, not Opus. Haiku is currently at 4.5.

Even if it were Opus, comparing to a version number makes for an interesting snapshot-in-time comparison: if you knew how a model performed at whatever time it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA" - which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be such an open-weight model, following the western trend of locking models up.)

mentos•54m ago
Shouldn’t the next model's focus be on system design rather than code?

Seems like the work from a good system design to code is practically solved.

Now it’s a matter of the design of the system. Or is that represented in these evals?

dist-epoch•32m ago
Have you tried system design with LLMs? I find them pretty good at suggesting 5 architectures for a problem and then iterating on the solutions.

Even if I had no idea, going with the default suggestion would not be a terrible mistake, assuming you did describe your requirements relatively well.

LoganDark•47m ago
"Clean data" is impossible. Language models have polluted the landscape to such a degree it's impossible to filter them out now. OpenAI has no doubt discarded or muddled their dataset that was used to train the original ChatGPT, so there may be no dataset in existence now that isn't contaminated.
hmokiguess•47m ago
Does anyone actually use these smaller models for coding? If so, how? I usually Opus everything. Is the play to plan/design/architect with a heavier model, then delegate structured tasks to these smaller ones? I would appreciate hearing from someone who has tried and tested both paths.
ojr•43m ago
I use Gemini 3 Flash. I've seen the Claude Code setups; people who are bullish on Anthropic are driving up token usage, but I am able to produce outcomes for a fraction of the money.
hmokiguess•40m ago
Do you mind sharing your workflow? What do you mean by a fraction of the money? In my case, I have yet to reach a session limit on the subscription plan. I'm not "tokenmaxxing", as they say, so it's hard to see a scenario in which the plan is expensive for the value I get.
dist-epoch•37m ago
If you don't hit a limit running Opus, it means you are very much in the loop.

For example you probably don't have days where you ask Opus to review your whole code base and look for code duplication/technical debt/robustness issues, and then to fix some of the found issues, and do this 3-5 times until no big issues are found anymore.

hmokiguess•26m ago
What’s your prompt for this? The way you described it made it seem like there’s a generalizable way I can go about this. I just rely on a testing pipeline instead, so I can’t think of why I would need to proactively look for holes that the tests haven’t already found for me.
gslepak•46m ago
Would be cool if this were an open model.
striking•45m ago
To be clear about the size of the model: MAI-Code-1-Flash is 137B A5B.
camelmel•40m ago
Huh, according to that model card this is a 137B total parameter model.

Performance doesn't seem that good:

- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro

- Qwen3.6-35B-A3B = 49.5% on SWE-bench pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

They benchmark against Claude Haiku but Haiku is not good, it's worse than tiny open models you can run locally or via API at 10% the cost.
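For readers unfamiliar with the size labels used above, here is a small sketch that parses them and prints the active share. It assumes the "A<n>B" suffix denotes parameters active per token, as other comments in this thread describe it; `parse_moe_size` is a hypothetical helper, not part of any library.

```python
import re

def parse_moe_size(label):
    """Parse a size label like '137B-A5B' into (total_billion, active_billion)."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", label)
    if m is None:
        raise ValueError(f"unrecognised size label: {label!r}")
    return float(m.group(1)), float(m.group(2))

# Labels as quoted in this thread.
for name, label in [
    ("MAI-Code-1-Flash", "137B-A5B"),
    ("Qwen3.6", "35B-A3B"),
]:
    total, active = parse_moe_size(label)
    share = active / total
    print(f"{name}: {active:g}B of {total:g}B parameters active per token ({share:.1%})")
```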

giancarlostoro•30m ago
The takeaway is that this is a smaller model that competes with Haiku. I would hope they come out with a "Sonnet"-competing model, then an Opus one. I have been wondering why Microsoft is kind of "sleeping" on offering models they themselves have made on Copilot; maybe it was part of their deal with OpenAI? Not sure.
minraws•27m ago
They did release MAI-Thinking-1 to compete with Sonnet. Not sure why that isn't at the top here.
giancarlostoro•25m ago
Good question, and I missed that entirely!
kristjansson•5m ago
> 137B-A5B

Yeah, not a 5B param model as the earlier title implied!

efields•36m ago
Please test your websites in Safari. Almost all of your iOS users use it by default, and the desktop experience is pretty close to the mobile experience, so testing is easy.

That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).

jMyles•36m ago
I'd really like to get back to an autocomplete flow, ideally with some shared and optimized context with the relationship with my larger agent models.

But it seems like, by and large, even the faster models are now aimed at longer-running agentic flows and not sub-1s autocomplete. Or am I wrong about that?

zb3•34m ago
So it's not an open model while not being much better? Meh.
mmaunder•31m ago
You lost me at forced scrolling. Ugh!
Tepix•25m ago
From https://news.ycombinator.com/newsguidelines.html

Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.

kylehotchkiss•31m ago
"superintellegence team"

Why not assign them to make windows good :D

Marciplan•22m ago
"Build for developers, not benchmarks" Shouldn't that be.. Built?
giancarlostoro•18m ago
Mark Zuckerberg must be in crisis. Microsoft releasing models that compete with Claude's models. Meanwhile the only thing anyone knows about Mark's models is that they help you get hacked more easily.
yuppiepuppie•13m ago
Wait… I think he has moltbook IP as well that he can scale up.

Seriously tho, wtf is going on over at Meta? Anyone working there currently want to describe the vibe of the org when it comes to being a frontier company?

giancarlostoro•10m ago
I don't understand his plan. If I were him I'd either have gone all in on making RAM, which would become very lucrative, or have focused on building programming models. They've built some key open source technologies, but it's as if Mark Zuckerberg cannot run anything that isn't a social media company / project.
deckar01•15m ago
If only they had launched that yesterday I might have avoided Copilot auto model selection using a 9x model, quietly burning my monthly quota in a single afternoon.
bel8•6m ago
It's a start and I welcome competition, but I don't think I have ever used small models like Haiku 4.5. They are cute, but for serious coding they tend to waste your expensive time.

And this certainly won't bring me back to GitHub Copilot, which I cancelled yesterday.

GitHub Copilot had competitive pricing until yesterday, when they changed from per-request to one of the most expensive per-token quotas.

I have since changed to DeepSeek Flash on high, which is Sonnet+ level for almost free.

If I feel I still need smarter models, I might sign up for $20/mo Codex to use GPT 5.5 which, in my opinion, is the best I can access right now.

AJRF•3m ago
Copilot brand is tarnished, so time to bung everything under MAI?
dist-epoch•30m ago
The SOTA models will not shrink, because the problems will get bigger, from "write me a C compiler" to "clone Stripe business and run it".
necubi•21m ago
It's 5B active params in MoE, not 5B total params (total is 137B).
lemonish97•34m ago
They were mostly distilled or fine-tuned OAI models.
Havoc•15m ago
huh? The granite series isn't distilled
wirybeige•5m ago
Granite is IBM
dist-epoch•10m ago
tests will not find inconsistent naming, duplicate functions, scenarios you have not thought about testing

I use quite plain prompts, nothing fancy:

> go over the tests and do a code review, focusing on how well they test inventory management, planner and controller. maybe some tests need to be deleted, maybe other tests need to be added. the end goal should be good coverage of the core features.

> do a code review, focusing on robustness/correctness issues. validate that the code correctly implements specification.md. focus on the async client.

> there was a big refactor. please do a code review, focusing on eliminating tech debt. look for unused, obsolete or duplicate code that can be removed, look for mismatched interfaces, inconsistent function/argument/variable names. do not output what is correct, just the issues you found. for each issue output instructions for a coding agent on how to fix it. do not nitpick.

Marha01•7m ago
I use a similar workflow. Here is the refactoring prompt that I run regularly:

    Perform a thorough analysis of the <project_name> project (the code and the documentation).
    - Explore the project, go over all important files one by one and look for any mistakes or possible bugs.
    - Look for refactoring opportunities and ways to improve code quality and organization.
    - Identify any potential cruft/bloat, to ensure our code is clean and logically laid out. Keep in mind that efficient and good quality code needs to avoid over-engineered constructs and needless complexity. Avoid complicated logic where simple solutions would be more elegant.
    - Pay attention to comments: There should be enough of them to document the intent and provide high-level overview of the code logic, but not too much; avoid/remove excessive comments that simply restate the code logic or do not provide any useful information.
      - Every important function should have a top-level docstring comment that clearly explains its purpose, high-level logic overview, arguments, and return values.
    - Analyze the names of constants/variables/functions/classes and other code elements: could some of them be renamed to make their purpose more clear?
    - Analyze the documentation, uncover any potential inaccuracies/omissions and ensure the docs reflect the code.
    - Brainstorm ideas for improvements of the code and docs.
    
    After you finish the analysis, save an analysis report into "<project_name>_analysis_report.md" in the project root folder.
newusertoday•40m ago
plan using opus execute using local
killermouse0•39m ago
I was wondering the same. I guess it makes sense to use a heavyweight model to make the entire design and split up the work, so that smaller models (possibly local ones?) would then do the coding... But how would I even do that? I'm using Claude Code. Would I need support for this within the harness?
linuxhansl•28m ago
I am using Opus 4.x at work, and these "smaller" (20-80bn, 3-4bn active) models at home. Unfortunately there is no comparison, yet (IMHO anyway).

With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.

The smaller models seem to "try". They work for smaller tasks, but for more complex tasks it's often more work than doing it myself.

I wish it were different, and maybe in a year or two it will be.

veselin•13m ago
Claude Code itself spins up a lot of its subagents with Haiku. The model has a low hallucination rate, so it is great for exploration tasks. I guess that will be the best use of this model as well. That is a lot of tokens: many tasks spin up multiple exploration agents before the planning or fixing, which is then just a few tool calls.
altmanaltman•7m ago
I actually find planning/design easier with a smaller model and implementation easier with a larger one. I'm mostly working manually with the model on planning and design, the decisions are mine, and smaller models are faster. And when there's a clear design and way forward, the bigger models are usually better at understanding the overall context and applying the specific patch they were assigned. I call it the 1-2 punch system: you throw the first light punch, then the harder punch when it's actually important to hit properly. I know it goes against the standard of throwing the biggest model at design, but in my experience the bigger models try to do TOO MUCH and take a lot of time, which is not good in the design/arch/boilerplate phase.
glaslong•7m ago
I keep trying to, because I really want to make qwen 3.6 35b work for end implementation of a fleshed out spec (mostly for local data privacy reasons).

...but I spend so much more time correcting it, or building pipelines to try, retry, and converge, that it's rarely worthwhile for me in either time or $ spent vs Opus.
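A "try, retry, and converge" pipeline of the kind described can be sketched as a loop that feeds verifier errors back into the next attempt. This is only an illustration under stated assumptions: `converge`, `fake_generate` and `fake_verify` are hypothetical names, and the stand-in functions replace a real model call and a real test suite.

```python
from typing import Callable, Optional

def converge(
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    spec: str,
    max_attempts: int = 5,
) -> Optional[str]:
    """Call `generate` until `verify` reports no error, feeding each error back.

    `generate(spec, feedback)` returns a candidate; `verify(candidate)` returns
    None on success or an error message on failure. Returns None if no attempt
    passes within `max_attempts`.
    """
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(spec, feedback)
        feedback = verify(candidate)
        if feedback is None:
            return candidate
    return None

# Stand-ins so the sketch runs without a model: the fake "model" only
# produces the right answer once it has seen feedback.
def fake_generate(spec, feedback):
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"

def fake_verify(candidate):
    scope = {}
    exec(candidate, scope)  # acceptable here only because the input is our own stub
    return None if scope["add"](2, 3) == 5 else "add(2, 3) should be 5"

print(converge(fake_generate, fake_verify, "write add(a, b)"))
```

The cost the comment describes shows up in `max_attempts`: every failed round is another model call plus another verifier run.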

lanthissa•3m ago
I used to use Opus for everything; that's not an option once you move to a multi-agent system, unless you're working on something like high-end research. I could easily spend 3k a day if I were using Opus as just a normal dev.

As we build a better and better harness and better feedback/verifiers, we're switching more to 3.5 Flash. I think Chinese models would work too, but we can't use those atm.

Generally there's a coordinator running Opus and an ever-growing set of skills and subagents that take actions using weaker models and output feedback to the Opus coordinator.

I'm pretty convinced at this point we're past the level of intelligence needed for most tasks most devs do and that will trend down as we better build harnesses for our own codebases.
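The coordinator-plus-subagents layout described above might look roughly like this. It is a sketch under stated assumptions, not anyone's actual harness: `Model`, `Coordinator`, the "skill: task" plan format, and the stub models are all hypothetical and stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out

@dataclass
class Coordinator:
    planner: Model             # the stronger, more expensive model
    workers: Dict[str, Model]  # cheaper models keyed by skill name

    def run(self, goal: str) -> str:
        # The planner is asked for one "skill: task" step per line.
        plan = self.planner(f"PLAN\n{goal}")
        reports: List[str] = []
        for line in plan.splitlines():
            skill, _, task = line.partition(":")
            skill = skill.strip()
            worker = self.workers.get(skill)
            if worker is None:
                reports.append(f"{skill}: no such skill")
                continue
            reports.append(f"{skill}: {worker(task.strip())}")
        # Worker output goes back to the planner as feedback.
        return self.planner("REVIEW\n" + "\n".join(reports))

# Stub models so the sketch runs offline.
def stub_planner(prompt: str) -> str:
    if prompt.startswith("PLAN"):
        return "explore: find callers of parse()\nedit: rename parse() to parse_config()"
    return f"reviewed {len(prompt.splitlines()) - 1} reports"

stub_workers = {
    "explore": lambda task: f"done ({task})",
    "edit": lambda task: f"done ({task})",
}

print(Coordinator(stub_planner, stub_workers).run("rename parse"))
```

The expensive model is called twice per goal regardless of how many steps the plan has; everything in between runs on the cheaper workers.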