Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge

https://thinkpol.ca/2026/04/30/an-open-weights-chinese-model-just-beat-claude-gpt-5-5-and-gemini-in-a-programming-challenge/

77•bazlightyear•1h ago

Comments

magicalhippo•41m ago

In a single challenge, measured by how performant the solution was.

Kimi K2.6 is definitely a frontier-sized model, so on the one hand it's not that surprising it's up there with the closed frontier models.

Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

echelon•36m ago

This is the future though. Open weights models that run on H200s provide far more opportunity to build products and real infrastructure around.

You can always distill this for your little RTX at home. But models shaped for consumer hardware will never win wide adoption or remain competitive with frontier labs.

This is something that _can_ compete. And it will both necessitate and inspire a new generation of open cloud infra to run inference. "Push button, deploy" or "Push button, fine tune" shaped products at the start, then far more advanced products that only open weights not locked behind an API can accomplish.

Now we just need open weights Nano Banana Pro / GPT Image 2, and Seedance 2.0 equivalents.

The battle and focus should be on open weights for the data center.

bitmasher9•22m ago

I don’t fully understand what open weights unlocks that cannot be accomplished via API from a product standpoint.

Open weights is great if you want to do additional training, or if you need on-prem for security.

stldev•12m ago

Or try to beat Anthropic's uptime.

mkl•10m ago

Multiple providers of the same model. That means competition for price, reliability, latency, etc. It also means you can use the same model as long as you want, instead of having it silently change behaviour.
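
A minimal sketch of what provider competition looks like from the caller's side, assuming each provider of the same open-weights model is wrapped as a plain function from prompt to text. The names `complete_with_failover`, `provider_a`, and `provider_b` are hypothetical, and the stubs stand in for real API clients.

```python
from typing import Callable, Sequence

# A provider is assumed to be any function that maps a prompt to a text reply.
Provider = Callable[[str], str]


def complete_with_failover(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each provider in order; return (provider name, reply) from the first that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers for illustration only.
def provider_a(prompt: str) -> str:
    raise ConnectionError("simulated outage")


def provider_b(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = complete_with_failover(
    "hello", [("provider_a", provider_a), ("provider_b", provider_b)]
)
```

Ordering the list by price or latency is one way to turn that competition into a routing policy; with a closed model behind a single API there is no second entry to fall back to.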

keyle•29m ago

It absolutely does matter.

The enshittification will go unnoticed at first but I'm already finding my favourite frontier models severely nerfed, doing incredibly dumb stuff they weren't in the past.

We need open weight models to have a stable "platform" when we rely on them, which we do more and more.

magicalhippo•19m ago

Most people won't roll out their own K2 deployment across rented GPUs, so in that sense it doesn't matter that much: they'll be using a paid service which is just as much of a black box as Claude or ChatGPT. For example, on OpenRouter you can select a provider that states it uses a given open model, but you have no idea what actually goes on behind the curtains, which quantization levels they use and so on.

That said, I do fully agree that it is valuable to have open near-frontier models, as a balance to the closed ones.

DeathArrow•24m ago

>Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

Of course it matters because that makes coding plans much cheaper than those from Anthropic and OpenAI.

For personal use I have coding plans with GLM 5.1, Kimi K2.6, MiniMax M2.7 and Xiaomi MiMo V2.5 Pro and I am getting a lot of bang for the buck.

magicalhippo•17m ago

Currently it's not a huge difference given the subsidies of closed model subscriptions. Once that stops then yea it will be really nice to have open models as price competitors.

PedroBatista•40m ago

Great to know, but what was the cost both in terms of $$ and tokens used?

Not to invalidate these benchmark results because they are useful, but the real usefulness is what they are capable of doing when real people interact with them at scale.

Regardless, this is good news, because now that Microsoft is basically giving up their all-in strategy with GitHub's Copilot and Anthropic is playing the "I'm too good for you" game, it's about time for them to get pressed into not making this AI world into a divide between the haves and the have-nots.

keyle•27m ago

Re pricing. Never as high as frontier commercial models.

beering•38m ago

I’m a little confused as to the setup. It was asking each model to one-shot a script and then the scripts faced off? Were the models given a computer environment? Or a test server to iterate against?

rpmisms•37m ago

Sounds incredibly simple to me. One-shot.

beering•28m ago

So nothing like real-world coding, where you’d be able to run and test the script before submitting?

Frannky•35m ago

I have to try Kimi. I was looking for an alternative. If you have any experience or advice, please share. I saw Kimi is at the top of the OpenRouter ranking.

zorked•23m ago

I use Kimi at home via a kimi.com subscription and Kimi CLI (sometimes running inside Zed, sometimes not). My favorite model by far. And it's just $20.

I have to use a supposedly frontier model at work and I hate it.

Frannky•16m ago

Nice, thanks for sharing!

DeathArrow•21m ago

Kimi K2.6 is great but I advise you to get a coding plan from Kimi.com, as that is much cheaper than paying for API calls through OpenRouter.

Frannky•17m ago

Thanks, I am trying it right now. I had an opencode plan at $5/month, so I will play with that. I use Zed and I added Pi ACP, so I can try both pi and Kimi. I will also try it in opencode and via Kimi code.

prvnsmpth•6m ago

Use kimi 2.6 for planning and a cheap model (preferably local) for execution, and then kimi once again for reviewing it. Then finally I review the code. Saves a lot on tokens.
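
A minimal sketch of the plan/execute/review split described above, assuming each model is reachable as a plain function from prompt to text. The names `run_pipeline`, `strong_model`, and `cheap_model` are hypothetical, and the stub models stand in for real hosted or local model calls.

```python
from typing import Callable

# A model is assumed to be any function that maps a prompt to a text reply.
Model = Callable[[str], str]


def run_pipeline(task: str, planner: Model, executor: Model, reviewer: Model) -> dict:
    """Plan with a strong model, execute with a cheap one, review with the strong one."""
    plan = planner(f"Write a step-by-step plan for this task:\n{task}")
    code = executor(f"Implement this plan exactly:\n{plan}")
    review = reviewer(
        f"Review this code against the plan.\nPlan:\n{plan}\nCode:\n{code}"
    )
    return {"plan": plan, "code": code, "review": review}


# Stub models for illustration only; each echoes the first line of its prompt.
def strong_model(prompt: str) -> str:
    return f"[strong] {prompt.splitlines()[0]}"


def cheap_model(prompt: str) -> str:
    return f"[cheap] {prompt.splitlines()[0]}"


result = run_pipeline("add a retry wrapper", strong_model, cheap_model, strong_model)
```

The final human review stays outside the function: the returned `review` text is input to that last read, not a replacement for it.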

elromulous•33m ago

Is the site just slashdotted rn? Can anyone get to it?

jakemanger•28m ago

What are the GPU VRAM requirements for this thing?

Awesome to have an open model that can compete, but damn it would be so much better if you could run it locally. Otherwise, it's so difficult to run (e.g. self-host) that it's just way more convenient to pay OpenAI, Anthropic, etc.

DeathArrow•19m ago

>Otherwise, it's almost so difficult to run (e.g. self host) that it's just way more convenient to pay OpenAI, Claude, etc

Getting a coding plan from Kimi.com will make coding 20x cheaper than using Anthropic.

BTW, I am using it with Claude Code.

slashdave•25m ago

I was surprised by the ranking, until I read what the test was. Not horribly relevant for coding.

The current ranking of all tests makes more sense (well, except for how well Gemini does)

https://aicc.rayonnant.ai

pbreit•18m ago

All my co-workers say Claude blows away Gemini. Is it really that good? How can I do Kimi?

prvnsmpth•9m ago

You can sign up for a plan on the kimi code platform and use it via the pi.dev coding agent, or opencode. In planning, I’d say it’s almost on par with Claude Opus.

justech•15m ago

I’ve been maining Kimi k2.6 through opencode go and openrouter for a week and I can say it’s the same experience as when I was maining Sonnet 3.5/4 late last year.

Not as good or as fast as Claude Code on Opus now but definitely enough for casual/hobby use. The best part is multiple choices for providers, if opencode gimps their service, I’ll switch

jrecyclebin•14m ago

I absolutely love Kimi's personality - some of the things it says are so out there! And it's been great for very focused, iterative work.

Its weakness is that it seems to yak on and on when it needs to plan out something big or read through and make sense of how to use a niche piece of a complex library. To the point where it can fill up its 256k window and rack up a bill. (No cache.) I have had better experience with GLM 5.1 in those cases.

Anyone out there relate?

anderber•5m ago

Absolutely. I use caveman to help with that: https://github.com/JuliusBrussee/caveman

gertlabs•6m ago

I'm glad we're seeing a shift towards objectively scored tests.

We've been doing this at scale at https://gertlabs.com/rankings, and although the author looks to be running unique one-off samples, it's not surprising to see how well Kimi K2.6 performed. Based on our testing, for coding especially, Kimi is within statistical uncertainty of Qwen3.6 Max and MiMo V2.5 Pro, and performs much better with tools than DeepSeek V4 Pro.

GPT 5.5 has a comfortable lead, but Kimi is on par with or better than Opus 4.6. The problem with Kimi 2.6 is that it's one of the slower models we've tested.

