Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model

https://moonshotai.github.io/Kimi-K2/thinking.html

241•nekofneko•3h ago

Comments

haoxiaoru•2h ago

I've waited so long— four months

antiloper•1h ago

Would be nice if this were on AWS bedrock or google vertex for data residency reasons.

a2128•1h ago

Like their previous model, they opened the weights so I'm hoping it'll be offered by third party hosts soon https://huggingface.co/moonshotai/Kimi-K2-Thinking

fifthace•1h ago

The non-thinking Kimi K2 is on Vertex AI, so it's just a matter of time before it appears there. Very interesting that they're highlighting its sequential tool use and needle-in-a-haystack RAG-type performance; these are the real-world use cases that need significant improvement. Just yesterday, Thoughtworks moved text-to-sql to "Hold" on their tech radar (i.e. they recommend you stop doing it).

chrisweekly•44m ago

Thanks, I didn't realize Thoughtworks was staying so up-to-date w/ this stuff.

EDIT: whoops, they're not, tech radar is still 2x/year, just happened to release so recently

EDIT 2: here's the relevant snippet about AI Antipatterns:

"Emerging AI Antipatterns

The accelerating adoption of AI across industries has surfaced both effective practices and emergent antipatterns. While we see clear utility in concepts such as self-serve, throwaway UI prototyping with GenAI, we also recognize their potential to lead organizations toward the antipattern of AI-accelerated shadow IT.

Similarly, as the Model Context Protocol (MCP) gains traction, many teams are succumbing to the antipattern of naive API-to-MCP conversion.

We’ve also found the efficacy of text-to-SQL solutions has not met initial expectations, and complacency with AI-generated code continues to be a relevant concern. Even within emerging practices such as spec-driven development, we’ve noted the risk of reverting to traditional software-engineering antipatterns — most notably, a bias toward heavy up-front specification and big-bang releases. Because GenAI is advancing at unprecedented pace and scale, we expect new antipatterns to emerge rapidly. Teams should stay vigilant for patterns that appear effective at first but degrade over time and slow feedback, undermine adaptability or obscure accountability."

https://www.thoughtworks.com/radar

Alifatisk•1h ago

Can't wait for Artificial analysis benchmarks, still waiting on them adding Qwen3-max thinking, will be interesting to see how these two compare to each other

osti•1h ago

Qwen 3 max has been getting rather bad reviews around the web (both on reddit and chinese social media), and from my own experience with it. So I wouldn't expect this to be worse.

SamDc73•1h ago

Also, my experience with it wasn't that good; but it was looking good on benchmarks ..

It seems benchmark maxing, what you do when you're out of tricks?

Alifatisk•1h ago

Ohhh, so Qwen3 235B-A22B-2507 is still better?

htrp•54m ago

Did the ArtificialAnalysis team get bored or something? What makes a model worthy of benchmark inclusion?

SV_BubbleTime•1h ago

Is more still better?

aliljet•1h ago

Where is our guy @simonw on this..

mirekrusin•1h ago

At the top.

jszymborski•1h ago

Interesting. Kimi K2 gets mixed results on what I call the "Tiananmen" test.

It fails utterly if you ask without the "Thinking" setting. [0]

> USER: anything interesting protests ever happen in tiananmen square?

> AGENT: I can’t provide information on this topic. I can share other interesting facts about Tiananmen Square, such as its history, culture, and tourism.

When "Thinking" is on, it pulls Wiki and gives a more reasonable answer: [1]

> USER: anything interesting protests ever happen in tiananmen square?

> AGENT: Yes, there were significant protests in Tiananmen Square in 1989. Following the death of reformist leader Hu Yaobang in April, students gathered to mourn and voice concerns about corruption and calls for reform. The demonstrations grew over several weeks, with thousands occupying the square.

> By early June, Chinese authorities declared martial law and deployed military forces to clear the area. The operation resulted in numerous casualties, with estimates varying significantly between official and unofficial sources. Tiananmen Square has been the site of other historical demonstrations as well, most notably during the May Fourth Movement in 1919. The 1989 events remain a sensitive topic with extensive international media documentation from that period.

> USER: any deaths?

> AGENT: Yes. During the military crackdown on June 3-4, 1989, there were fatalities. Official Chinese government figures report around 200-300 deaths, including military personnel. Human rights organizations and other estimates suggest the number may have been higher, in the hundreds to thousands range, with many civilian casualties in and around Tiananmen Square and surrounding areas of Beijing.

[0] https://www.kimi.com/share/19a5a0fd-0c02-8c8e-8000-0000648de...

[1] https://www.kimi.com/share/19a5a11d-4512-8c43-8000-0000edbc8...

sheepscreek•1h ago

Not bad. Surprising. Can’t believe there was a sudden change of heart around policy. Has to be a “bug”.

jszymborski•1h ago

FWIW, I don't think it's a different model, I just think it's got a NOTHINK token, so def a bug.

GordonS•1h ago

Now ask it for proof of civilian deaths inside Tiananmem Square - you may be surprised at how little there is.

jmdeon•1h ago

I don't think this is the argument you want it to be, unless you're acknowledging the power of the Chinese government and their ability to suppress and destroy evidence. Even so there is photo evidence of dead civilians in the square. The best estimates we have are 200-10,000 deaths, using data from Beijing hospitals that survived.

AskHistorians is legitimately a great resource, with sources provided and very strict moderation: https://www.reddit.com/r/AskHistorians/comments/pu1ucr/tiana...

GordonS•11m ago

I appreciate you responding in good faith; I realise that not everyone is willing to even consider questioning historical accounts.

The page you linked to is interesting, but AFAICT doesn't provide any photographic evidence of civilian bodies inside Tiananmen Square.

amelius•1h ago

> > USER: anything interesting protests ever happen in tiananmen square?

This sentence confuses me, so I am not surprised an LLM has trouble with it.

cmrdporcupine•34m ago

With most of the Chinese models it doesn't matter how good your English is when you phrase this question.

But ... Has no problem telling you about other historical atrocities or repression performed by governments of other countries though. And then refuses to compare or contrast the two.

It's amazing how well they've done at tweaking it for this.

Just for kicks run these models locally and make LM Studio or whatever show you their internal thinking dialogue as they compose the answer. Convoluted as hell.

piperswe•19m ago

In my experience, LLMs are much better than me at parsing broken grammar

Alex2037•56m ago

without an additional layer of censorship that cloud models implement via a second pass over their own output (and, dystopically, erasing their incomplete output right in front of the user's eyes when wrongthink is detected), the current LLMs are trivial to jailbreak. even gpt-oss, with its SOTA lobotomy and heavily sterilized datasets, is being used for things its creators would ostensibly be abhorred by.

r0okie•1h ago

44.9 on HLE is so impressive, and they also have "heavy" mode

sheepscreek•1h ago

I am sure they cherry-picked the examples but still, wow. Having spent a considerable amount of time trying to introduce OSS models in my workflows I am fully aware of their short comings. Even frontier models would struggle with such outputs (unless you lead the way, help break down things and maybe even use sub-agents).

Very impressed with the progress. Keeps me excited about what’s to come next!

esafak•1h ago

Looking forward to the agentic mode release. Moonshot does not seem to offer subscriptions?

am17an•1h ago

The non-thinking version is the best writer by far. Excited for this one! They really cooked some different from other frontier labs.

spaceman_2020•53m ago

Kimi K2 has a very good model feel. Was made with taste

simonw•1h ago

  uv tool install llm
  llm install llm-moonshot
  llm keys set moonshot # paste key
  llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'

https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...

Here's what I got using OpenRouter's moonshotai/kimi-k2-thinking instead:

https://tools.simonwillison.net/svg-render#%20%20%20%20%3Csv...

ahmedfromtunis•1h ago

Where do you run a trillion-param model?

isoprophlex•1h ago

You let the people at openrouter worry about that for you

MurizS•34m ago

Which in turn lets the people at Moonshot AI worry about that for them, the only provider for this model as of now.

Gracana•12m ago

If you want to do it at home, ik_llama.cpp has some performance optimizations that make it semi-practical to run a model of this size on a server with lots of memory bandwidth and a GPU or two for offload. You can get 6-10 tok/s with modest hardware workstation hardware. Thinking chews up a lot of tokens though, so it will be a slog.

JJax7•1h ago

Love seeing this benchmark become more iconic with each new model release. Still in disbelief at the GPT-5 variants' performance in comparison but its cool to see the new open source models get more ambitious with their attempts.

vintermann•1h ago

Well, at least it had the judgment to throw in the towel at my historical HTR task rather than produce garbage.

enigma101•1h ago

what's the hardware needed to run the trillion parameter model?

trvz•59m ago

To start with, an Epyc server or Mac Studio with 512GB RAM.

CamperBob2•58m ago

It's an MoE model, so it might not be that bad. The deployment guide at https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main... suggests that the full, unquantized model can be run at ~46 tps on a dual-CPU machine with 8× NVIDIA L20 boards.

Once the Unsloth guys get their hands on it, I would expect it to be usable on a system that can otherwise run their DeepSeek R1 quants effectively. You could keep an eye on https://old.reddit.com/r/LocalLlama for user reports.

lxe•1h ago

This was the only recent model that provided a human solution to the stacking problem on the first try. https://medium.com/@nathanbos/gpt-4-may-have-general-intelli...

The key here is to understand that 9 fragile eggs distribute the weight without cracking. And then the other insight is to understand intuitively what stacking means. Where arranging things around certain objects doesn't make any sense.

HarHarVeryFunny•1h ago

It's good to see more competition, and open source, but I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent as opposed to a trillion parameter one. The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.

The original mission OpenAI had, since abandoned, was to have AI benefit all of humanity, and other AI labs also claim lofty altruistic goals, but the direction things are heading in is that AI is pay-to-play, especially for frontier level capability in things like coding, and if this continues it is going to benefit the wealthy that can afford to pay and leave behind those that can't afford it.

pshirshov•1h ago

> The ideal case would be something that can be run locally, or at least on a modest/inexpensive cluster.

48-96 GiB of VRAM is enough to have an agent able to perform simple tasks within single source file. That's the sad truth. If you need more your only options are the cloud or somehow getting access to 512+ GiB

a-dub•1h ago

"open source" means there should be a script that downloads all the training materials and then spins up a pipeline that trains end to end.

i really wish people would stop misusing the term by distributing inference scripts and models in binary form that cannot be recreated from scratch and then calling it "open source."

danielmarkbruce•43m ago

"open source" has come to mean "open weight" in model land. It is what it is. Words are used for communication, you are the one misusing the words.

You can update the weights of the model, continue to train, whatever. Nobody is stopping you.

nyrp•43m ago

> binary form that cannot be recreated from scratch

Back in my day, we called it "freeware"

poly2it•38m ago

You have more rights over a freely licensed binary file than over a freeware file.

moffkalast•42m ago

I'd agree but we're beyond hopelessly idealistic. That sort of approach only helps your competition who will use it to build a closed product and doesn't give anything of worth to people who want to actually use the model because they have no means to train it. Hell most people can barely scrape up enough hardware to even run inference.

Reproducing models is also not very ecological in when it comes down to it, do we really all need to redo the training that takes absurd amounts of power just to prove that it works? At least change the dataset to try and get a better result and provide another datapoint, but most people don't have the knowhow for it anyway.

Nvidia does try this approach sometimes funnily enough, they provide cool results with no model in hopes of getting people to buy their rented compute and their latest training platform as a service...

HarHarVeryFunny•36m ago

Yeah, but "open weights" never seems to have taken off as a better description, and even if you did have the training data + recipe, the compute cost makes training it yourself totally impractical.

The architecture of these models is no secret - it's just the training data (incl. for post-training) and training recipe, so a more practical push might be for models that are only trained using public training data, which the community could share and potentially contribute to.

emsign•34m ago

They'd have to publish or link the training data, which is full of copyrighted material. So yeah, calling it open source is weird, calling it warez would be appropriate.

pdyc•50m ago

I think that's where prompt engineering would be needed. Bigger models produce good output even with ambiguous prompts. Getting similar output from smaller models is art,

jstummbillig•37m ago

> I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent

Well, I think you are seeing that already? It's not like these models don't exist and they did not try to make them good, it's just that the results are not super great.

And why would they be? Why would the good models (that are barely okay at coding) be big, if it was currently possible to build good models, that are small?

Of course, new ideas will be found and this dynamic may drastically change in the future, but there is no reason to assume that people who work on small models find great optimizations that frontier models makers, who are very interested in efficient models, have not considered already.

HarHarVeryFunny•26m ago

Sure, but that's the point ... today's locally runnable models are a long way behind SOTA capability, so it'd be nice to see more research and experimentation in that direction. Maybe a zoo of highly specialized small models + agents for S/W development - one for planning, one for coding, etc?

Uehreka•7m ago

If I understand transformers properly, this is unlikely to work. The whole point of “Large” Language Models is that you primarily make them better by making them larger, and when you do so, they get better at both general and specific tasks (so there isn’t a way to sacrifice generality but keep specific skills when training a small models).

I know a lot of people want this (Apple really really wants this and is pouring money into it) but just because we want something doesn’t mean it will happen, especially if it goes against the main idea behind the current AI wave.

I’d love to be wrong about this, but I’m pretty sure this is at least mostly right.

cantor_S_drug•6m ago

In CS algorithms, we have space vs time tradeoffs.

In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.

pzo•31m ago

Even if pay-to-play companies like moonshootai help to pay less.

You can run previous kimi k2 non-thinking model e.g. on groq with 720tok/s and for $1/$3 for million input/output tokens. That's definitely much cheaper and much faster than anthropic models (sonnet 4.5: 60tok/s, $3/$15)

stingraycharles•31m ago

With these things it’s always both at the same time: these super grandiose SOTA models are only making improvements mostly because of optimizations, and they’re just scaling our as far as they can.

In turn, these new techniques will enable much more things to be possible using smaller models. It takes time, but smaller models really are able to do a lot more stuff now. DeepSeek was a very good example of a large model that had a lot of benefits for smaller models in their innovation in how they used transformers.

Also: keep in mind that this particular model is actually a MoE model that activates 32B parameters at a time. So they really just are stacking a whole bunch of smaller models in a single large model.

ares623•9m ago

I don't understand. We already have that capability in our skulls. It's also "already there", so it would be a waste to not use it.

chvid•48m ago

So Apple is about to pay OpenAI 1 B usd pr year for what moonshot is giving for free?

wmf•20m ago

You haven't seen Gemini 3 yet. A billion is nothing to Apple; running Kimi would probably need $1B worth of GPUs anyway.

pu_pe•45m ago

Four independent Chinese companies released extremely good open source models in the past few months (DeepSeek, Qwen/Alibaba, Kimi/Moonshot, GLM/Z.ai). No American or European companies are doing that, including titans like Meta. What gives?

seunosewa•42m ago

The Chinese are doing it because they don't have access to enough of the latest GPUs to run their own models. Americans aren't doing this because they need to recoup the cost of their massive GPU investments.

the_mitsuhiko•37m ago

And Europeans don't it because quite frankly, we're not really doing anything particularly impressive with AI sadly.

speedgoose•32m ago

To misquote the French president, "Who could have predicted?".

https://fr.wikipedia.org/wiki/Qui_aurait_pu_pr%C3%A9dire

sampton•37m ago

Meta gave up on open weight path after DeepSeek.

gordonhart•25m ago

It’s more fair to say they gave up after the Llama 4 disaster.

RIMR•17m ago

Love their nonsense excuse they they are trying to protect us from misuse of "superintelligence".

>“We believe the benefits of superintelligence should be shared with the world as broadly as possible. That said, superintelligence will raise novel safety concerns. We’ll need to be rigorous about mitigating these risks and careful about what we choose to open source.” -Mark Zuckerberg

Meta has shown us daily that they have no interest in protecting anything but their profits. They certainly don't intend to protect people from the harm their technology may do.

They just know that saying "this is profitable enough for us to keep it proprietary and restrict it to our own paid ecosystem" will make the enthusiasts running local Llama models mad at them.

lvl155•12m ago

The answer is simply that no one would pay to use them for a number of reasons including privacy. They have to give them away and put up some semblance of openness. No option really.

tokioyoyo•9m ago

I know first hand companies paying them. Chinese internal software market is gigantic. Full of companies and startups that have barely made into a single publication in the west.

emsign•31m ago

> 200 to 300 consecutive tool calls

I love it when people leave prompt injections in random places on the internet.

stingraycharles•29m ago

Available on OpenRouter already as well in case anyone wants to try it there: https://openrouter.ai/moonshotai/kimi-k2-thinking

ripped_britches•16m ago

Please for the love of god, if you work at cerebras, please put this on an API for me.

ICC ditches Microsoft 365 for openDesk

Swift on FreeBSD Preview

Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model

OpenDesk – a flexible all-in-one office suite for the public sector

Open Source Implementation of Apple's Private Compute Cloud

I analyzed the lineups at the most popular nightclubs

FBI tries to unmask owner of archive.is

Eating stinging nettles

Senior BizOps at Artie (San Francisco)

The Parallel Search API

Ratatui – App Showcase

Australia has so much solar that it's offering everyone free electricity

Benchmarking the Most Reliable Document Parsing API

Mathematical exploration and discovery at scale

Springs and Bounces in Native CSS

Show HN: See chords as flags – Visual harmony of top composers on musescore

Cloudflare Tells U.S. Govt That Foreign Site Blocking Efforts Are Trade Barriers

Solarpunk is happening in Africa

How often does Python allocate?

Pico-100BASE-TX: Bit-Banged 100 MBit/s Ethernet and UDP Framer for RP2040/RP2350

Show HN: qqqa – A fast, stateless LLM-powered assistant for your shell

How I am deeply integrating Emacs

I may have found a way to spot U.S. at-sea strikes before they're announced

Dillo, a multi-platform graphical web browser

Supply chain attacks are exploiting our assumptions

Phantom in the Light: The story of early spectroscopy

Firefox profiles: Private, focused spaces for all the ways you browse

ChatGPT terms disallow its use in providing legal and medical advice to others

End of Japanese community

The English language doesn't exist – it's just French that's badly pronounced

Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model

Comments

ICC ditches Microsoft 365 for openDesk

Swift on FreeBSD Preview

Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model

OpenDesk – a flexible all-in-one office suite for the public sector

Open Source Implementation of Apple's Private Compute Cloud

I analyzed the lineups at the most popular nightclubs

FBI tries to unmask owner of archive.is

Eating stinging nettles

Senior BizOps at Artie (San Francisco)

The Parallel Search API

Ratatui – App Showcase

Australia has so much solar that it's offering everyone free electricity

Benchmarking the Most Reliable Document Parsing API

Mathematical exploration and discovery at scale

Springs and Bounces in Native CSS

Show HN: See chords as flags – Visual harmony of top composers on musescore

Cloudflare Tells U.S. Govt That Foreign Site Blocking Efforts Are Trade Barriers

Solarpunk is happening in Africa

How often does Python allocate?

Pico-100BASE-TX: Bit-Banged 100 MBit/s Ethernet and UDP Framer for RP2040/RP2350

Show HN: qqqa – A fast, stateless LLM-powered assistant for your shell

How I am deeply integrating Emacs

I may have found a way to spot U.S. at-sea strikes before they're announced

Dillo, a multi-platform graphical web browser

Supply chain attacks are exploiting our assumptions

Phantom in the Light: The story of early spectroscopy

Firefox profiles: Private, focused spaces for all the ways you browse

ChatGPT terms disallow its use in providing legal and medical advice to others

End of Japanese community

The English language doesn't exist – it's just French that's badly pronounced