New coding models and integrations

221•meetpateltech•3mo ago

Comments

qwe----3•3mo ago

Just a paste of llama.cpp without attribution.

mchiang•3mo ago

https://github.com/ollama/ollama?tab=readme-ov-file#supporte...

swyx•3mo ago

i mean they have attributed but also it's open source software, i guess the more meaningful question is why didn't ggerganov build Ollama if it was that easy? or what is his company working on now?

homarp•3mo ago

>what is gg working on

supporting models so ollama can then 'support' them too

if you use llama.cpp server, it's quite a nice experience. you can even directly download stuff from Huggingface.

homarp•3mo ago

e.g. https://nitter.poast.org/ggerganov/status/197849119484267774...

monkmartinez•3mo ago

I can not answer for GG, but the early days of llama.cpp were crazy and everything was so very hacky. Remember, Textgen-webui was 'the way' to run models at first because it supported so many different quant types and file extensions. At the time, most people were using multiple different quantization methods and it was really hard to figure out which were performing better or worse objectively.

GGUF/GGML was like the 4th iteration of file type quantization from llama.cpp and I remember that I had to consciously begin watching the bandwidth usage from my ISP. Up to that point, I had never received an email warning me about reaching limits of my 2TB connection. All for the same models just in different forms. TheBloke was pumping out models like he had unlimited time/effort.

I say all that to say, llama.cpp was still trying, dare I say 'inventing', all the things throughout these transitions. Ollama comes in to make the running part easier and less CLI flag dependent building off of llama.cpp. Awesome.

GG and company are down in the trenches of the models architecture with CUDA, Vulkan, CPU, ROCm, etc. They are working on perplexity, token processing/generation and just look at the 'bin' folder when you compile the project. There are so many different aspects to make the whole thing work as well at it does. It's amazing that we have llama-server at all with the amount of work that has gone into making llama.cpp.

All that to say, Ollama shit the bed on attribution. They were called out on r/localllama very early on for not really giving credit to llama.cpp. They have a soiled reputation with the people that participate in that sub-reddit at least. They were called out for not contributing back if I remember correctly as well, which further stained their reputation among the folks who hang in that sub-reddit.

So it's not a matter of "ease" to build what Ollama built... At least from the perspective of someone who has been paying close attention from r/localllama; the problem was/is simply the perception (right or wrong) of the meme; Person 2 to person 1: "You built this?" -> Person 2: takes item/thing -> person 2: Holds up item/thing -> "I built this". A simple act that really pissed off the community in general.

speedgoose•3mo ago

Ollama is more than a paste. But the support for GLM 4.6 is indeed coming from llama.cpp: https://github.com/ollama/ollama/issues/12505#issuecomment-3...

I don’t know how much Ollama contributes to llama.cpp

am17an•3mo ago

The answer is 0

CaptainOfCoit•3mo ago

> I don’t know how much Ollama contributes to llama.cpp

If nothing else, Ollama is free publicity for llama.cpp, at least when they acknowledge they're mostly using the work of llama.cpp, which has happened at least once! I found llama.cpp by first finding Ollama and then figured I'd rather avoid the lock-in of Ollama's registry, so ended up using llama.cpp for everything.

speedgoose•3mo ago

By the way, you can use hugging face with ollama, and local modelfiles too.

CaptainOfCoit•3mo ago

You're saying that like you cannot do that with llama.cpp? I feel like most Ollama users seem to have no idea what features/benefits directly come from llama.cpp rather than Ollama itself...

simgt•3mo ago

I read the opposite, that you don't have to be locked-in Ollama's registry if you don't want to.

Could you share a bit more of what you do with llama.cpp? I'd rather use llama-serve but it seems to require a good amount of fiddling with the parameters to have good performance.

mtone•3mo ago

Recently llama.cpp made a few common parameters default (-ngl 999, -fa on) so it got simpler: --model and --context-size and --jinja generally does it to start.

We end up fiddling with other parameters because it provides better performance for a particular setup so it's well worth it. One example is the recent --n-cpu-moe switch to offload experts to CPU while filling all available VRAM that can give a 50% boost on models like gpt-oss-120b.

After tasting this, not using it is a no-go. Meanwhile on Ollama there's an open issue asking for this: https://github.com/ollama/ollama/issues/11772

Finally, llama-swap separately provides the auto-loading/unloading feature for multiple models.

monkmartinez•3mo ago

Nailed it. To make matters worse, Ollama obfuscate the models so their users don't really know what they are running until they dig into the model file. Only then can they see that what they thought was Deepseek-r1 is actually an 8B qwen distillation of Deepseek-r1, for example.

Luckily, we have Jan.ai and LM Studio which are happy to run GGUF models at full-tilt on various hardware configs. Added bonus; both include very nice API server as well.

jhancock•3mo ago

I've been using GLM-4.6 since its release this month. It's my new fav. Using it via Claude Code and the more simple Octofriend https://github.com/synthetic-lab/octofriend

Hosting through z.ai and synthetic.new. Both good experiences. z.ai even answers their support emails!! 5-stars ;)

mchiang•3mo ago

Z.ai team is awesome and very supportive. I have yet to try synthetic.new. What's the reason for using multiple? Is it mainly to try different models or are you hitting some kind of rate limit / usage limit?

jhancock•3mo ago

I tried synthetic.new prior to GLM-4.6...Starting in August...So I already had a subscription.

When z.ia launched GLM-4.6, I subscribed to their Coding Pro plan. Although I haven't been coding as heavy this month as the prior two months, I used to hit Claude limits almost daily, often twice a day. That was with both the $20 and $100 plans. I have yet to hit a limit with z.ai and the server response is at least as good as Claude.

I mention synthetic.new as it's good to have options and I do appreciate them sponsoring the dev of Octofriend. z.ai is a China company and I think hosts in Singapore. That could be a blocker for some.

mchiang•3mo ago

Do you find yourself sticking with GLM 4.6 over Claude for some tasks? Or do you find yourself still wanting to reach for Claude?

jhancock•3mo ago

I have been subscribing to both Claude and ChatGPT for over two years. Spent several months on Claude's $100 plan and couple months on ChatGPT's $200 plan but otherwise using their $20/month plans.

I cancelled Claude two weeks ago. Pure GLM-4.6 now and a tad of codex with my ChatGPT Pro subscription. I sometimes use ChatGPT for extended research stuff and non-tech.

theshrike79•3mo ago

I was a hardcore Claude fan too, but Sonnet 4.5 + the new weekly limits are really annoying.

I could deal with the limits, but holy shit is Sonnet 4.5 chatty. It produces as much useless crap as Opus 4.1 did. Might feel fun for Vibe Coders when the model pumps out tons of crap, but I want it to do what I asked, not try to get extra credit with "advanced" solutions and 500+ row "reports" after it's done. FFS.

Been testing crush + z.ai GLM 4.6 through Openrouter (had some credits in there it seems =) for this evening and I'm kinda loving it.

riskable•3mo ago

Z.ai is on the US Entities (banned from export/collab) list:

> “These entities advance the People’s Republic of China’s military modernization through the development and integration of advanced artificial intelligence research. This activity is contrary to the national security and foreign policy interests of the United States under Section 744.11 of the EAR.”

https://medium.com/ai-disruption/zhipu-ai-chinas-leading-lar...

tinfoilhatter•3mo ago

And Microsoft has been instrumental in helping to facilitate Israel's genocide of Palestinian people. Meta / Facebook did it in Myanmar. If you're paying to use any AI product, you're more than likely giving money to companies that either directly or indirectly contribute to genocide.

hodgehog11•3mo ago

My experience using GLM-4.6 with Charm Crush has been absolutely incredible, especially with high thinking. This is on pretty hard tasks too, e.g. proving small lemmas with Lean.

I've had much less luck with other agentic software, including Claude Code. For these kinds of tasks, only Codex seems to come close.

dkga•3mo ago

I had good experience with Codex iterating to prove a fixed point theorem. But will also now consider GLM-4.6.

bn-l•3mo ago

$3 a month and using it in Claude code is a matter of changing a few env vars which you copy and paste from their docs. Cost benefit wise there is nothing better.

codebje•3mo ago

$6/month. It's $3 for the first month (or first months, on longer subscription cycles, but it's first unit of subscription cycle at half price only).

At $6/month it's still pretty reasonable, IMO, and chucking less than $10 at it for three months probably gets you to the next pop-up token retailer offering introductory pricing, so long as the bubble doesn't burst before then.

bn-l•3mo ago

They changed the price after it got popular. Still good value imo

codebje•3mo ago

I wound up subbing for the three months. My experience has been pretty positive, using Cline in VS Code and moving away from Qwen3 Coder on OpenRouter. Q3C did a good job overall but I was using the free model (with $10 credit sitting on OR to increase limits) and that's pretty painful to sit through repeated 429 errors. GLM-4.6 has been comparable, maybe a fraction worse, but without 429 errors it blazes through tasks.

esafak•3mo ago

https://z.ai/subscribe

bravura•3mo ago

How do you use a non Anthropic model with Claude Code?

jhancock•3mo ago

they have a Claude Code specific endpoint...see the excellent docs https://docs.z.ai/devpack/tool/claude

theK•3mo ago

Why don't companies have an about page on their home page any more? How an I supposed to know what z.foo is really about? Random page vibes?

</rant>

shijithpk•3mo ago

Here you go -- https://zhipuai.cn/en/aboutus

glauber•3mo ago

I use liteLLM local running on Docker.

https://www.litellm.ai/

mike_d•3mo ago

> For users with more than 300GB of VRAM, qwen3-coder:480b is also available locally.

I haven't really stayed up on all the AI specific GPUs, but are there really cards with 300GB of VRAM?

bakugo•3mo ago

No, you need multiple GPUs. These models are not intended to be run by the average user.

OneDeuxTriSeiGo•3mo ago

Not necessarily. You need either multiple GPUs or unified memory. There are a handful of UM platforms out there nowadays (mainly Macs but AMD has some as well albeit none with 300GB ram)

packetslave•3mo ago

Also the just-released DGX Spark from Nvidia (although it "only" has 128gb of unified memory)

Computer0•3mo ago

One of its defining features is the ability to link them together at speeds about the same as their ram speed iirc.

Hamuko•3mo ago

You can buy an M3 Ultra Mac Studio and configure it with 512 GB of memory shared between the CPU and the GPU. Will set you back about $9500.

Schlagbohrer•3mo ago

And that'll be two orders of magnitude slower right?

speedgoose•3mo ago

In addition to the already mentioned Apple Mac Studio, NVIDIA sells the GH200 with up to 480GB of VRAM.

My local HPC went for the 120GB version though, but 4 per node.

bigyabai•3mo ago

Been disappointed to see Ollama list models that are supported by the cloud product but not the Ollama app. It's becoming increasingly hard to deny that they're only interested in model inference just to turn a quick buck.

mchiang•3mo ago

Qwen3-coder:30b is in the blog post. This is one that most users will be able to run locally.

We are in this together! Hoping for more models to come from the labs in varying sizes that will fit on devices.

bigyabai•3mo ago

I'm looking forward to future ollama releases that might attempt parity with the cloud offerings. I've since moved onto the Ollama compatibility API on KoboldCPP since they don't have any such limits with their inference server.

mchiang•3mo ago

I am super hopeful! Hardware is improving, inference costs will continue to decrease, models will only improve...

Balinares•3mo ago

How does Qwen3-Coder:30B compare to Instruct-2507 as a coding agent backend? I was under the impression that Instruct was intended to supersede Coder?

hephaes7us•3mo ago

In this case, it's not about whether it fits on my physical hardware or not. It's about what seems like an arbitrary restriction designed to start pushing users to their cloud offering.

colesantiago•3mo ago

I know this is disappointing, but what business model would be best here for ollama?

1. Donationware - Let's be real, tokens are expensive and if they ask for everyone to chip in voluntarily people wouldn't do that and Ollama would go bust quickly.

2. Subscriptions (bootstrapped and no VCs) again like 1. people would have to pay for the cloud service as a subscription to be sustainable (would you?) or go bust.

3. Ads - Ollama could put ads in the free version but to remove them the users can pay for a higher tier, a somewhat good compromise, except developers don't like ads and don't like pay for their tools unless their company does it for them. No users = Ollama goes bust.

4. VCs - This is the current model which is why they have a cloud product and it keeps the main product free (for now). Again, if they cannot make money or sell to another company Ollama goes bust.

5. Fully Open Source (and 100% free) with Linux Foundation funding - Ollama could also go this route, but this means they wouldn't be a business anymore for investors and rely on the Linux Foundation's sponsors (Google, IBM, etc) for funding the LF to stay sustainable. The cloud product may stay for enterprises.

Ollama has already taken money from investors so they need to produce a return for them so 5. isn't an option in the long term.

6. Acquisition by another company - Ollama could get acquired and the product wouldn't change* (until the acquirer jacks up prices or messes with the product) which ultimately kills it anyway as the community moves on.

I don't see any other way that Ollama can not be enshittified without making a quick buck.

You just need to avoid VC backed tools and pay for bootstrapped ones without any ties to investors.

CaptainOfCoit•3mo ago

> I don't see any other way that Ollama can not be enshittified without making a quick buck.

Me neither. The mistake they did was getting outside investments, as now they're no longer in full control and eventually are gonna have to at least give the impression they give a shit about the investors, and it'll come at the cost of the users one way or another.

Please pay for your tools that are independently developed, we really need more community funding of projects so we can avoid this never-ending spiral of VC-fueled+killed tools.

sanex•3mo ago

They got the investments before the company was even ollama. They exist because their VC was ok with them pivoting to build the current product. It's likely it wouldn't exist without the funding.

CaptainOfCoit•3mo ago

I dunno, the founders could have also changed their frame of mind, and started the project without VC investments. AFAIK, the founding team worked at Docker before, a company that doesn't pay peanuts, so I'm sure they could have scraped together enough to bootstrap it initially.

But I understand the added zeros to the (maybe) future payout when you take VC funds is hard to ignore, I'm not blaming them for anything really.

hephaes7us•3mo ago

Tokens are expensive, sure, but I don't even _want_ Ollama to run inference for me.

Ollama gives me, essentially, a wrapper for llama.cpp and convenient hosting where I can download models.

I'm happy to pay for the bandwidth, plus a premium to cover their running this service.

I'm furthermore happy to pay a small charge to cover the development that they've done and continue to do to make local-inference easy for me.

zozbot234•3mo ago

Aren't these models consistently quite large and hard to run locally? It's possible that future Ollama releases will allow you to dynamically manage VRAM memory in a way that enables these models to run with acceleration on even modest GPU hardware (such as by dynamically loading layers for a single 'expert' into VRAM, and opportunistically batching computations that happen to rely on the same 'expert' parameters - essentially doing manually what mmap does for you in CPU-only inference) but these 'tricks' will nonetheless come at non-trivial cost in performance.

vladsanchez•3mo ago

Ok, so that glm-4.6 doesn't/can't run locally? That's quite a disappointment

zozbot234•3mo ago

For those interested in building Ollama locally, note that as of a few hours ago, experimental Vulkan Compute support (will not be in official binary releases as of yet) has been merged on the github main branch and you can test it on your hardware!

mchiang•3mo ago

this one is exciting. It'll enable and accelerate a lot of devices on Ollama - especially around AMD GPUs not fully supported by ROCm, Intel GPUs, and iGPUs across different hardware vendors.

Schlagbohrer•3mo ago

For those running locally with more VRAM than an NVIDIA 4090 or 5090, what are you using to get more than 32GB of VRAM?

qqxufo•3mo ago

Interesting to see more people mentioning GLM-4.6 lately — I’ve tried it briefly and it’s surprisingly strong for reasoning tasks. Curious how it compares to Claude 3.5 in coding throughput though?

esafak•3mo ago

Has anybody that has tried their cloud product care to comment? How does it compare with Anthropic's and OpenAI's offerings in terms of speed and limits?

skeeter2020•3mo ago

Seems appropriate that the top-level image is a ~sheep~ llama wearing a headband that says "coder"...

lghh•3mo ago

Why? Can you explain?

danans•3mo ago

Question for those using local models for coding assistance: how well do the best locally runnable models (running on a laptop with a GPU) work for the easy case:

Writing short runs of code and tests after I give an clear description of the expected behavior (because I have done the homework). I want to save the keystrokes and the mental energy spent on bookkeeping code, not have it think about the big problem for me.

Think short algorithms/transformations/script, and "smart" auto complete.

No writing entire systems/features or creating heavily interpolated things due to underspecified prompts - I'm not interested in those.

tomck•3mo ago

I have tried a model on my laptop+GPU before, and it is incredibly unusable. Incredibly slow and just bad output for exactly the work you describe

If you're looking for a cheap practical tool + don't care if it's not local, deepseek's non-reasoning model via openrouter is the most cost efficient by far for the work you describe.

I put 10 dollars in my account about 6 months ago and still haven't gotten through it, after heavy use semi regularly.

Ask HN: Codex 5.3 broke toolcalls? Opus 4.6 ignores instructions?

Vectors and HNSW for Dummies

Sanskrit AI beats CleanRL SOTA by 125%

'Washington Post' CEO resigns after going AWOL during job cuts

Claude Opus 4.6 Fast Mode: 2.5× faster, ~6× more expensive

TSMC to produce 3-nanometer chips in Japan

Quantization-Aware Distillation

List of Musical Genres

Show HN: Sknet.ai – AI agents debate on a forum, no humans posting

University of Waterloo Webring

Large tech companies don't need heroes

Backing up all the little things with a Pi5

Game of Trees (Got)

Human Systems Research Submolt

The Threads Algorithm Loves Rage Bait

Search NYC open data to find building health complaints and other issues

Michael Pollan Says Humanity Is About to Undergo a Revolutionary Change

Show HN: Grovia – Long-Range Greenhouse Monitoring System

Ask HN: The Coming Class War

Mind the GAAP Again

The Yardbirds, Dazed and Confused (1968)

Agent News Chat – AI agents talk to each other about the news

Do you have a mathematically attractive face?

Code only says what it does

The success of 'natural language programming'

The Scriptovision Super Micro Script video titler is almost a home computer

Discovering the "original" iPhone from 1995 [video]

Psychometric Comparability of LLM-Based Digital Twins

SidePop – track revenue, costs, and overall business health in one place

The Other Markov's Inequality

Ask HN: Codex 5.3 broke toolcalls? Opus 4.6 ignores instructions?

Vectors and HNSW for Dummies

Sanskrit AI beats CleanRL SOTA by 125%

'Washington Post' CEO resigns after going AWOL during job cuts

Claude Opus 4.6 Fast Mode: 2.5× faster, ~6× more expensive

TSMC to produce 3-nanometer chips in Japan

Quantization-Aware Distillation

List of Musical Genres

Show HN: Sknet.ai – AI agents debate on a forum, no humans posting

University of Waterloo Webring

Large tech companies don't need heroes

Backing up all the little things with a Pi5

Game of Trees (Got)

Human Systems Research Submolt

The Threads Algorithm Loves Rage Bait

Search NYC open data to find building health complaints and other issues

Michael Pollan Says Humanity Is About to Undergo a Revolutionary Change

Show HN: Grovia – Long-Range Greenhouse Monitoring System

Ask HN: The Coming Class War

Mind the GAAP Again

The Yardbirds, Dazed and Confused (1968)

Agent News Chat – AI agents talk to each other about the news

Do you have a mathematically attractive face?

Code only says what it does

The success of 'natural language programming'

The Scriptovision Super Micro Script video titler is almost a home computer

Discovering the "original" iPhone from 1995 [video]

Psychometric Comparability of LLM-Based Digital Twins

SidePop – track revenue, costs, and overall business health in one place

The Other Markov's Inequality

New coding models and integrations

Comments