I'm mainly using GLM-4.7 these days, thanks to a subscription that seemed like a pretty good deal (fingers crossed Z.ai / Zhipu survive the year, or this will suck a little bit). It was nice and fast over the holidays, but it's much slower now. This is cool to see, but man, I'm pretty cost conscious and I don't think I'll reach for this often. Still, I hope it's an option I can reach for!!
Input tokens are really expensive here, relative both to their other models and to the market rate for GLM-4.7: $2.25/M tokens is ~4x what most providers charge, and ~3x their next most expensive model, Llama-3.3-70b. It's also advertised at half the speed of Llama. Output tokens are a little above market at $2.75/M vs ~$2/M, but really not bad.
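To put the price gap in concrete terms, here's some rough cost math using only the rates quoted above (the 2M-in / 0.5M-out workload size is a made-up example, and the "market" input rate is just $2.25 divided by the ~4x figure):

```python
# Rough cost comparison at per-million-token rates.
# Workload of 2M input / 0.5M output tokens is a hypothetical example.

def cost_usd(in_mtok: float, out_mtok: float, in_rate: float, out_rate: float) -> float:
    """Total USD cost for a workload, given $/M-token rates."""
    return in_mtok * in_rate + out_mtok * out_rate

workload = (2.0, 0.5)  # millions of input / output tokens

# Quoted rates: $2.25/M in, $2.75/M out.
cerebras = cost_usd(*workload, in_rate=2.25, out_rate=2.75)
# "Market" rates inferred from the comment: ~4x cheaper input, ~$2/M output.
market = cost_usd(*workload, in_rate=2.25 / 4, out_rate=2.00)

print(f"quoted rates: ${cerebras:.2f}")  # 2*2.25 + 0.5*2.75 = $5.88
print(f"market rates: ${market:.2f}")    # 2*0.5625 + 0.5*2.00 = $2.12
```

For input-heavy workloads (agents, long-context coding), the input rate dominates, which is why the ~4x input multiplier stings more than the modest output premium.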
Cerebras is retiring their Qwen3-235B at the end of the month. There's a very affordable GPT OSS 120B that's incredibly fast and cheap: 3000 t/s at $0.35/$0.75 per M tokens in/out! It'd be great to see something like MiniMax, which is supposedly very cheap to run on GPU, get ported too.
jauntywundrkind•14h ago