Well. It's noteworthy when the price is lower than the cost.
It's not that rare. But it is noteworthy, as it's not sustainable.
But, yes, if the cost to run the service is X and the optimal price is <X, you have a problem.
such insight
Grow at all costs with VC money, grab the market by offering unsustainably cheap prices, and when you have a monopoly, offer a slightly better or slightly worse version of the previous service (but hey, with an app that runs React!), at a slightly higher or same-as-before price (turns out profitability is important!), with much, much worse working conditions for everyone involved (VC needs their money back, it has to come from somewhere!)
Then look around confused when there is insane wealth inequality and social unrest.
Best example: flight seats. Economy class fills the plane, but business and first class are the money makers [1].
LLM pricing still seems very much up in the air though - models are getting more efficient, serving hardware is getting more efficient, use cases are evolving, and not all providers operate with the same business model (e.g. Meta, maybe China).
https://successfulsoftware.net/2013/02/28/how-i-increased-sa...
The article says it's a money loser, though, so I suspect that a lot of AI-based businesses run just fine from the lower-tier price point.
They might want to consider adding an “in-between” pricing tier.
Let's say you are a power user, so your queries and responses are complex and numerous, say 1000 tokens per query+response and 1 query every 10 minutes of an 8h workday. That's 48k tokens per workday, at 20 workdays per month that's 960k tokens per month.
So the cost (not sales price!) for those 960k tokens (roughly 1M) a month should be about $4.50.
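As a quick sanity check, here's the same arithmetic in Python (a minimal sketch; the $4.50 per 1M tokens serving-cost figure is the estimate from above, not a measured number):

    tokens_per_query   = 1_000   # query + response, as assumed above
    queries_per_hour   = 6       # one every 10 minutes
    hours_per_day      = 8
    workdays_per_month = 20

    tokens_per_day   = tokens_per_query * queries_per_hour * hours_per_day  # 48,000
    tokens_per_month = tokens_per_day * workdays_per_month                  # 960,000

    cost_per_1m_tokens = 4.50  # assumed serving cost in $, taken as given here
    monthly_cost = tokens_per_month / 1e6 * cost_per_1m_tokens
    print(f"${monthly_cost:.2f}/month")  # $4.32, i.e. roughly $4.50 for ~1M tokens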
Now you can go over the numbers again and think about where they might be wrong: Maybe a typical query is more than 1000 tokens. Maybe power users issue more queries. You might very well multiply by a factor of 10 here. Nvidia getting greedier for new GPUs? Add 50%. Data center and power cost too conservative, network and storage also important? Add 50%. 3 years of use for a GPU too long, because the field is very quickly adopting ever larger models? Add 50%. Usage factor not 100%, but lower, say a more realistic 50%? Double the cost. Llama4 not good enough, need a more advanced model? It may produce far fewer tokens per GPU-hour, but numbers are hard to come by.
With that, it's easy to imagine that one might still lose money at $200 per month.
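Stacking those corrections onto the base estimate makes that concrete (a sketch; every factor below is a guess from the paragraph above, not a measurement):

    base_cost = 4.50  # $/month from the estimate above
    corrections = [
        ("longer queries / more of them", 10.0),
        ("pricier GPUs",                   1.5),
        ("DC, power, network, storage",    1.5),
        ("shorter GPU lifetime",           1.5),
        ("50% utilisation",                2.0),
    ]
    cost = base_cost
    for reason, factor in corrections:
        cost *= factor
        print(f"{reason:32s} x{factor:4.1f} -> ${cost:7.2f}/month")
    # Ends around $303.75/month - already past a $200 subscription,
    # before even swapping in a heavier model than Llama4.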
For comparison, Azure sells OpenAI models in 1M-token batches, which can be compared directly to the monthly cost above.
https://developer.nvidia.com/blog/blackwell-breaks-the-1000-...
https://azure.microsoft.com/en-us/pricing/details/cognitive-...
It's good that it scales down with a higher number of paying subscriptions (each subscriber pays a smaller share of the training costs).
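A minimal sketch of that amortisation effect (every number here is hypothetical, chosen only to show the shape of the curve):

    def per_user_monthly_cost(subscribers: int,
                              training_cost: float = 100e6,   # hypothetical $
                              amortisation_months: int = 24,  # hypothetical window
                              inference_cost: float = 4.50) -> float:
        # Each subscriber carries their own inference cost plus an equal
        # share of the (hypothetical) training bill.
        return inference_cost + training_cost / (subscribers * amortisation_months)

    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9,} subscribers -> ${per_user_monthly_cost(n):8.2f}/month")
    # 10,000 -> $421.17; 100,000 -> $46.17; 1,000,000 -> $8.67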
The claim is per user. With batching, the aggregate is MUCH higher (72x).
The very short article [2] linked from [0], which is supposed to be the independent source of those numbers, also doesn't specify any details to that effect.
In general, I've learned to treat Nvidia's numbers very carefully. They are well known for presenting apples-to-orange-elephants comparisons, such as comparing FP16, FP8, or FP4 FLOPS, thereby grossly overstating the performance advantages of their new architectures [3].
[0] https://developer.nvidia.com/blog/blackwell-breaks-the-1000-...
[1] > NVIDIA DGX™ GB200 is purpose-built for training and inferencing trillion-parameter generative AI models. Designed as a rack-scale solution, each liquid-cooled rack features 36 NVIDIA GB200 Grace Blackwell Superchips—36 NVIDIA Grace CPUs and 72 Blackwell GPUs.
https://www.nvidia.com/en-eu/data-center/dgx-gb200/
[2] https://www.linkedin.com/feed/update/urn:li:activity:7331470...
[3] https://dev.to/maximsaplin/nvidias-1000x-performance-boost-c...
Sure, but the article already quotes 1000 TPS/user for an 8-GPU node, and the rack contains 9 nodes - i.e. 9x more GPUs, not 72x - so the 72000 TPS/server simply being a multiple of 72 seems like a red herring.
But yeah, I agree that 72x seems high - although 9x alone seems low, given that vLLM shows over 20x speedups with continuous batching. I guess there are a lot of variables.
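One way to lay the two readings side by side (the TPS figures are from [0]; the per-node split is my interpretation, not NVIDIA's):

    tps_per_user = 1_000   # claimed per-user rate [0]
    tps_rack     = 72_000  # claimed aggregate for the rack [0]
    nodes        = 9       # 8 GPUs each, 72 GPUs total [1]

    concurrent_users = tps_rack / tps_per_user            # 72 users rack-wide
    tps_per_node     = tps_rack / nodes                   # 8,000 TPS per 8-GPU node
    batching_gain    = tps_rack / (nodes * tps_per_user)  # ~8x vs one user per node

    print(concurrent_users, tps_per_node, batching_gain)  # 72.0 8000.0 8.0
    # So "72x" is really users-per-rack; the batching gain relative to one
    # user per node is ~8x, sitting between the 9x and 20x figures above.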
My queries are more like 30,000 tokens of input for 50 tokens of output.
Mentally modelling the pricing as determined by output doesn't match reality, in my experience.
It's also fundamentally different economics, since input is gated by VRAM capacity (the KV cache grows with context length) while output is gated by compute.
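A quick sketch of why that asymmetry matters for a 30k-in/50-out workload (the per-token prices below are placeholders, not any provider's actual rates):

    in_tokens, out_tokens = 30_000, 50
    in_price, out_price   = 2.50, 10.00  # hypothetical $ per 1M tokens

    in_cost  = in_tokens / 1e6 * in_price    # $0.07500
    out_cost = out_tokens / 1e6 * out_price  # $0.00050
    print(f"input ${in_cost:.5f}  output ${out_cost:.5f}")
    # Input is ~150x of the per-query cost here, even with output priced
    # 4x higher per token - an output-based mental model misses this entirely.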
It costs $200 because the chatty little bot knows a surprising number of things amazingly well, and does decent work pretty darn fast.
Whether that value is worth the money is a different discussion, one that is rarely had with big tech offerings.
Right now it almost certainly costs more to run than the subscription price.
If it had stayed a non-profit, would people have donated enough to keep it in business? Not enough people are willing to donate even to keep a browser maker in business.
Of course there is a large cost in building a SOTA model in the first place, and maybe in building your own datacenter(s) for inference too, but compare that to something like semiconductor manufacturing, where upfront costs are also very high yet profit margins remain reasonable, e.g. ~40% for TSMC, which makes chips for NVIDIA, AMD, Apple... As long as there is a possibility of competition (primarily Samsung in this case), profit margins will be held in check.