E.g.:
* Privacy
* Uptime
* Future cost structure controls
This is a field that has moved very quickly. And it has moved in a direction that tries to trap users into certain habits. But these habits might not align with what best benefits end users today or some time in the future.
Apart from that, as detailed in the article, pricing for local compute also depends on electricity prices.
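To make that concrete, here's a back-of-envelope sketch; the wattage, throughput, and $/kWh figures are assumptions for illustration, not measurements:

    # Electricity cost per million tokens for local inference.
    watts = 150          # assumed average draw under inference load
    tok_per_s = 25       # assumed local throughput (tokens/second)
    price_kwh = 0.30     # assumed residential electricity price ($/kWh)

    seconds_per_mtok = 1_000_000 / tok_per_s
    kwh_per_mtok = watts / 1000 * seconds_per_mtok / 3600
    print(f"electricity: ${kwh_per_mtok * price_kwh:.2f} per million tokens")
    # ~1.67 kWh/Mtok -> ~$0.50/Mtok under these assumptions

Double the electricity price and the per-token cost doubles with it, which is why the break-even math shifts with local rates.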
By the way, I don't want to snark, and my English is not very good, but it's "per se", not "per say". I'm only commenting on this petty thing because it seems to be a common misspelling, and it always trips me up a bit. It makes me wonder whether some other meaning is intended, like "from hearsay".
You also get the benefit of privacy, freedom from censorship, and control over the model used (i.e. it will not be rugpulled on you in three months after you've built a workflow around a specific model's idiosyncrasies).
Add to that the privacy improvements and data protection, plus the option of running further task-specific inference if needed, and it's a no-brainer.
Again, AI is a tool, and it's about the right tool for the job. I would wager, with no evidence looked up, that the majority of devs would be happy with 10-30 tokens per second locally.
But then they talk about using a newly purchased Mac to do the inference, running at full capacity, 24/7. Why would you do that? Apple silicon is fast, but as the author points out, you're only getting 10-40 tokens per second. That's not bad, but it's not meant for this!
It's comparing apples to oranges. Yeah, data centers don't pay residential electricity rates. Data centers use chips that are power efficient. Data centers use chips that aren't designed to be a Mac.
Apple silicon works out pretty well if you're not burning tokens 24/7/365 and you didn't buy the hardware specifically to do it. I use my Mac Studio a few times a week for things I need it for, but I can run ollama on it over the tailnet "for free". The economics work when I'm not trying to make my Mac Studio behave like an H100 cluster with liquid cooling. Which should come as no surprise to anyone: more tokens per watt on multi-tenant hardware with cheap electricity will pretty much always win.
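For anyone curious what that setup looks like, a minimal sketch, assuming a hypothetical Tailscale hostname ("mac-studio") and whichever model you've pulled; it hits ollama's standard REST endpoint on port 11434:

    # Query a remote ollama instance over the tailnet.
    import requests

    resp = requests.post(
        "http://mac-studio:11434/api/generate",
        json={"model": "llama3", "prompt": "Summarize this diff.",
              "stream": False},  # stream=False returns one JSON object
        timeout=120,
    )
    print(resp.json()["response"])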
The less you use a local LLM, the less sense it makes, since you paid a lot for hardware you don't use.
Shortening the lifespan?
But in _every_ metric other than privacy it was better to run via OpenRouter than a local model, and not by a small amount.
Direct link to the comparison charts:
https://sendcheckit.com/blog/ai-powered-subject-line-alterna...
* Industrial power pricing
* Wholesale hardware pricing
* Utilization density of shared API
mean the API always wins a cost shootout (toy numbers in the sketch below).
Privacy & tinkering are cool too, though.
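A toy shootout instantiating those three bullets; every input is an illustrative assumption, so treat the outputs as shape, not fact:

    # Amortized cost per million tokens: home box vs shared datacenter gear.
    def cost_per_mtok(hw_cost, life_years, utilization, tok_per_s, watts, price_kwh):
        seconds_used = life_years * 365 * 24 * 3600 * utilization
        mtok = tok_per_s * seconds_used / 1e6
        capex = hw_cost / mtok                              # amortized hardware
        power = watts / 1000 * seconds_used / 3600 * price_kwh / mtok
        return capex + power

    home = cost_per_mtok(5000, 5, 0.05, 25, 150, 0.30)     # idle most of the day
    dc = cost_per_mtok(30000, 5, 0.80, 1500, 700, 0.07)    # dense, shared, wholesale
    print(f"home ${home:.2f}/Mtok vs datacenter ${dc:.2f}/Mtok")

Utilization density does most of the work here: the home box pays for hardware it rarely uses.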
tl;dr:
Hardware depreciation costs are the major factor.
But if we assume ZERO hardware depreciation (not realistic), then local inference becomes super cheap: roughly 90%+ cheaper.
Third case: break-even happens only if we get, at the very least, 8.7 years of useful hardware life. A more realistic number, assuming 8 hrs/day of use rather than 24, is around 25 years (see the sketch below).
So, for now, local inference is preferable if you deeply care about privacy. From a cost perspective, it's still not there.
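A minimal sketch of the break-even arithmetic; every input is an assumed placeholder, so the outputs won't reproduce the exact 8.7/25-year figures, but they show the key property that break-even scales inversely with daily utilization:

    # Break-even years: hardware cost divided by yearly API savings.
    hw_cost = 5000.0          # assumed hardware cost ($)
    tok_per_s = 25            # assumed local throughput
    api_price_mtok = 2.0      # assumed API price ($ per million tokens)
    elec_cost_mtok = 0.5      # assumed electricity cost ($ per million tokens)

    def breakeven_years(hours_per_day):
        mtok_per_year = tok_per_s * 3600 * hours_per_day * 365 / 1e6
        savings_per_year = mtok_per_year * (api_price_mtok - elec_cost_mtok)
        return hw_cost / savings_per_year

    for h in (24, 8):
        print(f"{h:>2} h/day -> break-even in {breakeven_years(h):.1f} years")

Cut usage from 24 h/day to 8 and break-even stretches 3x, which roughly matches the shape of the 8.7-vs-25-year split above.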