That's a bit odd, given that Ollama uses llama.cpp to do the inference...
Ollama is essentially a dead, yet semi-popular, project with a really good PR team. If you really want to do it right, you use llama.cpp.
They've been out hustling, handshaking, dealmaking, and big-businessing their butts off, whether or not they clearly credit the titans, like Georgi Gerganov, whose work they're wrapping, and you are NO ONE to stand in their way.
Do NOT blow this for them. Understand? They've scooted under the radar successfully this far, and they will absolutely lose their shit if one more peon points out how little they contribute upstream in return for what they've taken, support that could have gone to the project's originator.
Ollama supports its own implementation of ggml, btw. ggml is a mysterious format that no one knows the origins of, which is all the more reason to support Ollama, imo.
In a single-GPU situation, my 7900 XTX has gotten me farther than a 4080 would have, and matches the performance I'd expect from a 4090 for $600 less, while drawing 50-100 W less power.
Now, if you're buying used hardware, yeah, go buy the used high-VRAM Nvidia models, the ones with 80+ GB, rather than new. You can't buy the equivalent used from AMD customers yet; they're happily holding onto their cards, which perform so well that the need to upgrade just isn't there yet.
But is the absence of CUDA a constraint? Do neural networks work "out of the box"? How much of a hassle (if at all) is it to make things work? Do you run into incompatible software?
Most GPU software in the world is written against Vulkan, not CUDA, and CUDA only works on a minority of hardware. Not only that, AMD has a compatibility layer for CUDA, called HIP, part of the ROCm stack, that isn't the most optimized thing in the world but gets me most of the performance I'd expect from a comparable Nvidia product.
Most software in the world (not just machine learning stuff) is written against a cross-vendor API (OpenGL, OpenCL, Vulkan, the DirectX family). Nvidia continually sending the message of "use CUDA" really means "we suck at standards compliance, and we're not good at the APIs most software is written in." Since everyone has realized the emperor wears no clothes, they've been backing off on that and slowly improving their standards compliance for the other APIs; eventually you won't need the crutch of CUDA, and you shouldn't be writing new software in it today.
Nvidia also has a bad habit of just dropping things without warning when they're done with them; don't be an Nvidia victim. Even if you buy their hardware: buying new hardware is easy, rewriting away from CUDA isn't (although it's certainly doable, especially with AMD's HIP to help you). Just don't write CUDA today and you're golden.
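For a concrete feel of how thin that layer is, here's a minimal HIP sketch (mine, not from any project mentioned here; it assumes ROCm and hipcc are installed): a SAXPY kernel plus the host-side boilerplate. Every hip* call maps almost one-to-one onto a cuda* counterpart, which is why ROCm's hipify tools can do most of a port mechanically, and hipcc can even build the same source for Nvidia GPUs.

    // Minimal HIP sketch (illustrative only): SAXPY on the GPU.
    // Each hip* call has a direct cuda* equivalent, e.g.
    // hipMalloc <-> cudaMalloc, hipMemcpy <-> cudaMemcpy.
    #include <hip/hip_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

        float *dx, *dy;
        hipMalloc(&dx, n * sizeof(float));
        hipMalloc(&dy, n * sizeof(float));
        hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
        hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

        // CUDA would be saxpy<<<blocks, threads>>>(...); hipcc accepts that too.
        hipLaunchKernelGGL(saxpy, dim3((n + 255) / 256), dim3(256), 0, 0,
                           n, 2.0f, dx, dy);
        hipDeviceSynchronize();

        hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
        printf("y[0] = %f\n", hy[0]);  // expect 4.0

        hipFree(dx);
        hipFree(dy);
        return 0;
    }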
Did you mean that the maximum rate at which it could be obtained is "bandwidth / size"?
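That's the usual back-of-the-envelope bound: generating a token streams (roughly) the whole set of weights through memory once, so tokens per second can't exceed memory bandwidth divided by the model's size in memory. A tiny sketch, with made-up numbers purely for illustration:

    // Back-of-the-envelope only; the numbers are hypothetical.
    // Token generation streams ~all weights per token, so:
    //   max tokens/s ~= memory bandwidth / model size in memory
    #include <cstdio>

    int main() {
        const double bandwidth_gb_s = 256.0;  // hypothetical memory bandwidth
        const double model_size_gb  = 24.0;   // hypothetical weights resident in RAM
        printf("upper bound: ~%.1f tokens/s\n",
               bandwidth_gb_s / model_size_gb);  // ~10.7
        return 0;
    }

So with those illustrative numbers, a 24 GB model on a 256 GB/s box tops out around 10-11 tokens/s before compute or KV-cache traffic even enter the picture.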
So I just got a cheap (~350 USD) mini PC to keep me going until the better stuff came out: 24 GB of RAM and a 6c/12t CPU, from a company I'd not heard of called Bosgame (dunno why the article keeps calling them Bosman, unless they have a different name in other countries; it's definitely https://www.bosgamepc.com/products/bosgame-m5-ai-mini-deskto... )
So my good machine might end up coming from the same place as my cheap one!
I get there are uses where local is required, and as much as the boy racer teen in me loves those specs, I just can't see myself going in on hardware like that for inference.
magicalhippo•8mo ago
Though there has been a modular option called LPCAMM[1]. However, AFAIK it doesn't support the speeds this box's specs state.
Recently a newer connector, SOCAMM, has been launched[2], which does support the high memory speeds, but it's only just hitting the market and going into servers first, AFAIK.
[1]: https://www.anandtech.com/show/21069/modular-lpddr-becomes-a...
[2]: https://www.tomshardware.com/pc-components/ram/micron-and-sk...
hnuser123456•8mo ago