Thanks! Any idea why I'm getting such poor performance on these new models? Whether Small or Tiny, on my 24GB 7900 XTX I'm only seeing about 8 tokens/s using the latest llama.cpp with Vulkan. Even if it were running 4x faster than this, I'd still be asking why I'm getting so few tokens/s when the models are supposed to bring increased inference efficiency.
danielhanchen•4mo ago
Oh I think it's a Vulkan backend issue - someone raised it with me and said the ROCm backend is much faster.
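If you want to confirm where the slowdown is, a rough way to compare backends is to run the same generation through the llama-cpp-python bindings built once against Vulkan and once against ROCm and time the output. This is just a sketch with placeholder model path and prompt, assuming the bindings are installed for the backend you want to test:

    # Rough tokens/s check via llama-cpp-python (placeholder model path/prompt).
    # Run the same script against a Vulkan build and a ROCm build to compare.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="model.gguf",  # placeholder: your quantized GGUF file
        n_gpu_layers=-1,          # offload all layers to the GPU
    )

    start = time.time()
    out = llm("Explain the difference between Vulkan and ROCm in one paragraph.",
              max_tokens=256)
    elapsed = time.time() - start

    generated = out["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")

If the ROCm build comes out several times faster on the same model and quant, that points at the Vulkan backend rather than the model itself.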