Root cause: a missing block-size division in the tensor stride calculation inside create_tensor() in llama-model-loader.cpp. The wrong stride cascades into an overflow in ggml_nbytes(), exceeding max_buffer_size on 32-bit platforms, where size_t is only 32 bits.
On 64-bit devices the inflated value is silently masked: it is wrong, but still within GPU memory limits, so nobody noticed. The bug has likely been there for years.
Fix and context: https://github.com/Perinban/llama.cpp/tree/axon-dev