Instead of reading benchmark numbers, you can feel how fast or slow different configurations are by adjusting TTFT, token generation rate, and output length. It streams tokens exactly as an LLM would, but without generating any real content.
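The core idea is simple enough to sketch: wait for the TTFT, then emit placeholder tokens at a fixed rate. This is a minimal illustration in Python, not the project's actual code; the function name and parameters are my own.

```python
import time

def stream_fake_tokens(ttft_s=0.5, tokens_per_s=20.0, n_tokens=10):
    """Yield placeholder tokens with LLM-like timing: one TTFT delay,
    then a fixed gap between tokens."""
    time.sleep(ttft_s)               # time to first token
    interval = 1.0 / tokens_per_s    # gap between subsequent tokens
    for i in range(n_tokens):
        yield f"tok{i} "
        time.sleep(interval)

if __name__ == "__main__":
    start = time.monotonic()
    text = "".join(stream_fake_tokens(ttft_s=0.1, tokens_per_s=50, n_tokens=5))
    print(text, f"({time.monotonic() - start:.2f}s)")
```

Slowing `tokens_per_s` down to single digits makes the difference between a comfortable and a painful configuration immediately tangible.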
I was wondering which Apple machine I should buy, so I built this over the weekend to get a better feel for what it means to run a model locally.
The project/toy is also public on GitHub: https://github.com/htxsrl/localllmsimulation
Thanks to the cited sources for the real benchmarks, which let me fit a small ML model that extrapolates even to futuristic hardware (like an imaginary M9 with 2048 GB of RAM and 3000 GB/s of bandwidth).
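A useful first-order rule behind extrapolations like this: decoding is memory-bandwidth bound, since every generated token has to read (roughly) every weight once, so tokens/s scales with bandwidth divided by model size. This sketch is my own back-of-envelope version, with an illustrative efficiency factor; it is not the model fitted in the repo.

```python
def est_tokens_per_s(params_b, bits_per_weight, bandwidth_gb_s, efficiency=0.7):
    """Rough decode-speed estimate: bandwidth / model size, scaled by an
    assumed hardware-efficiency factor (0.7 is a guess, not a benchmark)."""
    model_gb = params_b * bits_per_weight / 8  # billions of params -> GB
    return efficiency * bandwidth_gb_s / model_gb

# e.g. a 70B model at 4-bit on the imaginary 3000 GB/s machine:
print(round(est_tokens_per_s(70, 4, 3000), 1))  # → 60.0
```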
ndgold•13m ago
Also, I’m seeing check marks next to all quants, which confused me a bit when trying to select one.
hertzdog•11m ago
So the check mark simply indicates that the model can actually run under those constraints (fits in memory), not that it’s selected.
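A "fits in memory" check like that can be sketched as: estimated model bytes (params × bytes per weight) plus some headroom for KV cache and the OS must be at most the machine's RAM. The function and the headroom value below are assumptions for illustration, not the repo's actual logic.

```python
def fits_in_memory(params_b, bits_per_weight, ram_gb, headroom_gb=4.0):
    """True if the quantized model plus an assumed headroom fits in RAM."""
    model_gb = params_b * bits_per_weight / 8  # billions of params -> GB
    return model_gb + headroom_gb <= ram_gb

print(fits_in_memory(7, 4, 16))   # 7B @ 4-bit ≈ 3.5 GB → True on 16 GB
print(fits_in_memory(70, 8, 64))  # 70B @ 8-bit ≈ 70 GB → False on 64 GB
```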