Context:
Use cases: offline chatbots, smart cameras, keeping user data on-device for privacy
Models: 7–13B-parameter quantized models (e.g., Llama 2, Vicuna)
Constraints: limited RAM/flash, CPU-only or tiny GPU, intermittent connectivity
Questions:
What runtimes or frameworks are you using (ONNX Runtime, TVM, custom C++)? (sketch 1 below)
How do you handle model loading, eviction, and batching under tight memory? (sketch 2)
Any clever tricks for quantization, pruning, or kernel fusion that boost perf? (sketch 3)
How do you monitor and update models securely in the field? (sketch 4)
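To make the runtime question concrete, here's roughly the CPU-only ONNX Runtime setup I've been experimenting with. The model filename and thread counts are placeholders; swap in your own:

```cpp
// Sketch 1: minimal CPU-only ONNX Runtime session setup.
// "model-int8.onnx" and the thread counts are placeholders.
#include <onnxruntime_cxx_api.h>
#include <iostream>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "edge-llm");
    Ort::SessionOptions opts;
    opts.SetIntraOpNumThreads(4);   // match the big cores you actually have
    opts.SetInterOpNumThreads(1);   // single-stream inference on edge CPUs
    opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    Ort::Session session(env, "model-int8.onnx", opts);
    std::cout << "inputs: " << session.GetInputCount() << "\n";
    return 0;
}
```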
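For the memory question, this is the shape of the eviction layer I have in mind: an LRU cache of loaded models under a hard byte budget. `Model`, `load()`, and the sizes are hypothetical stand-ins for whatever your runtime actually hands back:

```cpp
// Sketch 2: LRU eviction of loaded models under a hard byte budget.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Model {
    std::string name;
    std::vector<uint8_t> weights;  // mmap'd in a real system
    size_t bytes() const { return weights.size(); }
};

class ModelCache {
  public:
    explicit ModelCache(size_t budget_bytes) : budget_(budget_bytes) {}

    // Return a cached model, or load it after evicting LRU entries.
    std::shared_ptr<Model> get(const std::string& name, size_t size_bytes) {
        auto it = index_.find(name);
        if (it != index_.end()) {
            // Hit: move to the front (most recently used) and return.
            lru_.splice(lru_.begin(), lru_, it->second);
            return *it->second;
        }
        // Evict from the tail until the new model fits. (A model larger
        // than the whole budget still loads; handle that case upstream.)
        while (!lru_.empty() && used_ + size_bytes > budget_) {
            auto victim = lru_.back();
            used_ -= victim->bytes();
            index_.erase(victim->name);
            lru_.pop_back();
        }
        auto model = load(name, size_bytes);
        used_ += model->bytes();
        lru_.push_front(model);
        index_[name] = lru_.begin();
        return model;
    }

  private:
    // Placeholder for the real loader (mmap + session init).
    static std::shared_ptr<Model> load(const std::string& name, size_t n) {
        return std::make_shared<Model>(Model{name, std::vector<uint8_t>(n)});
    }

    size_t budget_;
    size_t used_ = 0;
    std::list<std::shared_ptr<Model>> lru_;
    std::unordered_map<std::string,
                       std::list<std::shared_ptr<Model>>::iterator> index_;
};

int main() {
    ModelCache cache(512u << 20);                         // 512 MiB budget
    cache.get("llama2-7b-q4-part", 300u << 20);           // 300 MiB
    auto m = cache.get("vicuna-13b-q4-part", 400u << 20); // evicts the first
    std::cout << "resident: " << m->name << "\n";
}
```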
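On quantization: the baseline I'd compare any clever trick against is plain symmetric per-tensor INT8, sketched below. Group-wise and per-channel schemes refine the same idea:

```cpp
// Sketch 3: symmetric per-tensor INT8 quantization, the baseline
// behind most edge quantization formats.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

struct QTensor {
    std::vector<int8_t> q;
    float scale;  // dequantized value = q[i] * scale
};

QTensor quantize_int8(const std::vector<float>& w) {
    float amax = 0.f;
    for (float x : w) amax = std::max(amax, std::fabs(x));
    QTensor t;
    t.scale = amax > 0.f ? amax / 127.f : 1.f;
    t.q.reserve(w.size());
    for (float x : w)
        t.q.push_back(static_cast<int8_t>(std::lround(x / t.scale)));
    return t;
}

int main() {
    std::vector<float> w = {0.12f, -0.5f, 0.33f, -0.07f};
    QTensor t = quantize_int8(w);
    // Round-trip to inspect the quantization error.
    for (size_t i = 0; i < w.size(); ++i)
        std::cout << w[i] << " -> " << t.q[i] * t.scale << "\n";
}
```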
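And for field updates, the core of my current plan is a pinned vendor key plus detached Ed25519 signatures via libsodium (link with -lsodium). The paths, filenames, and key in this sketch are placeholders:

```cpp
// Sketch 4: verified model updates with libsodium detached signatures.
// The device ships with the vendor public key baked into firmware.
#include <sodium.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

static std::vector<unsigned char> read_file(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    return {std::istreambuf_iterator<char>(f),
            std::istreambuf_iterator<char>()};
}

// Returns true only if the downloaded blob verifies against the pinned
// public key; only then is it atomically moved into place.
bool install_update(const std::string& blob_path,
                    const std::string& sig_path,
                    const unsigned char pk[crypto_sign_PUBLICKEYBYTES]) {
    auto blob = read_file(blob_path);
    auto sig = read_file(sig_path);
    if (sig.size() != crypto_sign_BYTES) return false;
    if (crypto_sign_verify_detached(sig.data(), blob.data(),
                                    blob.size(), pk) != 0)
        return false;  // bad signature: reject, keep the old model
    // Atomic rename so a crash mid-update can't leave a torn file.
    return std::rename(blob_path.c_str(), "/models/current.bin") == 0;
}

int main() {
    if (sodium_init() < 0) return 1;
    unsigned char pk[crypto_sign_PUBLICKEYBYTES] = {/* vendor key here */};
    bool ok = install_update("/tmp/model.dl", "/tmp/model.sig", pk);
    std::printf("update %s\n", ok ? "installed" : "rejected");
}
```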
Looking forward to your benchmarks, war stories, and code pointers!
byte-bolter•2h ago