Yes, but it's possible to batch calls when feeding data through the neural network, so LLM libraries may support that.
See, for example, this article[1], which gives a brief overview of batching calls with vLLM.
[1]: https://medium.com/ubiops-tech/how-to-optimize-inference-spe...
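As a minimal sketch of what that looks like with vLLM's offline Python API (the model name below is just a placeholder): passing a list of prompts to generate() lets vLLM schedule them through the GPU together instead of running one forward pass per prompt.

    from vllm import LLM, SamplingParams

    # Load the model once; the model name is a placeholder.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(max_tokens=128, temperature=0.8)

    # A list of prompts is batched internally by vLLM's scheduler,
    # rather than being processed one request at a time.
    prompts = [
        "Summarize the plot of Hamlet.",
        "Explain what continuous batching is.",
        "Write a haiku about GPUs.",
    ]
    outputs = llm.generate(prompts, params)

    for out in outputs:
        print(out.outputs[0].text)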
al_borland•7mo ago
It sounds like we may have hit similar limits, using slightly different means to get there.