Quick note on how my batch embedding engine, IgniteMS, works and how I built it.
The whole thing runs as a single Rust process that reads input, tokenizes, packs batches, and keeps the queue full. TensorRT handles inference. Python is only a wrapper.
I built it this way because once you use more than a couple of GPUs, the GPUs stop being the bottleneck. The CPU cannot feed them fast enough. A single A100 can go through batches faster than Python can tokenize and feed them, so the GPU just sits idle waiting for work. Most of my time went into optimizing this. At 8 GPUs that was basically the entire challenge.
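To make the feeding problem concrete, here is a minimal sketch of the general pattern: one thread tokenizes and packs batches, and a bounded queue hands them to the consumer. This is not IgniteMS code. The whitespace `tokenize`, the `run_pipeline` function, and the `batch_size` and `queue_depth` parameters are all made up for illustration, and the consumer loop only stands in for GPU inference.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for a real tokenizer: one fake token id
/// (the word length) per whitespace-separated word.
fn tokenize(text: &str) -> Vec<u32> {
    text.split_whitespace().map(|w| w.len() as u32).collect()
}

/// Tokenizes and packs `texts` on a producer thread, pushes batches through
/// a bounded queue, and returns how many sequences the consumer received.
fn run_pipeline(texts: Vec<String>, batch_size: usize, queue_depth: usize) -> usize {
    assert!(batch_size > 0, "batch_size must be non-zero");

    // Bounded channel: `send` blocks once `queue_depth` batches are waiting,
    // so the producer cannot run arbitrarily far ahead of the consumer.
    let (tx, rx) = sync_channel::<Vec<Vec<u32>>>(queue_depth);

    let producer = thread::spawn(move || {
        for chunk in texts.chunks(batch_size) {
            let batch: Vec<Vec<u32>> = chunk.iter().map(|t| tokenize(t)).collect();
            if tx.send(batch).is_err() {
                break; // consumer went away, stop producing
            }
        }
        // `tx` is dropped here, which ends the consumer's loop below.
    });

    // Stand-in for the inference side: just count the sequences.
    let mut seen = 0;
    for batch in rx {
        seen += batch.len();
    }

    producer.join().expect("producer thread panicked");
    seen
}

fn main() {
    let texts: Vec<String> = (0..10).map(|i| format!("message number {i}")).collect();
    let n = run_pipeline(texts, 4, 2);
    println!("processed {n} sequences");
}
```

In a setup like this, the consumer only stays busy if the producer side refills the queue at least as fast as batches are drained, which is the part the post says took most of the work.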
On cost. I ran the big 2B-message job on a spot p4d instance (8x A100 40GB). After filtering and deduping I had 685M raw texts. With my new engine the whole production run finishes in about half an hour. Previously I used on-demand instances for these jobs; now I've switched to spot. If AWS reclaims the box, I just rerun the job. It's roughly $7 for a half-hour run. And at least right now, spot instances are easier to get than on-demand.
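For a rough sense of scale, the figures above can be turned into throughput and unit cost. This is back-of-the-envelope arithmetic only: it treats "about half an hour" as exactly 1800 seconds and "roughly $7" as exactly $7, and the helper names are made up for the example.

```rust
/// Items processed per second over a run of `seconds`.
fn throughput_per_sec(items: f64, seconds: f64) -> f64 {
    items / seconds
}

/// Cost in USD per one million items.
fn cost_per_million(total_cost_usd: f64, items: f64) -> f64 {
    total_cost_usd / (items / 1e6)
}

fn main() {
    // Approximate inputs taken from the post; both time and cost are rounded there.
    let texts = 685e6; // texts left after filtering and deduping
    let run_seconds = 30.0 * 60.0; // "about half an hour"
    let run_cost_usd = 7.0; // "roughly $7"

    println!("{:.0} texts/sec", throughput_per_sec(texts, run_seconds));
    println!("${:.4} per million texts", cost_per_million(run_cost_usd, texts));
}
```

Under those assumptions this works out to roughly 380K texts/sec across the 8 GPUs and about one cent per million texts.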
Fair warning: it's batch-only and NVIDIA-only. You can run it either as a Docker image or natively.
My production run used some extra optimizations. With default settings you can expect ~250K msg/sec if you run the benchmark script on a p4d box.
https://github.com/Artain-AI/ignite-ms/blob/main/BENCHMARKIN...
v1.1.0 added TensorRT 11 support and 60 models, 23 of them tested on 1x and 4x A100.
ddayanov•43m ago
Happy to share details.