Currently I poll NVML's nvmlDeviceGetPowerUsage every 100 ms during inference and record peak and mean power per request. The data looks like this:
model                      mean-power range (W)   spread (W)   stdev (W)
qwen3-8b                   114.3-121.9            7.6          1.17
llama-3.1-8b-instruct      104.7-122.1            17.4         4.29
qwen2.5-1.5b-instruct      53.7-73.0              19.3         5.23
mistral-7b-instruct-v0.3   96.2-120.0             23.8         6.01
qwen2.5-7b-instruct        88.7-124.5             35.8         7.73
gemma-3-1b-it              49.4-56.7              7.3          2.13
This is per-GPU, single-card data. I don't know whether anything like per-request attribution survives at rack scale, or whether monitoring there happens entirely at the PDU/BMC level instead.
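
For context, here is a minimal sketch of the kind of polling loop described above, not necessarily the exact collection code. It assumes the pynvml bindings (the nvidia-ml-py package); PowerSampler and nvml_reader are illustrative names, and the reader is passed in as a callable so the sampler can be exercised without a GPU. nvmlDeviceGetPowerUsage reports milliwatts, so readings are converted to watts.

```python
import threading
import time
from statistics import mean


class PowerSampler:
    """Polls a power-reading callable on a background thread.

    read_mw is any zero-argument callable returning milliwatts
    (hypothetical interface, chosen so a fake reader can be swapped in).
    """

    def __init__(self, read_mw, interval_s=0.1):
        self._read_mw = read_mw
        self._interval_s = interval_s
        self._samples_w = []
        self._stop = threading.Event()
        self._thread = None

    def _run(self):
        while not self._stop.is_set():
            # Convert milliwatts to watts before storing.
            self._samples_w.append(self._read_mw() / 1000.0)
            # wait() doubles as the sleep and returns early on stop.
            self._stop.wait(self._interval_s)

    def __enter__(self):
        self._samples_w = []
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        return False

    def summary(self):
        """Return mean and peak power in watts for the sampled window."""
        if not self._samples_w:
            return None
        return {
            "mean_w": mean(self._samples_w),
            "peak_w": max(self._samples_w),
            "n": len(self._samples_w),
        }


def nvml_reader(gpu_index=0):
    """Build a reader backed by NVML (assumes pynvml is installed)."""
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    return lambda: pynvml.nvmlDeviceGetPowerUsage(handle)
```

Usage would be roughly `with PowerSampler(nvml_reader(0)) as s: run_request(prompt)` followed by `s.summary()`, where run_request stands in for whatever issues the inference call. Note that this attributes all board power in the window to one request, which only holds when requests run one at a time on the card.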