I was struggling with the cost of providing transcription to users: live transcription runs anywhere from $0.013 to $3 per hour.
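To put that price range in perspective, here is a back-of-the-envelope sketch. The 100 hours of audio per clinician per month is an assumed figure for illustration only; swap in your own usage.

```typescript
// ASSUMPTION: hours of audio per clinician per month (illustrative, not measured).
const HOURS_PER_MONTH = 100;

// Per-hour price range for live transcription quoted above, in USD.
const LOW_RATE = 0.013;
const HIGH_RATE = 3;

// Monthly cost is simply rate times hours.
function monthlyCost(ratePerHour: number, hours: number): number {
  return ratePerHour * hours;
}

const low = monthlyCost(LOW_RATE, HOURS_PER_MONTH);
const high = monthlyCost(HIGH_RATE, HOURS_PER_MONTH);

// prints "low: $1.30, high: $300.00 per clinician per month"
console.log(`low: $${low.toFixed(2)}, high: $${high.toFixed(2)} per clinician per month`);
```

Under that assumption the same workload costs roughly 230 times more at the top of the range than at the bottom.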
I ended up switching to a WASM-based setup for scribing. There are general models for listening and a medical-specific model for dictation.
This reduced our API needs to just transforming the transcript, for which I am using an open-source model instead of APIs from the hyperscalers, Claude, OpenAI, etc.
It burns through your battery, so most people use it on a desktop. A benefit, though, is that the audio never gets sent outside the clinic's computer.
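The split described here (speech-to-text on the local machine, with only the transcript handed to a second model) can be sketched roughly as follows. All names and types are hypothetical stand-ins for illustration, not the API of any real library or of this tool.

```typescript
// HYPOTHETICAL shapes: a local speech-to-text step and a text-only transform step.
type LocalTranscriber = (audio: Float32Array) => Promise<string>;
type TranscriptTransformer = (transcript: string) => Promise<string>;

async function scribe(
  audio: Float32Array,
  transcribeLocally: LocalTranscriber,
  transformTranscript: TranscriptTransformer,
): Promise<string> {
  // Step 1: the audio is consumed here, on the clinic's computer.
  const transcript = await transcribeLocally(audio);
  // Step 2: only the text is passed on; the audio buffer never reaches this step.
  return transformTranscript(transcript);
}

// Stand-in implementations so the sketch runs without a real model.
const fakeTranscriber: LocalTranscriber = async (audio) =>
  `(${audio.length} samples transcribed)`;
const fakeTransformer: TranscriptTransformer = async (text) => `NOTE: ${text}`;

// prints "NOTE: (16000 samples transcribed)"
scribe(new Float32Array(16000), fakeTranscriber, fakeTransformer).then(console.log);
```

Because the transform step only accepts a string, the privacy boundary is visible in the function signature itself.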
Try it out. I think inference needs will decline as models get better and the engineering around costs improves.
Feel free to share with your clinician friends.