Running Phi-3 via Candle for the compression step. Fast enough locally. The compressed prompts do make the cloud calls shorter, but I haven't done real token counting yet, so no hard numbers on the ratio.
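A first pass at real numbers could just run both strings through a tokenizer. Minimal sketch using the `tokenizers` crate; the `tokenizer.json` path and the example strings are placeholders, and the cloud provider's own tokenizer is what billing actually depends on, so this is only a proxy:

```rust
// Rough compression-ratio check with the `tokenizers` crate.
// Path and strings are placeholders, not project values.
use tokenizers::Tokenizer;

fn token_count(tok: &Tokenizer, text: &str) -> usize {
    tok.encode(text, false)
        .map(|enc| enc.get_ids().len())
        .unwrap_or(0)
}

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let tok = Tokenizer::from_file("tokenizer.json")?;
    let original = "Please summarize the attached report, focusing on Q3 revenue.";
    let compressed = "summarize report. focus Q3 revenue";
    let (a, b) = (token_count(&tok, original), token_count(&tok, compressed));
    println!("original: {a} tokens, compressed: {b}, ratio: {:.2}", b as f64 / a as f64);
    Ok(())
}
```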
The expansion step is the problem. Re-hydrating caveman output into readable text is harder than compressing the input, and the local model makes more mistakes there. Not sure if that's a prompting issue or just a ceiling for a model this size.
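One cheap way to separate the prompting hypothesis from the model-size one: a few-shot expansion prompt with an explicit "don't add information" constraint. The template and example below are guesses to test with, not what's currently running:

```rust
// Sketch of a few-shot expansion prompt for the re-hydration step.
// The instruction wording and the worked example are made up.
const EXPAND_TEMPLATE: &str = "Rewrite the compressed notes as clear, complete English prose.
Do not add information that is not in the notes.

notes: user want refund. order late 2 week. no tracking
prose: The user is requesting a refund because their order is two weeks late and has no tracking information.

notes: {input}
prose:";

/// Build the expansion prompt for one piece of caveman output.
fn expansion_prompt(compressed: &str) -> String {
    EXPAND_TEMPLATE.replace("{input}", compressed)
}

fn main() {
    println!("{}", expansion_prompt("ship delay. weather. eta 3 day"));
}
```

If a small model still hallucinates details with the constraint and an in-context example, that points more toward a capacity ceiling than a prompting fix.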
Also not sure this makes sense at low API volumes. The added complexity might not pay for itself in token savings.
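Rough break-even arithmetic, with every number a made-up placeholder rather than a measurement:

```rust
// Back-of-envelope savings check. All inputs are hypothetical.
fn main() {
    let calls_per_day = 200.0;        // assumed volume
    let avg_prompt_tokens = 1_500.0;  // assumed prompt size
    let compression_ratio = 0.4;      // assumed: compressed / original
    let price_per_1k_tokens = 0.01;   // assumed cloud input price, USD

    let tokens_saved = calls_per_day * avg_prompt_tokens * (1.0 - compression_ratio);
    let dollars_per_day = tokens_saved / 1000.0 * price_per_1k_tokens;
    println!("~{tokens_saved:.0} tokens/day saved, about ${dollars_per_day:.2}/day");
}
```

At those placeholder numbers it's under two dollars a day, which is the worry in concrete form: below some call volume, the pipeline is pure overhead.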