I built EchoEntry (https://echoentry.ai) – a speech-to-text API optimized specifically for digits.
The problem: Generic STT APIs are inconsistent with spoken numbers. The same utterance, "one oh five", comes back as "105" sometimes and "15" other times. For healthcare apps, warehouse systems, or IVR, this inconsistency breaks workflows.
My solution: I fine-tuned Whisper-small on spoken numbers from 1 to 999 across 5 English accents. It gets 95% accuracy on 1- to 3-digit numbers.
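One way to define accuracy for this kind of task is exact match on the digit string: a transcription counts as correct only if every digit matches the reference. The sketch below assumes that definition, which is not necessarily how the 95% figure above was measured. The transcripts are made-up examples, not real EchoEntry output.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of utterances whose transcribed digits match the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")

    # Compare digits only, so "105", " 105 " and "1-0-5" count as the same answer.
    def digits(text):
        return "".join(ch for ch in text if ch.isdigit())

    hits = sum(digits(p) == digits(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical transcripts: the second one drops the "oh" in "one oh five".
preds = ["105", "15", "42", "7"]
refs = ["105", "105", "42", "7"]
print(exact_match_accuracy(preds, refs))  # 0.75
```

Exact match is strict on purpose: for a dosage or a bin number, "15" instead of "105" is a wrong answer, not a partially right one.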
Tech stack:
- Custom Whisper model (1.7GB)
- FastAPI backend
- Deployed on 8GB Linode
- FFmpeg for audio processing
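As an illustration of the FFmpeg step, here is a minimal sketch of normalizing uploaded audio before it reaches the model. It assumes the usual Whisper input format (16 kHz mono) and that the `ffmpeg` binary is on the PATH. The function names are hypothetical and this is not necessarily how EchoEntry does it.

```python
import subprocess


def ffmpeg_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that converts src to mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file, any format ffmpeg can decode
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample (Whisper models expect 16 kHz audio)
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        dst,
    ]


def normalize_audio(src, dst):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_command(src, dst), check=True, capture_output=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with user-supplied filenames.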
Try it now (two commands, no signup):
# Download test audio
curl -O https://echoentry.ai/test_audio.wav

# Test the API
curl -X POST https://api.echoentry.ai/v1/transcribe \
  -H "X-Api-Key: demo_key_12345" \
  -F "file=@test_audio.wav;type=audio/wav"
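For anyone who would rather call the API from Python than curl, the same request can be sketched with only the standard library. The URL, header name, demo key, and form field are copied from the curl command. The helper names are hypothetical, and since the response format isn't described here, the sketch returns the raw response text instead of guessing at field names.

```python
import os
import urllib.request
import uuid

API_URL = "https://api.echoentry.ai/v1/transcribe"  # from the curl example
DEMO_KEY = "demo_key_12345"                          # from the curl example


def build_multipart(field, filename, data, content_type="audio/wav"):
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path, api_key=DEMO_KEY):
    """POST a WAV file to the API and return the raw response body as text."""
    with open(path, "rb") as f:
        body, content_type = build_multipart("file", os.path.basename(path), f.read())
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-Api-Key": api_key, "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        # Response schema is an unknown here, so no JSON parsing is attempted.
        return response.read().decode("utf-8")


# Usage, after downloading test_audio.wav:
# print(transcribe("test_audio.wav"))
```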
Currently free beta (1,000 calls/month per key). Looking for feedback on:
1. What accuracy threshold makes this production-ready for you?
2. Are there other number-heavy use cases I'm missing?
3. Would you pay for this vs. using generic STT?
Docs: https://echoentry.ai/docs.html
Happy to answer technical questions about the fine-tuning process or deployment!