I saw this tweet: "Hear me out: X but it's only voice messages (with AI transcriptions)" - and couldn't stop thinking about it.
So I built VoxConvo.
Why this exists:
AI-generated content is drowning social media. ChatGPT replies, bot threads, AI slop everywhere.
When you hear someone's actual voice - the tone, the hesitation, the excitement - you know it's real. That authenticity is what we're losing.
So I built a simple platform where voice is the ONLY option.
The experience:
Every post is voice + transcript with word-level timestamps:
Read mode: scan the transcript like normal text. Listen mode: hit play and words highlight in real time.
You get the emotion of voice with the scannability of text.
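The play-and-highlight behavior comes down to finding which word the playhead is inside at any moment. A minimal sketch, assuming a word-timestamp shape like the one below (not the actual VoxConvo code):

```typescript
// Assumed transcript shape: each word carries start/end times in seconds.
interface TimedWord {
  word: string;
  start: number;
  end: number;
}

// Binary-search the (time-ordered) word list for the word active at the
// current playback time. Returns the word's index, or -1 during a pause.
function activeWordIndex(words: TimedWord[], time: number): number {
  let lo = 0;
  let hi = words.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (time < words[mid].start) hi = mid - 1;
    else if (time >= words[mid].end) lo = mid + 1;
    else return mid;
  }
  return -1;
}
```

Wired to an audio element's `timeupdate` event, this is enough to move a highlight through the transcript as the audio plays.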
Key features:
- Voice shorts
- Real-time transcription
- Visual voice editing - click a word in the transcript to delete that audio segment: remove filler words, mistakes, pauses
- Word-level timestamp sync
- No LLM content generation
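Because every word already has timestamps, "click a word to cut it" reduces to computing which audio ranges survive the cut. A sketch under assumed shapes (the range list could then drive a server-side trim, e.g. with ffmpeg; none of these names are from the actual codebase):

```typescript
// Assumed transcript shape: each word carries start/end times in seconds.
interface TimedWord { word: string; start: number; end: number }

// A half-open slice of the original audio, in seconds.
type Span = [startSec: number, endSec: number];

// Given the word list, the set of deleted word indices, and the total
// duration, return the spans of audio to keep, in order.
function keepRanges(words: TimedWord[], deleted: Set<number>, total: number): Span[] {
  const spans: Span[] = [];
  let cursor = 0;
  words.forEach((w, i) => {
    if (!deleted.has(i)) return;
    if (w.start > cursor) spans.push([cursor, w.start]); // keep audio before the cut
    cursor = Math.max(cursor, w.end); // skip the cut word's segment
  });
  if (cursor < total) spans.push([cursor, total]); // keep the tail
  return spans;
}
```

Deleting "um" at index 0 and "uh" at index 2 of a four-word clip yields two keep-spans, which concatenate into the edited audio.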
Technical details:
Backend running on Mac Mini M1:
- TypeGraphQL + Apollo Server
- MongoDB + Atlas Search (community mongo + mongot)
- Redis pub/sub for GraphQL subscriptions
- Docker containerization, ready to scale
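One plausible shape for that stack as a compose file. This is a hypothetical sketch, not the actual config: service names and images are assumptions, and `mongodb/mongodb-atlas-local` is one published image that bundles mongod with the mongot search process for local Atlas Search.

```yaml
# Hypothetical docker-compose sketch of the described stack (not the shipped config).
services:
  api:
    build: .                             # TypeGraphQL + Apollo Server app
    ports: ["4000:4000"]
    depends_on: [mongo, redis]
  mongo:
    image: mongodb/mongodb-atlas-local   # mongod + mongot (Atlas Search) in one image
    ports: ["27017:27017"]
  redis:
    image: redis:7                       # pub/sub backend for GraphQL subscriptions
    ports: ["6379:6379"]
```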
Transcription:
- VOSK real-time transcription (gigaspeech model, about 7 GB RAM)
- WebSocket streaming for real-time partial results
- Word-level timestamp extraction plus punctuation model
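When word-level output is enabled, VOSK's final results arrive as JSON with per-word `word`/`start`/`end`/`conf` fields alongside the full `text`. A sketch of pulling the word timeline out of one such message (the field names match VOSK's documented output; the surrounding function is an assumption, not the actual server code):

```typescript
// Shape of a VOSK final result when word-level output is enabled.
interface VoskWord { word: string; start: number; end: number; conf: number }
interface VoskResult { text: string; result?: VoskWord[] }

// Parse one final-result message from the WebSocket stream into timed words.
// Partial results (streamed while the user is still speaking) carry no
// word timings, so they yield an empty list.
function parseFinal(json: string): VoskWord[] {
  const msg = JSON.parse(json) as VoskResult;
  return msg.result ?? [];
}

// Example message in the documented shape.
const sample = JSON.stringify({
  text: "hear me out",
  result: [
    { word: "hear", start: 0.36, end: 0.66, conf: 1.0 },
    { word: "me", start: 0.66, end: 0.81, conf: 0.98 },
    { word: "out", start: 0.9, end: 1.2, conf: 1.0 },
  ],
});
```

These per-word timings are exactly what the highlight sync and click-to-cut editing consume downstream.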
Storage:
- Audio files are stored in AWS S3
- Everything else is local
Why Mac Mini for MVP? Validation first, scaling later. Architecture is containerized and ready to migrate. But I'd rather prove demand on gigabit fiber than burn cloud budget.