I built an open-source, screen-free storytelling toy for my nephew, who uses a Yoto player. My sister told me he sometimes talks back to the stories, and I thought it would be cool if he could actually talk to the characters. The AI models (STT, LLM, TTS) run locally on her MacBook, so conversation transcripts are never sent to cloud models.
This is my voice AI stack:
- ESP32 (Arduino framework) to interface with the voice AI pipeline
- mlx-audio for STT (whisper) and TTS with streaming (`qwen3-tts` / `chatterbox-turbo`)
- mlx-vlm to use vision language models like Qwen3.5-9B and Mistral
- mlx-lm to use LLMs like Qwen3, Llama3.2, Gemma3
- Secure WebSockets (wss) to interface with the MacBook
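To make the flow of a single conversational turn concrete, here is a minimal sketch of the STT → LLM → TTS chain described above. The stage functions are stand-in stubs I wrote for illustration (not the project's actual code); in the real stack they would call mlx-audio for Whisper STT and streaming TTS, and mlx-lm for text generation, with audio arriving from the ESP32 over the WebSocket.

```python
# Sketch of one conversational turn in an STT -> LLM -> TTS pipeline.
# The stt/llm/tts callables are hypothetical stubs; the real project
# would back them with mlx-audio and mlx-lm running on Apple Silicon.

def run_turn(audio_bytes, stt, llm, tts):
    """Chain the three local-inference stages for one turn of dialogue."""
    transcript = stt(audio_bytes)   # speech -> text (e.g. Whisper via mlx-audio)
    reply_text = llm(transcript)    # text -> text  (e.g. Qwen3 via mlx-lm)
    reply_audio = tts(reply_text)   # text -> speech (streaming TTS via mlx-audio)
    return transcript, reply_text, reply_audio


if __name__ == "__main__":
    # Stub stages so the sketch runs without any models installed:
    stt = lambda b: "tell me a story"
    llm = lambda t: "Once upon a time, a small robot heard: " + t
    tts = lambda t: t.encode("utf-8")  # pretend this is PCM audio

    transcript, reply, audio = run_turn(b"\x00\x01", stt, llm, tts)
    print(transcript)
    print(reply)
```

Keeping the stages as plain callables makes it easy to swap models (Qwen3 vs. Llama 3.2, `qwen3-tts` vs. `chatterbox-turbo`) without touching the transport layer.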
This repo supports inference on Apple Silicon chips (M1 through M5), but I'm planning to add Windows support soon. Would love to hear your thoughts on the project.