Key innovations:

Natural Monologue: FLM-Audio abandons word-level timestamps and proposes the "Natural Monologue" mechanism instead. This preserves the inherent strengths of LLMs in coherent generation and instruction following, and it effectively addresses the context-dependent pronunciation of certain words, especially numbers.

Dual Training Paradigm: Training spans two major stages and four sub-stages, simulating ASR, TTS, and interactive dialogue tasks. The Post-Training stage equips the model with the basic abilities of "listening" and "speaking". The Supervised Fine-Tuning (SFT) stage then shapes its dialogue and full-duplex interaction capabilities.
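
To make the contrast behind Natural Monologue concrete, here is a toy sketch of two ways to lay out a text stream next to audio frames. It is an illustration under simplifying assumptions, not FLM-Audio's actual implementation: the `<pad>` token, the function names, the frame indices, and the one-token-per-word simplification are all hypothetical. In the word-level layout every word needs a timestamp and is pinned to the frame where it is spoken. In the monologue layout the sentence is emitted as one contiguous run and needs no per-word timing.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical filler token for frames that carry no text


def word_level_stream(words: List[Tuple[str, int]], n_frames: int) -> List[str]:
    """Word-level alignment: pin each word to the frame where it starts.

    `words` is a list of (word, start_frame) pairs, so every word needs
    its own timestamp. Frames without a word are filled with PAD.
    """
    stream = [PAD] * n_frames
    for word, start_frame in words:
        stream[start_frame] = word
    return stream


def monologue_stream(
    words: List[Tuple[str, int]], n_frames: int, start_frame: int = 0
) -> List[str]:
    """Monologue-style layout: emit the sentence as one contiguous run.

    Per-word timestamps are ignored; only the sentence start matters.
    """
    text = [word for word, _ in words]
    if start_frame + len(text) > n_frames:
        raise ValueError("sentence does not fit in the available frames")
    stream = [PAD] * n_frames
    stream[start_frame:start_frame + len(text)] = text
    return stream


if __name__ == "__main__":
    # Made-up timestamps, in frames, for a four-word utterance.
    example = [("call", 0), ("me", 3), ("at", 5), ("nine", 8)]
    print(word_level_stream(example, 10))
    print(monologue_stream(example, 10))
```

In the first layout the sentence is fragmented by padding, so the text stream no longer looks like ordinary LLM text. In the second it stays an uninterrupted sentence, which is the property the announcement credits with preserving coherence and instruction following.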

Resource Links:
Paper: https://arxiv.org/abs/2509.02521
Model: https://huggingface.co/CofeAI/FLM-Audio
GitHub: cofe-ai/flm-audio (FLM-Audio is an audio-language subversion of RoboEgo/FLM-Ego -- an omnimo…)

The model is now open-sourced, and we look forward to your use and feedback.