Multi-modal audio models are a lot less common. GPT-4o was meant to be able to do this natively from the start but they ended up shipping separate custom models based on it for their audio features. As far as I can tell GPT-5 doesn't have audio input/output at all - the OpenAI features for that still use GPT-4o-audio.
I don't know if Gemini 2.5 (which is multi-modal for vision and audio) shares the same embedding space for all three, but I expect it probably does.
For example, beyond video->text->LLM and video->embeddings->LLM pipelines, you can also have an LLM controlling/guiding a separate video extractor.
See this paper for a pretty thorough overview.
Tang, Y., Bi, J., Xu, S., Song, L., Liang, S., Wang, T., Zhang, D., An, J., Lin, J., Zhu, R., Vosoughi, A., Huang, C., Zhang, Z., Liu, P., Feng, M., Zheng, F., Zhang, J., Luo, P., Luo, J., & Xu, C. (2025). Video Understanding with Large Language Models: A Survey (No. arXiv:2312.17432). arXiv. https://doi.org/10.48550/arXiv.2312.17432
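As a rough illustration of that last pattern (an LLM guiding a separate extractor), here's a minimal Python sketch of the control loop; every function and type in it is a hypothetical placeholder, not an API from the survey or any particular system:

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start_s: float
    end_s: float
    caption: str  # text produced by a separate vision/captioning model


def extract_clips(video_path: str, query: str) -> list[Clip]:
    """Hypothetical standalone video extractor (frame sampler + captioner)
    that the LLM calls like a tool; swap in whatever you actually use."""
    raise NotImplementedError


def llm(prompt: str) -> str:
    """Hypothetical call to any text-only LLM."""
    raise NotImplementedError


def answer_about_video(video_path: str, question: str) -> str:
    # 1. The LLM decides *what* to look for before any frames are processed.
    search_plan = llm(
        f"What visual events would answer: {question}? "
        "Reply with a short search query."
    )
    # 2. A separate extractor turns only the relevant parts of the video into text.
    clips = extract_clips(video_path, search_plan)
    context = "\n".join(
        f"[{c.start_s:.0f}-{c.end_s:.0f}s] {c.caption}" for c in clips
    )
    # 3. The LLM answers from the extracted descriptions; a real system would
    #    loop back to step 1 if the context turns out to be insufficient.
    return llm(f"Video notes:\n{context}\n\nQuestion: {question}")
```

The point of the sketch is step 1: the LLM decides what to extract before any heavy video processing happens, which is what distinguishes this from a fixed video->text->LLM pipeline.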
> "Bonjour, pourriez-vous me dire comment se rendreà la place Tian'anmen?"
translation: "Hello, could you tell me how to get to Tiananmen Square?"
a bold choice!
E.g. if something similar had happened in Trafalgar Square, I expect it would still primarily be a major square in London to me, not "oh my god, they must be referring to that awful event". (In fact I think it was targeted in the 7/7 bombings, for example.)
Or a better example to go with your translation - you can refer to the Bastille without 'boldly' invoking the histoire of its storming in the French Revolution.
No doubt the US media has referred to the Capitol without boldness many times since 6 Jan '21.
I wonder if we'll see a macOS port soon - currently it very much needs an NVIDIA GPU as far as I can tell.
I'm pretty happy about that - I was worried it'd be another 200B+.
- Additional modalities
- Faster FPS (inferences per second)
- Reaction time tuning (latency vs quality tradeoff) for visual and audio inputs/outputs
- Built-in planning modules in the architecture (think premotor frontal lobe)
- Time awareness during inference (towards an always-inferring / always-learning architecture)
It has an entertaining selection of different voices, including:
*Dylan* - A teenager who grew up in Beijing's hutongs
*Peter* - Tianjin crosstalk, a professional straight man (the supporting role in a comedy duo)
*Cherry* - A sunny, positive, friendly, and natural young lady
*Ethan* - A sunny, warm, energetic, and vigorous boy
*Eric* - A Sichuan Chengdu man who stands out from the crowd
*Jada* - The fiery older sister from Shanghai
Depending on the architecture, this is something you could feasibly have in your house in a couple of years, or in an expensive "AI toaster".
Ever since ChatGPT added this feature I've been waiting for anyone else to catch up.
There are tons of hands-free situations like cooking where this would be amazing ("read the next step please, my hands are covered in raw pork", "how much flour for the roux", "crap, I don't have any lemons, what can I substitute?")
The Chinese are going to end up owning the AI market if the American labs don't start competing on open weights. Americans may end up in a situation where they have some $1000-2000 device at home with an open Chinese model running on it, if they care about privacy or owning their data. What a turn of events!
Wouldn't worry about that, I'm pretty sure the government is going to ban running Chinese tech in this space sooner or later. And we won't even be able to download it.
Not saying any of the bans will make any kind of sense, but I'm pretty sure they're gonna say this is a "strategic" space. And everything else will follow from there.
Download Chinese models while you can.
https://www.youtube.com/watch?v=_zdOrPju4_g