But I can say the same about tokenization. LLMs first convert groups of characters into tokens, use those to generate new tokens, and then convert the tokens back to characters. That's not real understanding! If LLMs are so smart, we should be able to skip the tokenization step.
I recall someone telling me once that up to 90% of communication can be non-verbal, so when an LLM sticks to just text, it's only getting 10% of the data.
There are big libraries of old speeches.
Simply capture all current radio/TV transmissions and train on that (we've already established that copyright doesn't apply to LLM training, right?)
Q: What is 2+2?
A: The warranty for your car has expired...
So while having the closed captions saves some of the work, there is probably much more needed to get everything lined up.
But I'm absolutely not an expert at all. In fact, this is the first time I've ever even thought about it!
See Section 4.2 in the Moshi paper: https://arxiv.org/pdf/2410.00037
That's why residual vector quantization is a useful technique - using multiple codebooks to quantize a single timeslice, where each level quantizes the residual left over by the previous one. You can also quantize a signal at different frequencies.
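A minimal sketch of that residual idea (toy code, not the actual Mimi implementation; the codebook sizes here are made up):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x with a stack of codebooks; each level quantizes the
    residual left over by the previous level."""
    residual, codes = x, []
    for cb in codebooks:                                  # cb: (num_codes, dim)
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]                     # next level sees the residual
    return codes

def rvq_decode(codes, codebooks):
    return sum(cb[i] for i, cb in zip(codes, codebooks))  # sum of per-level codewords

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(32, 8)) for _ in range(2)]  # 2 levels, 32 entries, dim 8
x = rng.normal(size=8)
x_hat = rvq_decode(rvq_encode(x, codebooks), codebooks)   # coarse-to-fine reconstruction
```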
There are samples towards the end of the post of their LLM trained on their Mimi audio codec.
I read the article and confess some of the modeling parts were above my comprehension. But I would like to add that as an audio engineer, the "key question" you describe is solved, just not applied to transformer models (?).
An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, specific words, etc quite fluently. And with tools like Melodyne - which already quantize audio semantically - they can identify (and manipulate) pitch and formants as well, turning an O vowel into an E vowel, or changing the inflection of a phrase (up-speak vs down-speak, for example).
I don't know how to apply this to a neural codec, but it seems like it shouldn't be that hard (that's my naivete coming through).
As an experienced DAW author, I very, very much doubt this.
What can be done relatively easily is to "see", or rather "follow along in", the waveform while listening to the audio. But I read your claim as being that someone could look at the waveform (which is already decimated from the original) and identify words or phonemes without hearing the associated audio. I am extremely skeptical that there is anyone anywhere in the world who can do this.
I feel like there should be a model that can do much of this for me, but I haven't really looked into it, ironically due to laziness, but also because I edit across multiple tracks at this stage and I'm afraid to feed the model an already mixed stereo track. I'm curious why you still do it manually, if you still do, and whether you've looked into alternatives.
Hopefully using Ardour's "Ripple - Interview" mode :))
DAWs' rendered waveforms have so little information that such identification is likely impossible even in theory. Telling apart plosives and vowels maybe, but not much more than that.
I work with phoneticians and they can (sometimes) read even words from suitably scaled spectrograms, but that's a lot more information than in waveforms.
Did Claude Shannon not answer this question in 1948? You need at least 1 bit per 6dB of dynamic range for each symbol and 2B symbols per second where B is the bandwidth of the signal.
Compression techniques are all about getting below that fundamental limit but it's not like this is an unsolved problem. Or is 1kbaud too much for LLMs?
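For a concrete back-of-the-envelope figure (my numbers, not the parent's):

```python
# Nyquist/Shannon-style sizing for telephone-band speech (assumed values).
bandwidth_hz = 4_000                        # telephone-band speech
dynamic_range_db = 48                       # ~8 bits at 6 dB per bit
bits_per_symbol = dynamic_range_db / 6
symbols_per_sec = 2 * bandwidth_hz          # 2B symbols per second
print(bits_per_symbol * symbols_per_sec)    # 64000 bits/s, i.e. classic 64 kbps PCM
```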
I attempted some similar VQ-VAE work, but trying to tokenize rendered text instead. I was curious if I could make a visual LLM working on 10 pt rendered fonts, and I also tried using PDF sources. The basic idea was to do what more advanced diffusion image models can do when they generate images of text: make a specific image-text diffusion model to do completions. Further, I wondered if I could embed things like document type and language, so you could have a latent representation of text more abstracted than current dictionary tokenizers. Learned a lot, and I thought it was all beautifully displayed in this post.
Obviously working directly with audio is vastly more complex than with text.
But it is very exciting to see how part of making LLMs work natively with speech, is finding a codec that is maximally efficient at encoding speech.
I even have to wonder if, at some point, we ultimately create a popular voice codec usable with LLMs based not on the Fourier transform or similar, but rather on some kind of set of physical parameters describing vocal cord shape, tongue position, throat/chest/mouth shape, etc.
I can imagine such a model being arrived at statistically (determining the necessary number of parameters), and then almost becoming "hard-coded" as a standard since human anatomy doesn't change much there, beyond certain ranges.
I think it's called formant speech encoding, and it would be interesting if LLMs wind up massively advancing that field. Since I think historically it's had to do more with speech synthesis than audio compression.
Huh? How?
> like this work which fails to understand how writing isn’t just a derivative of speech.
The whole point of the article is that writing isn't just a derivative of speech. It's in the introduction.
That being said, having a more constrained model can also lead to some really cool stuff. The DDSP paper learns how to control a synthesizer to mimic instruments: https://arxiv.org/abs/2001.04643
You could probably do something similar for a speech model. The result would not sound as good, but you could get away with far fewer parameters, because much of the modelling work is done by the assumptions you put in.
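As a toy illustration of that trade-off (my sketch, not the DDSP code): the "decoder" below is just additive synthesis, so all a network would have to learn is to predict f0 and the harmonic amplitudes.

```python
import numpy as np

sr = 16_000
t = np.arange(sr) / sr                          # one second of audio
f0 = np.full_like(t, 220.0)                     # would be predicted per frame by a network
amps = np.array([1.0, 0.5, 0.25, 0.1])          # per-harmonic amplitudes, also predicted

phase = 2 * np.pi * np.cumsum(f0) / sr          # integrate f0 to get phase
audio = sum(a * np.sin((k + 1) * phase) for k, a in enumerate(amps))
audio /= np.abs(audio).max()                    # the synth is fixed; only f0/amps are learned
```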
Compare also KokoroTTS, a tiny TTS that's so tiny because it uses a handcrafted system to turn text into phonemes, and then just synthesizes from those phonemes: https://huggingface.co/spaces/hexgrad/Kokoro-TTS
Each MP3 frame is entirely self-contained and can completely reconstruct a few tens of milliseconds of original audio. It does not require other frames to do this. I think this is the most important element. At 128kbps CBR, each MP3 frame is ~418 bytes and covers ~26 milliseconds of time. This is a reduction of 10-11x over the raw PCM waveform. MP3 is also designed to eliminate the information that humans don't seem to care about.
I don't know if it's possible to use 400 byte tokens in a transformer model, but I would be very compelled to try.
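For reference, the numbers above fall straight out of the MPEG-1 Layer III framing formulas (my arithmetic):

```python
bitrate = 128_000                                    # bits/s, CBR
sample_rate = 44_100                                 # Hz
samples_per_frame = 1_152
frame_bytes = 144 * bitrate // sample_rate           # ~417 bytes (418 with padding)
frame_ms = 1_000 * samples_per_frame / sample_rate   # ~26.1 ms of audio per frame
pcm_bitrate = sample_rate * 16 * 2                   # 16-bit stereo PCM, ~1.41 Mbps
print(frame_bytes, frame_ms, pcm_bitrate / bitrate)  # ~11x reduction over raw PCM
```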
I'm not sure I follow. If there is a set of tokens that the average human cannot perceive, why wouldn't we want to eliminate them from the search space? Who is the target audience for this model?
The inaudible sounds add context and additional data points about how the audible sounds relate to each other.
Presumably you want to train on as rich a set of data as possible, even if some of that data is redundant or irrelevant when it comes to human perception.
But you'd need that information to be digestible for a neural network. Otherwise, you'll have a very hard time getting that adapter to work.
As a rule: neural networks love highly redundant data, and hate highly compressed data at their inputs. Tokenized text good, GZIP compressed bytestream bad. But who knows, really. It's a rule of thumb, not a mathematical law. So you could have some success getting that MP3-based adapter to work. I've seen weirder shit work.
The simple, elegant approach of training convolutional neural networks (CNNs) directly from RGB pixels has enjoyed overwhelming empirical success. But can more performance be squeezed out of networks by using different input representations? In this paper we propose and explore a simple idea: train CNNs directly on the blockwise discrete cosine transform (DCT) coefficients computed and available in the middle of the JPEG codec. Intuitively, when processing JPEG images using CNNs, it seems unnecessary to decompress a blockwise frequency representation to an expanded pixel representation, shuffle it from CPU to GPU, and then process it with a CNN that will learn something similar to a transform back to frequency representation in its first layers. Why not skip both steps and feed the frequency domain into the network directly? In this paper we modify libjpeg to produce DCT coefficients directly, modify a ResNet-50 network to accommodate the differently sized and strided input, and evaluate performance on ImageNet. We find networks that are both faster and more accurate, as well as networks with about the same accuracy but 1.77x faster than ResNet-50.
https://proceedings.neurips.cc/paper_files/paper/2018/file/7...
I suspect MP3 is also a good idea
The OG neural audio codec SoundStream (whose first author is Neil, now at Kyutai) can sound decent at 3kbps, whereas MP3 typically has around 128kbps, as you say. Interestingly, it was originally developed for audio compression for Google Meet, not for LLMs. Today's neural codecs have even better compression.
The more modern MP3 alternative is Opus, which can work ok at 12kbps, but it's still less efficient than neural audio codecs. However, these traditional codecs are a lot less CPU-hungry, so they have that going for them.
Why RVQ though, rather than using the raw VAE embedding?
If I compare rvq-without-quantization-v4.png with rvq-2-level-v4.png, the quality seems oddly similar, but the former takes a 32-sized vector, while the latter takes two 32-sized (one-hot) vectors (2 = number of levels, 32 = number of quantization cluster centers). Isn't that more?
But categorical distributions are better for modelling. It's a little difficult to explain here without using diagrams. The intuition is that if you try to have a model predict the next embedding and not the next token, you can't model multimodal distributions - you'll end up predicting the mean of the possible continuations and not the mode, which is not what you want.
Check out Section 5.3 and Figure 6 from PixelRNN, where they discuss this phenomenon: https://arxiv.org/pdf/1601.06759
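A toy version of that failure mode (my example, not from the paper): if the next value is equally likely to be -1 or +1, the MSE-optimal regression output is 0, a value that never actually occurs, whereas a categorical model can put probability mass on both modes.

```python
import numpy as np

targets = np.array([-1.0, 1.0])                    # two equally likely continuations
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((candidates[:, None] - targets) ** 2).mean(axis=1)
print(candidates[np.argmin(mse)])                  # ~0.0: the mean, not either mode
```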
At the bottom of the blog, I link two articles that do make continuous embeddings work. One of them is the Kyutai paper Continuous Audio Language Models: https://arxiv.org/abs/2509.06926
Speech perception builds upon frequencies and is based on "formants" - the frequency bands shaped by the vocal tract resonances created by articulation when the speech was generated. More specifically, most speech information is contained in formant changes, since these correspond to articulatory changes. There are also other articulatory artifacts in speech, such as the onsets of speech energy corresponding to plosives ("puh", "buh"), and the high frequencies generated by fricatives like "sss".
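To make that concrete (my toy example): a short-time spectrum exposes exactly this kind of structure, which is what the MP3 bytestream discussed below hides.

```python
import numpy as np

sr = 16_000
t = np.arange(sr) / sr
# Crude vowel-like signal: a 120 Hz harmonic series with extra energy near
# two "formants" (the 6th and 10th harmonics, ~720 Hz and ~1200 Hz).
audio = sum((1.5 if k in (6, 10) else 0.2) * np.sin(2 * np.pi * 120 * k * t)
            for k in range(1, 30))

frame = audio[:512] * np.hanning(512)              # one ~32 ms analysis window
spectrum = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(512, d=1 / sr)
print(freqs[spectrum.argsort()[-4:]])              # peaks cluster around the boosted harmonics
```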
One problem with embedding MP3 frames as audio tokens would be that although MP3 compression is based on frequency representation, you've then got quantization, Huffman encoding and the MP3 frame structure all on top of that, so the frame as a whole is going to be more of a black box. Presumably a transformer could still use MP3 frames to predict the text transcription, or any arbitrary encoding of speech audio for that matter (similar to how an LLM can predict text from Base64 representation, or vice versa), but it's certainly not making it easier if the input is obfuscating the frequency components and formants etc. that correspond to the generating process.
Not having direct access to the frequency/formant information is also going to make generalization more difficult since that is based around formant structure and changes. When articulating the same word, the specific formant frequencies will differ between individuals, primarily based on vocal tract length, but humans have no problem generalizing across these and understanding speech from different individuals. I'm not sure if an LLM only trained to predict MP3 speech from, say, male adults, would necessarily have generalized enough to also be able to recognize child speech or that from a speech synthesizer.
I wonder how much of that is LLMs being bad, and how much is LLMs being (over) aligned not to do it.
AFAIK, ChatGPT's voice mode had to have a lot of safeguards put on it to prevent music generation, accent matching (if you sound Indian, it shouldn't start sounding Indian too), and assuming ethnicity / biasing based on accents.
It doesn't seem that impossible to me that some of these behaviors have been aligned out of these models out of an abundance of caution.
I don't find the article's argument that this is due to tokenisation convincing.
> This is likely because they’re trained on a lot of data generated synthetically with text-to-speech and/or because understanding the tone of the voice (apparently) doesn’t help the models make more accurate predictions.
Part of the reason can also be synthetic data: if you fine-tune on data generated from text via a text-to-speech system, the tone of the voice doesn't carry any information, so the model learns to ignore it.
Yes, frustratingly we don't have good speech-to-text (STT/ASR) to transcribe such differences.
I recently fine-tuned a TTS* to be able to emit laughter, and hunting for transcriptions that include non-verbal sounds was the hardest part of it. Whisper and other popular transcription systems will ignore sighs, sniffs, laughs, etc., and can't detect mispronunciations either.
Read some Wittgenstein and Goodman, but especially Derrida who calls this logocentrism.
I think even for text models, "streams" could be useful. Perhaps if the LLM sees too long a pause after explaining something and asking a question, it could interject a "do you need help?" or something. Pure chat GPTs don't have that ability.
Maybe a transformer could be running in parallel, but at a much lower frequency, where the linear model feeds it "summary" tokens once per second, whose information would mostly be "text", but also some hint of emotion and other cues. Then the output of this could be fed back to the linear model so that it would know what it was saying and with what emotion. Basically, the transformer would be the low-frequency, long-range context thinker (and feeler), and the linear model would translate that to and from phonetics.
They'd be trained in parallel, so those transformer tokens would attain meaning at training time, not something that would have to be pre-defined. So it'd still be purely phonetic e2e, no direct translation to text. It could even end up being a good way to compress text for LLMs, since low-value words might have smaller representation in the token.
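A very rough sketch of that two-rate loop (just my interpretation of the idea, with made-up shapes and plain numpy standing in for the actual models):

```python
import numpy as np

rng = np.random.default_rng(0)
frame_dim, summary_dim, frames_per_sec = 16, 32, 80

W_sum = rng.normal(size=(frame_dim, summary_dim)) * 0.1    # fast model -> summary token
W_ctx = rng.normal(size=(summary_dim, frame_dim)) * 0.1    # slow "thinker" -> fast model

thinker_state = np.zeros(summary_dim)
for second in range(3):
    frames = rng.normal(size=(frames_per_sec, frame_dim))  # stand-in for phonetic features
    context = thinker_state @ W_ctx                        # slow model conditions the fast one
    fast_out = np.tanh(frames + context)                   # fast, per-frame processing
    summary = np.tanh(fast_out.mean(axis=0) @ W_sum)       # one summary token per second
    thinker_state = 0.9 * thinker_state + summary          # slow model accumulates context
```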
Probably would never reach the level of text based LLMs for logic and code and such, but that somewhat parallels humans anyway; it's pretty hard to explain an algorithm in detail in plain conversation.
The recent MiMo-Audio bunches tokens into "patches" of four timesteps and has the model predict those. [2]
[1] https://arxiv.org/abs/2005.00341
[2] https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audi...