Is there a way to subscribe to these blog posts for auto-notification?
Obligatory xkcd: https://xkcd.com/1053/
Also, the training dataset is highly imbalanced and Spanish is the most common class, so the model predicts it as a sort of default when it isn't confident -- this could lead to artifacts in the reduced 3D space.
Turkish and Persian seem to be the nearest neighbors.
I'd suggest training a little less on audio books.
Then I was able to apply UMAP + HDBSCAN to this dataset and it produced a 2D plot of all my books. Later I put the discovered topic back in the db and used that to compute tf-idf for my clusters from which I could pick the top 5 terms to serve as a crude cluster label.
It took about 20 to 30 hours to finish all these steps and I was very impressed with the results. I could see my cookbooks clearly separated from my programming and math books. I could drill in and see subclusters for baking, bbq, salads etc.
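For anyone curious about the labeling step, here is a minimal sketch of the "tf-idf per cluster, take the top 5 terms" idea in plain Python. It is not the code I actually ran: the function name `top_terms_per_cluster`, the whitespace tokenizer, and the plain `log(N / df)` weighting are all assumptions for illustration, and it takes in-memory lists rather than reading from the db. Each cluster is treated as one big document, and label `-1` is skipped because that is what HDBSCAN uses for noise points.

```python
import math
from collections import Counter, defaultdict


def top_terms_per_cluster(docs, labels, k=5):
    """Rank each cluster's terms by tf-idf, treating a cluster as one document.

    docs   -- list of text strings (hypothetical: e.g. book titles/descriptions)
    labels -- cluster label per doc; -1 (HDBSCAN noise) is ignored
    """
    # Pool token counts per cluster (naive lowercase whitespace tokenizer).
    counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        if label == -1:
            continue
        counts[label].update(text.lower().split())

    # Document frequency: in how many clusters does each term occur?
    df = Counter()
    for cluster_counts in counts.values():
        df.update(cluster_counts.keys())

    n_clusters = len(counts)
    result = {}
    for label, cluster_counts in counts.items():
        total = sum(cluster_counts.values())
        scores = {
            # tf = share of the cluster's tokens; idf = log(N / df).
            # A term present in every cluster gets idf 0 and drops to the bottom.
            term: (n / total) * math.log(n_clusters / df[term])
            for term, n in cluster_counts.items()
        }
        # Sort by score descending, then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[label] = ranked[:k]
    return result


if __name__ == "__main__":
    docs = [
        "bread flour yeast oven book",
        "sourdough bread flour",
        "python code function book",
        "python code compiler",
    ]
    labels = [0, 0, 1, 1]
    print(top_terms_per_cluster(docs, labels, k=2))
```

With real data you would want a proper tokenizer and a stop-word list, but even this crude version tends to push terms shared by every cluster (like "book") to the bottom, which is the whole point of the idf part.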
Currently I'm putting it into a two-container Docker Compose file: a base PostgreSQL container plus a Python container I'm working on.
> By clicking or tapping on a point, you will hear a standardized version of the corresponding recording. The reason for voice standardization is two-fold: first, it anonymizes the speaker in the original recordings in order to protect their privacy. Second, it allows us to hear each accent projected onto a neutral voice, making it easier to hear the accent differences and ignore extraneous differences like gender, recording quality, and background noise. However, there is no free lunch: it does not perfectly preserve the source accent and introduces some audible phonetic artifacts.
> This voice standardization model is an in-house accent-preserving voice conversion model.
When I play the different recordings, which I understand have the accent "re-applied" to a neutral voice, it's very difficult to hear any actual differences in vowels, let alone prosody. Like if I click on "French", there's something vaguely different, but it's quite... off. It certainly doesn't sound like any native French speaker I've ever heard. And after all, a huge part of accent is prosody. So I'm not sure what vocal features they're considering as "accent"?
I'm also curious what the three dimensions are supposed to represent. Obviously there's no objective answer, but if they've listened to all the samples, surely they could explain the main contrasting features each dimension seems to encode?
Using the accent guesser, I have a Swedish accent. Danish and Australian English follow as a close tie.
It's not just the AI. Non-native speakers of English often think I have a foreign accent, too. Often they guess at English or Australian. Like I must have been born there and moved here when I was younger, right? I've also been asked if I was Scandinavian.
Interestingly I've noticed that native speakers never make this mistake. They sometimes recognize that I have a speech impediment but there's something about how I talk that is recognized with confidence as a native accent. That leads me to the (probably obvious) inference that whatever it is that non-native speakers use to judge accent and competency, it is different from what native speakers use. I'm guessing in my case, phrase-length tone contour. (Which I can sort of hear, and presumably reproduce well, even if I have trouble with the consonants.)
AI also really has trouble with transcribing my speech. I noticed that as early as the '90s with early speech recognition software. It was completely unusable. Even now AI transcription has much more trouble with me than with most people. Yet aside from a habit of sometimes mumbling, I'm told I speak quite clearly, by humans.
Hearing different things, as it were.
dereknelson•7h ago