Along similar lines, it would be useful to map a speaker's vowels in vowel-space (and likewise for consonants?) to compare native to non-native speakers.
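For intuition, vowels are conventionally mapped by their first two formant frequencies (F1/F2). A minimal sketch of the comparison idea, using made-up formant values — real ones would come from a formant tracker such as Praat:

```python
import math

# Made-up average (F1, F2) formant frequencies in Hz for three corner
# vowels; real values would come from a formant tracker such as Praat.
NATIVE = {"i": (280, 2250), "a": (850, 1220), "u": (310, 870)}
LEARNER = {"i": (330, 2050), "a": (780, 1350), "u": (400, 1000)}

def vowel_distance(a, b):
    """Euclidean distance between two (F1, F2) points, in Hz."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Per-vowel deviation of the learner from the native reference.
for v in NATIVE:
    print(f"/{v}/: {vowel_distance(NATIVE[v], LEARNER[v]):.0f} Hz")
```

A more careful version would normalize for vocal-tract length (e.g. Lobanov z-scoring of formants) before comparing speakers, since raw formant frequencies differ between, say, adults and children with identical accents.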
I can't wait until something like this is available for Japanese.
... unless they had access to a native speaker and/or a vocal coach? While an automated Henry Higgins is nifty, it's not something humans haven't been able to do themselves.
You can measure this by mutual intelligibility with other accent groupings.
As a learning platform that provides instruction to our users, we do need to set some kind of direction in our pedagogy, but we 100% recognize that there isn't just 1 American English accent, and there's lots of variance.
But then I read their privacy policy. They want permission to save all of my audio interactions for all eternity. It's so sad that I will never try out their (admittedly super cool) AI tech.
Yeah, I can opt out. By not using any voice-related feature in their voice training app.
A suggestion and some surprise: I’m surprised by your assertion that there’s no clustering. I see the representation shows no clustering, and believe you that there is therefore no broad high-dimensional clustering. I also agree that the demo where Victor’s voice moves closer to Eliza’s sounds more native.
But how can it be that you can show directionality toward “native” without clustering? I would read this as a problem with the embedding, not a feature. Perhaps there are some lower-dimensional sub-axes that do encode what sort of accent someone has?
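One way to reconcile the two observations: a “nativeness” direction can exist as a linear sub-axis even when the point cloud shows no cluster structure. A toy sketch with synthetic data (nothing here reflects BoldVoice's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for accent embeddings: 500 points spread smoothly
# in 32 dimensions (no clusters), with a hidden "nativeness" axis w_true.
d = 32
X = rng.normal(size=(500, d))
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
score = X @ w_true + 0.1 * rng.normal(size=500)  # noisy nativeness label

# A linear probe (least squares) recovers the hidden direction even
# though the point cloud itself has no cluster structure.
w_hat, *_ = np.linalg.lstsq(X, score, rcond=None)
w_hat /= np.linalg.norm(w_hat)
print("cosine similarity with true axis:", float(w_hat @ w_true))
```

The probe recovers the axis almost perfectly, yet a 2-D visualization of `X` would show a single featureless blob — so “directional without clustered” is geometrically consistent.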
Suggestion for the BoldVoice team: if you’d like to go viral, I suggest you dig into American idiolects — two that are hard not to talk about / opine on / retweet are AAVE and Gay male speech (not sure if there’s a more formal name for this, it’s what Wikipedia uses).
I’m in a mixed-race family, and we spent a lot of time playing with ChatGPT’s AAVE abilities, which have, I think sadly, been completely nerfed over the releases. Chat seems to have no sense of shame when it says that speaking like one of my kids is harmful; I imagine the well-intentioned OpenAI folks were thinking sort of the opposite when they cut it out. It seems to have a list of “okay” and “bad” idiolects baked in - for instance, it will give you a thick Irish accent, a Boston accent, a NY/Bronx accent, but no Asian/SE Asian accents.
I like the idea of an idiolect manager, something that could help me move my speech toward or away from a given idiolect. Similarly, England is a rich minefield of idiolects, from Scouse to highly posh.
I’m guessing you guys are aimed at the call-center market based on your demo, but there could be a lot more applications! Voice coaches in Hollywood (the good ones) charge hundreds of dollars per hour, so there’s a valuable if small market out there for much of this. Thanks for the demo and write-up. Very cool.
We're back to "AI safety actually means brand safety": inept pushback against being made into an automated racism factory with their name on it.
I sometimes see content on social media encouraging people to sound more native or improve their accent. But IMO it's perfectly OK to have an accent, as long as the speech meets some baseline of intelligibility. (So Victor needs to work on "long" but not "days".) I've even come across people who are trying to mimic a native accent but lose intelligibility, where they'd sound better with their foreign accent. (An example I've seen is a native Spanish speaker trying to imitate the American accent's intervocalic T and D, and I can't understand them. A Spanish /t/ or /d/ would be different from most English accents, but be way more understandable.)
Indeed, Victor would likely receive a personalized lesson and practice on the NG sound in the app.
It’s also perfectly fine to want to sound like a native speaker - whether it’s because they’re self-conscious, think it will benefit them in some way, or simply want to feel like they’re speaking “correctly”.
Sorry to pick on you; it’s just amazing to me how sensitive we are about “inclusivity”, to the point where we almost discourage people from wanting to fit in.
That group has a vast range of accents, but it's believable that that range occupies an identifiable part of the multi-dimensional accent space, and has very little overlap with, for example, beginner ESL students from China.
Even between native speakers, I bet you could come up with some measure of centrality and measure accent strength as distance from that. And if language varieties exist on a continuum, there must be some point on that continuum where you are no longer speaking English but, say, Scots or Frisian or Nigerian Creole instead. Accents close to those points are objectively stronger.
But there is a lot of freedom in how you measure centrality - if you weight by number of speakers, you might expect to get some mid-American or mid-Atlantic accent, but wind up with the dialect of semi-literate Hyderabad call centre workers.
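As a toy illustration of how much the weighting choice matters, here is a sketch with entirely invented positions and speaker counts (the labels are placeholders, not measurements):

```python
import numpy as np

# Invented 2-D "accent space" positions for a few accent groups, with
# equally invented speaker counts used as weights.
positions = np.array([
    [0.0, 0.0],  # mid-American
    [1.5, 0.2],  # Southern US
    [0.3, 2.0],  # RP English
    [3.0, 1.0],  # a very large non-native group
])
speakers = np.array([200e6, 60e6, 40e6, 300e6])

# Speaker-weighted centroid: the largest group drags it toward itself.
centroid = (positions * speakers[:, None]).sum(axis=0) / speakers.sum()

# "Accent strength" as each group's distance from that centroid.
strength = np.linalg.norm(positions - centroid, axis=1)
print(centroid, strength)
```

With these numbers the centroid lands nowhere near the mid-American point, which is exactly the commenter's caveat: the “center” you measure distance from is an artifact of how you weight the population.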
> Even between native speakers, I bet you could come up with some measure of centrality and measure accent strength as a distance from that
Is that what BoldVoice is actually doing? At least from what the article says, it is measuring the strength of the user's American English accent (maybe GenAm?), and there is no discussion of any user choice of a native accent to target.
No, I don't think it is doing that, I'm just taking issue with cccpurcell, who seems to believe that any definition of accent strength is chauvinistic.
I’ve been using it for a few months, and I can confirm it’s working.
Also, the US writing convention falls short here - the one that says to “put the dot inside the string.”
Crazy. Rational people “put the dot after the string”. No spelling corrector should change that.
I assume that, with enough training, we could get similarly accurate guesses of a person's linguistic history from their voice data.
Obviously it would be extremely tricky for lots of people. For instance, many people think I sound English or Irish: I grew up in France, the child of American parents who both went to Oxford and spent 15 years in England. I wouldn't be surprised, though, if a well-trained model could do much better on my accent than "you sound kinda Irish."
I had a forensic linguistics TA during college who was able to identify the island in southeast Asia one of the students grew up on, and where they moved to in the UK as a teenager before coming to the US (if I am remembering this story right).
From what I gather, there are a lot of clues in how we speak that most brains edit out when parsing language.
This kind of speech clustering has been possible for years - the exciting point with their model here is that it's focused on accents alone. Here's a video of mine from 2020 that demonstrated this kind of voice clustering in the Mozilla TTS repo (sadly the code broke and was dropped in a refactoring). Bokeh made it possible to click directly on points in a cluster and have them play:
https://youtu.be/KW3oO7JVa7Q?si=1w-4pU5488WxYL3l
note: take care when listening as the audio level varies a bit (sorry!)
Just had an employee at our company start expensing BoldVoice. Being able to be understood more easily is a big deal for global remote employees.
(Note - I am a small investor in BoldVoice)
I’d consider making this feature available free with super low friction, maybe no signup required, to get some viral traction.
This is offensive :))
If so—and if you want to transfer-learn new downstream models from embeddings—then seems to me you are onto a very effective way of doing data augmentation. It's expensive to do data augmentation on raw waveforms since you always need to run the STFT again; but if you've pre-computed & cached embeddings and can do data augmentation there, it would be super fast.
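A rough sketch of what embedding-space augmentation could look like, using random arrays in place of real cached embeddings; Gaussian jitter and embedding-space mixup are just two plausible choices, not anything BoldVoice has described:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for precomputed, cached accent embeddings (batch x dim).
emb = rng.normal(size=(64, 256)).astype(np.float32)

def augment(batch: np.ndarray, noise_std=0.05, mixup_alpha=0.2) -> np.ndarray:
    """Cheap augmentation directly in embedding space: Gaussian jitter
    plus mixup between random pairs. No waveform or STFT involved."""
    noisy = batch + noise_std * rng.normal(size=batch.shape).astype(batch.dtype)
    lam = rng.beta(mixup_alpha, mixup_alpha, size=(len(batch), 1)).astype(batch.dtype)
    partners = batch[rng.permutation(len(batch))]
    return lam * noisy + (1 - lam) * partners

aug = augment(emb)
print(aug.shape)  # same shape as the cached batch
```

In a real training loop, mixup would blend the labels with the same lambda; and whether embedding-space perturbations transfer as well as waveform-level augmentation is an empirical question, not a given.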
I’d be really interested to play with this tool and see what it thinks of my accent. Can it tell where I grew up? Can it tell what my parents’ native languages are (not English!)?
A free tool like this would be great marketing for this company.
treetalker•6h ago
That said, I found the recording of Victor's speech after he practiced with the accent-neutralized recording of his own voice to be far less intelligible than his original recording.
Looking forward to seeing the developments in this particular application.
ilyausorov•4h ago
Interesting to note that we're also developing a separate measure of intelligibility, which will give a sense of how intelligible, as opposed to how accented, someone's speech is.