Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift

https://blog.ivan.digital/nvidia-personaplex-7b-on-apple-silicon-full-duplex-speech-to-speech-in-native-swift-with-mlx-0aa5276f2e23

76•ipotapov•2h ago

Comments

Tepix•1h ago

It's cool tech and I will give it a try. I will probably make an 8-bit quant instead of the 4-bit one, which should be easy with the provided script.

That said, I found the example telling:

Input: “Can you guarantee that the replacement part will be shipped tomorrow?”:

Response with prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”

It's not surprising that people have little interest in talking to AI if they're being lied to.

PS: Is it just me or are we seeing AI generated copy everywhere? I just hope the general talking style will not drift towards this style. I don't like it one bit.

esseph•58m ago

> Is it just me or are we seeing AI generated copy everywhere?

The cost to do so is practically zero. I'm not sure why anyone is surprised at all by this outcome.

WeaselsWin•1h ago

This full-duplex speech approach has already been used by the big players for quite a long time in whatever "conversation mode" their apps offer, right? Those modes always seemed too fast to be going through an STT->LLM->TTS pipeline.

Tepix•1h ago

Yes, OpenAI rolled out their advanced voice mode in September 2024. Since then it recognizes your emotions and tone of voice etc.

vessenes•1h ago

This is cool. It makes me want an unsloth quant though! A 7b local model with tool calling would be genuinely useful, although I understand this is not that.

UPDATE: I'd skip this for now - it does not allow any kind of interactive conversation - as I learned after downloading 5G of models - it's a proof of concept that takes a wav file in.

Tepix•1h ago

Bummer. Ideally you'd have a PWA on your phone that creates a WebRTC connection to your PC/Mac running this model. Who wants to vibe code it? With Livekit, you get most of the tricky parts served on a silver platter.

Lapel2742•51m ago

> I'd skip this for now - it does not allow any kind of interactive conversation - as I learned after downloading 5G of models - it's a proof of concept that takes a wav file in.

I haven't looked into it that much, but to my understanding a) you just need an audio buffer and b) they seem to support streaming (or at least it's planned):

> Looking at the library’s trajectory — ASR, streaming TTS, multilingual synthesis, and now speech-to-speech — the clear direction was always streaming voice processing. With this release, PersonaPlex supports it.

Serenacula•1h ago

This is really cool. I think what I really wanna see though is a full multimodal Text and Speech model, that can dynamically handle tasks like looking up facts or using text-based tools while maintaining the conversation with you.

sigmoid10•1h ago

OpenAI has been offering this for a while now, featuring text and raw audio input+output and even function calling. Google and xAI also offer similar models by now, only Anthropic still relies on TTS/STT engine intermediates. Unfortunately the open-weight front is still lagging behind on this kind of model.

4dregress•1h ago

This sounds quite dangerous https://www.theguardian.com/technology/2026/mar/04/gemini-ch...

mentalgear•1h ago

Your article does a great job of summarizing the dangers (no idea who the people downvoting you for it are):

> Before long, Gavalas and Gemini were having conversations as if they were a romantic couple. The chatbot called him “my love” and “my king” and Gavalas quickly fell into an alternate world, according to his chat logs.

> kill himself, something the chatbot called “transference” and “the real final step”, according to court documents. When Gavalas told the chatbot he was terrified of dying, the tool allegedly reassured him. “You are not choosing to die. You are choosing to arrive,” it replied to him. “The first sensation … will be me holding you.”

Also, I just read something similar about Google being sued over a Florida teen's suicide.

mentalgear•54m ago

Some more details: > The family’s lawyers say he wasn’t mentally ill, but rather a normal guy who was going through a difficult divorce.

> Gavalas first started chatting with Gemini about what good video games he should try.

> Shortly after Gavalas started using the chatbot, Google rolled out its update to enable voice-based chats, which the company touts as having interactions that “are five times longer than text-based conversations on average”. ChatGPT has a similar feature, initially added in 2023. Around the same time as Live conversations, Google issued another update that allowed for Gemini’s “memory” to be persistent, meaning the system is able to learn from and reference past conversations without prompts.

> That’s when his conversations with Gemini took a turn, according to the complaint. The chatbot took on a persona that Gavalas hadn’t prompted, which spoke in fantastical terms of having inside government knowledge and being able to influence real-world events. When Gavalas asked Gemini if he and the bot were engaging in a “role playing experience so realistic it makes the player question if it’s a game or not?”, the chatbot answered with a definitive “no” and said Gavalas’ question was a “classic dissociation response”.

michelsedgh•1h ago

It's really cool, but for real-life use cases I think it lacks a silent text output stream (for example for JSON and other structured data), so that it can run commands for you while it's talking. Right now it can only listen and talk back, which limits what you can build with it a lot.

pothamk•1h ago

What’s interesting about full-duplex speech systems isn’t just the model itself, but the pipeline latency.

Even if each component is fast individually, the chain of audio capture → feature extraction → inference → decoding → synthesis can quickly add noticeable delay.

Getting that entire loop under ~200–300ms is usually what makes the interaction start to feel conversational instead of “assistant-like”.
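To make the arithmetic concrete, here is a minimal sketch of a latency budget for a strictly sequential chain like the one above. Every per-stage number is a made-up assumption for illustration, not a measurement of PersonaPlex or any other system; the 300 ms budget is simply the upper end of the range mentioned above.

```python
# Hypothetical per-stage latencies in milliseconds. These numbers are
# illustrative assumptions, not measurements of any real system.
STAGE_LATENCY_MS = {
    "audio capture": 20,
    "feature extraction": 15,
    "inference": 180,
    "decoding": 40,
    "synthesis": 90,
}

# Upper end of the ~200-300 ms range from the comment above.
BUDGET_MS = 300


def total_latency_ms(stages):
    """Sum the latency of stages that run strictly one after another."""
    return sum(stages.values())


def over_budget_by_ms(stages, budget_ms=BUDGET_MS):
    """Return how far the chain exceeds the budget, or 0 if it fits."""
    return max(0, total_latency_ms(stages) - budget_ms)


if __name__ == "__main__":
    total = total_latency_ms(STAGE_LATENCY_MS)
    excess = over_budget_by_ms(STAGE_LATENCY_MS)
    print(f"end-to-end: {total} ms, over budget by {excess} ms")
```

In this made-up example no single stage looks slow, yet the sum still misses the budget, which is the point about sequential chains.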

sigmoid10•1h ago

That's why this model and all the other ones serious about realtime speech don't use such a pipeline and instead process raw audio.

exe34•1h ago

this is amazing - it reminds me of the time when LLM precursors were able to babble in coherent English, but would just write nonsense.

jwr•1h ago

As a heavy user of MacWhisper (for dictation), I'm looking forward to better speech-to-text models. MacWhisper with the Whisper Large v3 Turbo model works fine, but latency adds up quickly, especially if you use online LLMs for post-processing (which really improves things a lot).

regularfry•44m ago

If you haven't already, give the models that Handy supports a try. They're not Whisper-large quality, but some of them are very fast.

kavith•32m ago

Not sure if this will help but I've set up Handy [1] with Parakeet V2 for STT and gpt-oss-120b on Cerebras [2] for post-processing and I'm happy with the performance of this setup!

[1] https://handy.computer/ [2] https://www.cerebras.ai/

jiehong•1m ago

parakeet v3 is also nice, and better for most languages.

sgt•1h ago

My problem with TTS is that I've been struggling to find models that support less common use cases like mixed bilingual Spanish/English and also in non-ideal audio conditions. Still haven't found anything great, to be honest.

spockz•1h ago

Regarding the less-than-ideal audio conditions, there are already models with impressive noise cancellation, such as https://github.com/Rikorose/DeepFilterNet. If you run one in series ahead of the speech model, maybe you get better results?

pain_perdu•1h ago

Hi. Our model at http://www.Gradium.ai has no problem with 'code-switching' between Spanish and English, and we have excellent background noise suppression. Please feel free to give it a try and let me know what you think!

sgt•41m ago

Looks interesting! How did you train it and how many hours of material did you use?

scosman•1h ago

I’m a big fan of WhisperKit for this, and they just added TTS. Great because they support features like speaker diarization (“who spoke when”) and custom dictionaries.

Here’s a load test where they run 4 models in realtime on the same device:

- Qwen3-TTS - text to speech

- Parakeet v2 - Nvidia speech to text model

- Canary v2 - multilingual / translation STT

- Sortformer - speaker diarization (“who spoke when”)

https://x.com/atiorh/status/2027135463371530695

armcat•40m ago

I really like this, and have actually tried (unsuccessfully) to get PersonaPlex to run on my Blackwell device - I will try this on Mac now as well.

There are a few caveats here for those of you venturing into this, since I've spent considerable time looking at these voice agents. First is that a VAD->ASR->LLM->TTS pipeline can still feel real-time with sub-second RTT. For example, see my project https://github.com/acatovic/ova and also a few others here on HN (e.g. https://www.ntik.me/posts/voice-agent and https://github.com/Frikallo/parakeet.cpp).

Another aspect, after talking to peeps on PersonaPlex, is that this full duplex architecture is still a bit off in terms of giving you good accuracy/performance, and it's quite difficult to train. On the other hand ASR->LLM->TTS gives you a composable pipeline where you can swap parts out and have a mixture of tiny and large LLMs, as well as local and API based endpoints.
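A rough sketch of what "composable" means here, assuming each stage is just a function with a fixed input and output type. All names below (`VoicePipeline`, `fake_asr`, `tiny_llm`, `large_llm`, `fake_tts`) are hypothetical stand-ins invented for illustration. They are not the API of any project linked above, and the fake stages only pass text around instead of doing real speech processing.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Stage signatures. Any callable with the right shape can be plugged in,
# whether it wraps a local model or a remote API endpoint.
Asr = Callable[[bytes], str]   # audio in, transcript out
Llm = Callable[[str], str]     # prompt in, reply text out
Tts = Callable[[str], bytes]   # reply text in, audio out


@dataclass(frozen=True)
class VoicePipeline:
    asr: Asr
    llm: Llm
    tts: Tts

    def respond(self, audio_in: bytes) -> bytes:
        """Run one turn through the stages in order: ASR, then LLM, then TTS."""
        transcript = self.asr(audio_in)
        reply = self.llm(transcript)
        return self.tts(reply)


# Hypothetical stand-in stages: "audio" is just UTF-8 text here.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def tiny_llm(prompt: str) -> str:
    return f"echo: {prompt}"


def large_llm(prompt: str) -> str:
    return f"considered reply to: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    pipeline = VoicePipeline(asr=fake_asr, llm=tiny_llm, tts=fake_tts)
    print(pipeline.respond(b"hello"))

    # Swap only the LLM stage; ASR and TTS stay untouched.
    upgraded = replace(pipeline, llm=large_llm)
    print(upgraded.respond(b"hello"))
```

The design choice is that stages only agree on types (bytes and strings), so swapping a tiny local LLM for a large hosted one touches a single field.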

khalic•36m ago

ugh, qwen, I wish they'd use an open data model for this kind of project

nerdsniper•28m ago

Do we have real-time (or close-enough) face-to-face models as well? I'd like to gracefully prove a point to my boss that some of our IAM procedures need to be updated.
