Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift

Comments

By armcat 2026-03-05 9:23 (7 replies)

I really like this, and have actually tried (unsuccessfully) to get PersonaPlex to run on my Blackwell device. I will try this on Mac now as well.

There are a few caveats here, for those of you venturing into this, since I've spent considerable time looking at these voice agents. First is that a VAD->ASR->LLM->TTS pipeline can still feel real-time with sub-second RTT. For example, see my project https://github.com/acatovic/ova and also a few others here on HN (e.g. https://www.ntik.me/posts/voice-agent and https://github.com/Frikallo/parakeet.cpp).

Another aspect, after talking to peeps on PersonaPlex, is that this full-duplex architecture is still a bit off in terms of giving you good accuracy/performance, and it's quite difficult to train. On the other hand, ASR->LLM->TTS gives you a composable pipeline where you can swap parts out and have a mixture of tiny and large LLMs, as well as local and API-based endpoints.
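
To illustrate what "composable" means here, a minimal Python sketch follows. It is not taken from any of the linked projects; the class `VoicePipeline` and the stand-in functions (`fake_asr`, `tiny_local_llm`, `big_api_llm`, `fake_tts`) are hypothetical names made up for illustration, and the VAD is assumed to have already cut out one utterance.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is just a callable, so any stage can be swapped independently.
ASR = Callable[[bytes], str]      # audio -> transcript
LLM = Callable[[str], str]        # transcript -> reply text
TTS = Callable[[str], bytes]      # reply text -> audio


@dataclass
class VoicePipeline:
    asr: ASR
    llm: LLM
    tts: TTS

    def respond(self, utterance_audio: bytes) -> bytes:
        """Run one turn on an utterance that the VAD has already segmented."""
        transcript = self.asr(utterance_audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Hypothetical stand-ins so the sketch runs without any models installed.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def tiny_local_llm(text: str) -> str:
    return f"local: {text}"


def big_api_llm(text: str) -> str:
    return f"api: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(asr=fake_asr, llm=tiny_local_llm, tts=fake_tts)
local_audio = pipeline.respond(b"hello")

# Swapping the LLM does not touch the ASR or TTS stages.
pipeline.llm = big_api_llm
api_audio = pipeline.respond(b"hello")
```

The point of the design is that the LLM slot can hold a tiny local model or a remote API call without the ASR or TTS stages noticing.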

By nowittyusername 2026-03-05 10:05 (3 replies)

I've been working on building my own voice agent as well for a while and would love to talk to you and swap notes if you have the time. I have many things I'd like to discuss, but mainly right now I'm trying to figure out how a full-duplex pipeline like this could fit into an agentic framework. I've had no issues with the traditional STT > LLM > TTS pipeline, as that naturally lends itself to agentic behavior like tool use, advanced context management systems, RAG, etc. I separate the human-facing agent from the subagent to reduce latency and context bloat, and it works well. While I am happy with the current pipeline, I do always keep an eye out for full-duplex solutions, as they look interesting and feel more dynamic because of the architecture, but every time I visit them I can't wrap my head around how you would even begin to implement that as part of a voice agent. I mean, sure, you have text input and output channels in some of these things, but even then, with its own context limitations, it feels like they could never be anything more than a fancy mouthpiece. But this feels like I'm possibly looking at this from ignorance. Anyway, I would love to talk on Discord with a like-minded fella. Cheers.

By ilaksh 2026-03-05 11:43

For my framework, since I am using it for outgoing calls, I am thinking I will add a tool command call_full_duplex(number, persona_name) that gets PersonaPlex warmed up and connected and then pauses the streams, then connects the SIP and attaches the IO audio streams to the call and returns to the agent. Then I would send the Deepgram and PersonaPlex text in as messages during the conversation and tell the agent to call a hangup() command when PersonaPlex says goodbye or gets off track, otherwise just wait(). It could also use speak() commands to take over with TTS if necessary, maybe with a shutup() command first. This needs a very fast and smart model for the agent monitoring the call.

By pettyjohn 2026-03-05 16:48 (1 reply)

what's your use case and what specific LLMs are you using?

I'm using stt > post-trained models > tts for the education tool I'm building, but full STS would be the end-game. e-mail and discord username are in my profile if you want to connect!

By nowittyusername 2026-03-05 20:43

sent!

By armcat 2026-03-05 19:41

Sure, feel free to reach out, just check my profile!

By scotty79 2026-03-05 15:13

I got PersonaPlex to run on my laptop (a beefy one) just by following the step-by-step instructions in their GitHub repo.

The uncanny thing is that it reacts to speech faster than a person would. It doesn't say useful stuff and there's no clear path to plugging it into smarter models, but it's worth experiencing.

By _magiic_kards 2026-03-05 18:28

+1 on this pipeline! You can use a super small model to perform an immediate response and a structured output that pipes into a tool call (which may be a call to a "more intelligent" model) or initiates skill execution. Having this async function with a fast response (TTS) to the user + tool call simultaneously is awesome.
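
A rough Python sketch of that pattern, using `asyncio` to start the spoken reply and the tool call at the same time. Everything here is a stand-in: `speak_filler`, `run_tool`, the `get_weather` tool, and the hard-coded small-model output are hypothetical and only show the control flow, not any particular model or TTS API.

```python
import asyncio


async def speak_filler(text: str) -> str:
    """Stand-in for TTS playback of the small model's immediate reply."""
    await asyncio.sleep(0.01)   # pretend to play audio
    return f"spoken: {text}"


async def run_tool(call: dict) -> str:
    """Stand-in for a slower tool call (possibly a call to a larger model)."""
    await asyncio.sleep(0.05)   # pretend this is the slow part
    return f"{call['tool']}({call['args']['city']}) -> sunny"


async def handle_turn(user_text: str) -> tuple[str, str]:
    # Hypothetical small-model output: a quick reply plus a structured tool call.
    # A real model would derive this from user_text; here it is hard-coded.
    small_model_output = {
        "say": "Let me check that.",
        "call": {"tool": "get_weather", "args": {"city": "Oslo"}},
    }
    # Start both at once so the user hears something while the tool runs.
    spoken, result = await asyncio.gather(
        speak_filler(small_model_output["say"]),
        run_tool(small_model_output["call"]),
    )
    return spoken, result


spoken, result = asyncio.run(handle_turn("What's the weather in Oslo?"))
```

Because both coroutines are awaited together, the turn takes roughly as long as the slower of the two rather than their sum.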

By andreadev 2026-03-05 19:33 (1 reply)

The framing in this thread is full-duplex vs composable pipeline, but I think the real architecture is both running simultaneously — and this library is already halfway there.

The fact that qwen3-asr-swift bundles ASR, TTS, and PersonaPlex in one Swift package means you already have all the pieces. PersonaPlex handles the "mouth" — low-latency backchanneling, natural turn-taking, filler responses at RTF 0.87. Meanwhile a separate LLM with tool calling operates as the "brain", and when it returns a result you can fall back to the ASR+LLM+TTS path for the factual answer. taf2's fork (running a parallel LLM to infer when to call tools) already demonstrates this pattern. It's basically how humans work — we say "hmm, let me think about that" while our brain is actually retrieving the answer. We don't go silent for 2 seconds.

The hard unsolved part is the orchestration between the two. When does the brain override the mouth? How do you prevent PersonaPlex from confidently answering something the reasoning model hasn't verified? How do you handle the moment a tool result contradicts what the fast model already started saying?

By AlexeyBelov 2026-03-06 6:36 (2 replies)

LLM slop.

By cpill 2026-03-06 20:09

Don't be so hard on yourself :P

By andreadev 2026-03-06 8:08 (1 reply)

Which part specifically?

By AlexeyBelov 2026-03-07 5:01 (1 reply)

The part where it's in all your comments.

By andreadev 2026-03-07 5:31

You are wrong but I am not going to keep going back and forth.

By robotswantdata 2026-03-05 17:03

+1, agree. I still prefer a composable pipeline architecture for voice agents. The flexibility of switching LLMs for cost optimisation or quality is great for scaled use cases.

By biomcgary 2026-03-05 17:20 (1 reply)

Do you know if any of these multi-stage approaches can run on an 8GB M1 Air?

By armcat 2026-03-05 19:40

They should! If you take Parakeet (ASR), add Qwen 3.5 0.8B (LLM) and Kokoro 82M (TTS), that's about 1.2GB + 1.6GB + 164MB, so ~3.5GB (with overhead) at FP16. If you use INT8 or 4-bit versions, then you are getting down to 1.5-2GB of RAM.

And you can always for example swap out the LLM for GPT-5 or Claude.
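
The arithmetic behind that estimate can be sketched in a few lines of Python. The parameter counts are assumptions backed out of the sizes quoted above (for example, 1.2GB at 2 bytes per parameter implies roughly 0.6B parameters for Parakeet); they are not official figures, and runtime overhead such as activations and KV cache is deliberately left out.

```python
# Parameter counts implied by the sizes quoted above (assumptions, not official figures).
PARAMS = {
    "Parakeet (ASR)": 0.6e9,
    "Qwen 3.5 0.8B (LLM)": 0.8e9,
    "Kokoro 82M (TTS)": 82e6,
}

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weights_gb(precision: str) -> float:
    """Total weight size in GB (1 GB = 1e9 bytes), excluding runtime overhead."""
    return sum(PARAMS.values()) * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weights_gb(precision):.2f} GB of weights")
```

Weights alone come to about 2.96GB at FP16, 1.48GB at INT8, and 0.74GB at 4-bit, so the quoted ~3.5GB and 1.5-2GB figures leave some room for overhead on top.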

By vessenes 2026-03-05 8:32 (5 replies)

This is cool. It makes me want an Unsloth quant though! A 7B local model with tool calling would be genuinely useful, although I understand this is not that.

UPDATE: I'd skip this for now - it does not allow any kind of interactive conversation - as I learned after downloading 5G of models - it's a proof of concept that takes a wav file in.

By taf2 2026-03-05 12:22 (2 replies)

I forked it and added tool calling by running another LLM in parallel to infer when to call tools. It works well for me to toggle lights on and off.

Code updates here https://github.com/taf2/personaplex
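
For readers wondering what a parallel tool-inference loop might look like, here is a minimal Python sketch. It is not taken from the fork: `infer_tool_call` is a keyword-matching stand-in for the side LLM, `set_lights` is a hypothetical tool, and the transcripts are assumed to come from a separate ASR pass over the user's audio.

```python
import queue
import threading
from typing import Callable, Optional

tool_log: list[str] = []


def set_lights(on: bool) -> None:
    """Hypothetical tool; a real one would talk to a smart-home API."""
    tool_log.append("lights on" if on else "lights off")


def infer_tool_call(transcript: str) -> Optional[Callable[[], None]]:
    """Stand-in for the side LLM: map a transcript to a tool call, or None."""
    text = transcript.lower()
    if "light" in text and " on" in text:
        return lambda: set_lights(True)
    if "light" in text and " off" in text:
        return lambda: set_lights(False)
    return None


def tool_worker(transcripts: "queue.Queue[Optional[str]]") -> None:
    """Runs beside the speech model so tool inference never blocks the audio loop."""
    while True:
        transcript = transcripts.get()
        if transcript is None:      # sentinel: conversation ended
            break
        call = infer_tool_call(transcript)
        if call is not None:
            call()


transcripts: "queue.Queue[Optional[str]]" = queue.Queue()
worker = threading.Thread(target=tool_worker, args=(transcripts,))
worker.start()

# In a real setup these lines would arrive from the separate ASR pass.
for line in ["turn the lights on please", "how are you?", "lights off now"]:
    transcripts.put(line)
transcripts.put(None)
worker.join()
```

The speech model keeps talking on its own thread of execution while the worker decides, per utterance, whether anything should actually happen.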

By ttul 2026-03-05 15:54 (1 reply)

Cool approach. So basically the part that needs to be realtime - the voice that speaks back to you - can be a bit dumb so long as the slower-moving genius behind the curtain is making the right things happen.

By taf2 2026-03-05 20:56

Yes, exactly. One part I did not like is that we also have to transcribe separately, because it does not provide what the person said, only what the AI said.

By cpill 2026-03-06 17:39

What do you mean by "infer"? How does the LLM get anything out of this as input?

By anluoridge 2026-03-05 10:58

It provides a voice assistant demo in /Examples/PersonaPlexDemo, which allows you to try turn-based conversations. Real-time conversion is not implemented tho.

By Lapel2742 2026-03-05 9:12 (1 reply)

> I'd skip this for now - it does not allow any kind of interactive conversation - as I learned after downloading 5G of models - it's a proof of concept that takes a wav file in.

I haven't looked into it that much, but to my understanding a) you just need an audio buffer and b) they seem to support streaming (or at least it's planned).

> Looking at the library’s trajectory — ASR, streaming TTS, multilingual synthesis, and now speech-to-speech — the clear direction was always streaming voice processing. With this release, PersonaPlex supports it.

By isodev 2026-03-05 10:38 (1 reply)

> You just need an audio buffer

Doing that alone right on macOS using Swift is an exercise in pain that even coding bots aren't able to get right the first time :)

By reactordev 2026-03-05 11:47 (2 replies)

I beg to differ. My agent just one-shotted a MicrophoneBufferManager in Swift when asked.

Complete with AVFoundation and a tap for the audio buffer.

It really is trivial.

By hirvi74 2026-03-05 17:44 (1 reply)

I've also had great results with using LLMs to pry into Apple's private and undocumented APIs. I've been impressed with the lack of hallucinations for C/C++ and Obj-C functions.

I can attest that the quality in this domain has greatly improved over the years too. I am not always a fan of the quality of the Swift code that my LLM produces, but I am impressed that what is produced often works in one shot as well. The quality also is not that important to me because I can just refactor the logic myself, and often prefer to do it anyway. I cannot hold an LLM to any idiosyncrasies that I do not share with it.

By reactordev 2026-03-05 18:58

Exactly. Even if it’s a skeleton, as long as it does “The Thing”, I’m happy. I can always refactor into something useful.

By Anonbrit 2026-03-05 13:46 (1 reply)

Any chance of pushing it to GitHub? My Swift knowledge could be written out on an oversized beer coaster currently, so I'm still collecting useful snippets.

By reactordev 2026-03-05 17:36

https://gist.github.com/gabereiser/cd8c67262717afd2539dc9c3d...

By Tepix 2026-03-05 8:45 (1 reply)

Bummer. Ideally you'd have a PWA on your phone that creates a WebRTC connection to your PC/Mac running this model. Who wants to vibe code it? With LiveKit, you get most of the tricky parts served on a silver platter.

By reactordev 2026-03-05 11:49

This is the way. This is something I’m working on but for other applications. WebRTC voice and data over LiveKit or Pion to have conversations.

By scotty79 2026-03-05 15:17

This is interactive:

https://github.com/NVIDIA/personaplex

By KaiserPister 2026-03-05 13:31 (4 replies)

I am strongly put off by the LLM writing in this piece. It makes me question the quality of the project before even attempting a download.

Who would put effort into building this only to compose a low effort puff piece?

By neurostimulant 2026-03-05 15:51 (3 replies)

But isn't it normal for people who work on AI stuff to use LLMs for everything? They are very enthusiastic about AI so naturally they'll use it on everything they can.

By DrammBA 2026-03-05 16:42

Sometimes I wish they just posted the prompt, not everything has to go through an LLM blender before posting.

By KaiserPister 2026-03-05 19:26

That’s a bit reductive. Some do, others don’t. I do a lot of AI development, and building. But I value the act of writing for clarifying my thoughts. And I value other people’s time when reading my writing.

By moffkalast 2026-03-05 19:30

They are the ones who should know best when not to use it.

By ttul 2026-03-05 15:51 (2 replies)

What gives you the sense that the piece was written by an LLM? I would agree that the diagrams have some of the artifacts common in Nano Banana output, but what tips you off about the text?

By rush340 2026-03-05 16:11

Em dashes in every other sentence. I've never seen an actual person do that. The language in general reads exactly like it's written by an LLM:

"The blah blah didn't just start as blah. It started as blah..." "First came blah -- blah blah blah" "And now: blah"

It's a distinctly AI writing style. I do wonder if we'll get to a point where people start writing this way just because it's what they're used to reading. Or maybe LLMs will get better at not writing like this before that happens.

By tverbeure 2026-03-05 16:43

I'm sick and tired of the "No..., no ..., (just) ..." LLM construction. It's everywhere now; you can't open a social media platform without getting bombarded by it. This article is full of it.

I get it, I should focus just on the content and not on whether an LLM was used to write it, but the reaction to it is visceral now.

By 0xbadcafebee 2026-03-05 17:26 (1 reply)

I wasn't put off by it. I read the article, got all the information I needed, it was interesting and informative. (In fact, I find the human-written ones more often annoying; most people are not good at writing, and are apt to create huge walls of text, whereas the AI is biased towards making the information easy to consume)

By KaiserPister 2026-03-05 19:24

I do agree it is one of those “if I had more time, I would write a shorter letter” situations.

But in this case the piece is wordier than a bad human writer would be. If they want to use AI for writing, so be it, but at least include “concisely” in the prompt.

By chromehearts 2026-03-05 13:36 (1 reply)

I hate those AI generated graphs / charts more than the text

By giancarlostoro 2026-03-05 14:20

You on about this article or other articles? I don't mind AI-generated images to a degree, but with charts I might start to worry.