Show HN: LemonSlice – Upgrade your voice agents to real-time video

2026-01-27 17:55 · 133 points · 133 comments

Hey HN, we're the co-founders of LemonSlice (try our HN playground here: https://lemonslice.com/hn). We train interactive avatar video models. Our API lets you upload a photo and immediately jump into a FaceTime-style call with that character. Here's a demo: https://www.loom.com/share/941577113141418e80d2834c83a5a0a9

Chatbots are everywhere and voice AI has taken off, but we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.

We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.

Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.

How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.
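
The causal-vs-bidirectional difference can be sketched with a toy frame-level attention mask (illustrative NumPy only, not the actual model code): under a causal mask, each frame attends only to itself and earlier frames, so finished frames can be emitted immediately.

```python
import numpy as np

def frame_attention_mask(num_frames: int, causal: bool) -> np.ndarray:
    """Return a boolean mask where entry [i, j] means frame i may attend to frame j."""
    if causal:
        # Lower-triangular: each frame sees only itself and earlier frames,
        # so a frame can be finalized and streamed as soon as it is denoised.
        return np.tril(np.ones((num_frames, num_frames), dtype=bool))
    # Bidirectional: every frame attends to every other frame, so no frame
    # is final until the whole clip is generated -- no streaming possible.
    return np.ones((num_frames, num_frames), dtype=bool)

causal = frame_attention_mask(4, causal=True)
print(causal.astype(int))
```

In the causal case, row 0 (the first frame) depends on nothing after it, which is exactly the property that lets generation run ahead of playback.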

From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
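
A minimal sketch of why sliding-window attention removes the memory bottleneck (toy mask math, not the production kernel): each frame attends to at most `window` recent frames, so per-frame attention cost and cached state stay constant no matter how long the video runs.

```python
import numpy as np

def sliding_window_mask(num_frames: int, window: int) -> np.ndarray:
    """Causal mask where frame i attends only to frames in [i - window + 1, i]."""
    i = np.arange(num_frames)[:, None]
    j = np.arange(num_frames)[None, :]
    # Causal (j <= i) AND within the lookback window (j > i - window).
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, window=3)
# Each row has at most `window` True entries, independent of video length,
# whereas full attention would grow linearly with the number of frames.
print(mask.sum(axis=1))
```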

And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.
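
The rolling KV cache pairs naturally with sliding-window attention. Here's a shape-free sketch of the idea (a real cache holds key/value tensors per layer; this toy version just shows the eviction policy):

```python
from collections import deque

class RollingKVCache:
    """Fixed-capacity cache keeping keys/values for only the last `window`
    frames. Older entries are evicted automatically, so memory stays
    constant for infinite-length generation."""

    def __init__(self, window: int):
        self.window = window
        self.entries = deque(maxlen=window)  # (key, value) pairs, one per frame

    def append(self, key, value):
        self.entries.append((key, value))  # deque drops the oldest when full

    def keys(self):
        return [k for k, _ in self.entries]

cache = RollingKVCache(window=3)
for frame_id in range(10):
    cache.append(f"k{frame_id}", f"v{frame_id}")
print(cache.keys())  # only the last 3 frames survive
```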

We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.

Looking forward to your feedback!

EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)

*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.


Comments

  • By anigbrowl 2026-01-28 9:35 · 2 replies

    Absolutely Do Not Want.

    EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)

    Sure, that kind of thing is great fun. But photorealistic avatars are gonna be abused to hell and back and everyone knows it. I would rather talk to a robot that looks like a robot, i.e. C-3PO. I would even chat with scary skeleton terminator. I do not want to talk with convincingly-human-appearing terminator. Constantly checking whether any given human appearing on a screen is real or not is a huge energy drain on my primate brain. I already find it tedious with textual data; doing it on realtime video imagery consumes considerably more energy.

    Very impressive tech, well done on your engineering achievement and all, but this is a Bad Thing.

    • By echelon 2026-01-28 14:23 · 3 replies

      The dichotomy of AI haters and AI dreamers is wild.

      OP, I think this is the coolest thing ever. Keep going.

      Naysayers have some points, but nearly every major disruptive technology has had downsides that have been abused. (Cars can be used for armed robbery. Steak knives can be used to murder people. Computers can be used for hacking.)

      The upsides of tech typically far outweigh the downsides. If a tech is all downsides, then the government just bans it. If computers were so bad, only government labs and facilities would have them.

      I get the value in calling out potential dangers, but if we do this we'll wind up with the 70 years where we didn't build nuclear reactors because we were too afraid. As it turns out, the dangers are actually negligible. We spent too much time imagining what would go wrong, and the world is now worse for it.

      The benefits of this are far more immense.

      While the world needs people who look at the bad in things, we need far more people who dream of the good. Listen to the critiques, allow them to aid in your safety measures, but don't listen to anyone who says the tech is 100% bad and should be stopped. That's anti-nuclear rhetoric, and it's just not true.

      Keep going!

      • By zestyping 2026-01-29 16:09 · 1 reply

        The primary purpose of generating real-time video of realistic-looking talking people is deception. The explicit goal is to make people believe that they're talking to a real person when they aren't.

        It's on you to identify the "immense" benefits that outweigh that explicit goal. What are they?

        • By lcolucci 2026-02-09 5:38

          I don't think that's the primary purpose of realistic interactive avatars, any more than deception is the purpose of CGI. Deception requires intent to mislead — if users know they're talking to an avatar, it's not deception no matter how realistic. Just as moviegoers aren't "deceived" by CGI. It's an experience they opt into.

          As for benefits: language learning with avatars, scalable corporate training, accessible education for kids, personalized coaching, and certainly entertainment, which has real value too.

      • By lcolucci 2026-01-28 17:49

        Well put - and thanks, we'll keep building. Still chasing this level of magic: https://youtu.be/gL5PgvFvi8A?si=I__VSDqkXBdBTVvB&t=173 Not to mention language tutors, training experiences, and more.

      • By anigbrowl 2026-01-28 22:15

        I am not an AI hater, I use it every day. I made specific criticisms of why I think photorealistic realtime AI avatars are a problem; you've posted truisms. Please tell me what benefits you expect to reap from this.

    • By nashashmi 2026-01-28 20:56

      > this is a Bad Thing.

      "Your hackers were so preoccupied with whether or not they could, they didn't stop to think if they should."

  • By armcat 2026-01-28 11:51 · 2 replies

    This is so awesome, well done LemonSlice team! Super interesting on the ASR->LLM->TTS pipeline, and I agree, you can make it super fast (I did something myself as a 2-hour hobby project: https://github.com/acatovic/ova). I've been following full-duplex models as well and so far couldn't get even PersonaPlex to run properly (without choppiness/latency), but have you peeps tried Sesame, e.g. https://app.sesame.com/?

    I played around with your avatars and one thing that it lacks is that it's "not patient", it's rushing the user, so maybe something to try and finetune there? Great work overall!

    • By lcolucci 2026-01-28 17:54

      This is good feedback thanks! The "not patient" feeling probably comes from our VAD being set to "eager mode" so that the latency is better. VAD (i.e. deciding when the human has actually stopped talking) is a tough problem in all of voice AI. It basically adds latency to whatever your pipeline's base latency is. Speech2Speech models are better at this.
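
A back-of-the-envelope sketch of why endpointing adds latency (the numbers here are made up for illustration, not LemonSlice's measured figures): the agent can't start responding until the VAD has observed enough silence to decide the user is done, and that wait stacks directly on top of the pipeline's base latency.

```python
def response_latency_ms(pipeline_ms: float, vad_silence_ms: float) -> float:
    # The VAD must wait vad_silence_ms of silence before declaring the
    # user's turn over, so endpointing delay adds directly to whatever
    # the ASR->LLM->TTS pipeline already costs.
    return pipeline_ms + vad_silence_ms

# Hypothetical settings: "eager" endpointing feels snappy but risks
# cutting users off; "patient" endpointing is polite but feels slow.
eager = response_latency_ms(400, 200)
patient = response_latency_ms(400, 800)
print(eager, patient)
```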

    • By andrew-w 2026-01-28 15:12

      Thank you! Impressive demo with OVA. Still feels very snappy, even fully local. It will be interesting to see how video plays out in that regard. I think we're still at least a year away from the models being good enough and small enough that they can run on consumer hardware. We compared 6 of the major voice providers on TTFB, but didn't try Sesame -- we'll need to give that one a look. https://docs.google.com/presentation/d/18kq2JKAsSahJ6yn5IJ9g...

  • By pickleballcourt 2026-01-27 20:45 · 1 reply

    One thing I've learnt from movie production is that what actually separates professional from amateur quality is the audio itself. Have you thought about implementing PersonaPlex from NVIDIA or other voice models that can both talk and listen at the same time?

    Currently the conversation still feels too STT-LLM-TTS, which I think a lot of voice agents suffer from (it seems like only Sesame and NVIDIA have nailed natural conversation flow so far). Still, crazy good work training your own diffusion models. I remember taking a look at the latest literature on diffusion and was mind-blown by the advances in the last year or so since the U-Net architecture days.

    EDIT: I see that the primary focus is on video generation not audio.

    • By lcolucci 2026-01-27 21:08 · 2 replies

      This is a good point on audio. Our main priority so far has been reducing latency. In service of that, we were deep in the process of integrating Hume's two-way S2S voice model instead of ElevenLabs. But then we realized that ElevenLabs had made their STT-LLM-TTS pipeline way faster in the past month and left it at that. See our measurements here (they're super interesting): https://docs.google.com/presentation/d/18kq2JKAsSahJ6yn5IJ9g...

      But, to your point, there are many benefits of two-way S2S voice beyond just speed.

      Using our LiveKit integration you can use LemonSlice with any voice provider you like. The current S2S providers LiveKit offers include OpenAI, Gemini, and Grok and I'm sure they'll add Personaplex soon.

      • By echelon 2026-01-28 14:32

        I'm a filmmaker. While what OP said is 100% true, your instincts are right.

        Not only is perfect the enemy of good enough, you're only looking for PMF signal at this point. If you chase quality right now, you'll miss validation and growth.

        The early "Will Smith eating spaghetti" companies didn't need perfect visuals. They needed excited early adopter customers. Now look where we're at.

        In the fullness of time, all of these are just engineering problems and they'll all be sorted out. Focus on your customer.

      • By pickleballcourt 2026-01-27 21:53

        Thanks for sharing! Makes sense to go with latency first.

HackerNews