Show HN: LemonSlice – Upgrade your voice agents to real-time video

133•lcolucci•1w ago

Hey HN, we're the co-founders of LemonSlice (try our HN playground here: https://lemonslice.com/hn). We train interactive avatar video models. Our API lets you upload a photo and immediately jump into a FaceTime-style call with that character. Here's a demo: https://www.loom.com/share/941577113141418e80d2834c83a5a0a9

Chatbots are everywhere and voice AI has taken off, but we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.

We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.

Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.

How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.

From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
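To make the two attention changes described above concrete, here is a minimal sketch in plain Python of a causal, sliding-window attention mask. The function name, the frame-level granularity, and the window size are illustrative assumptions; the post does not describe LemonSlice's actual masking scheme.

```python
def sliding_window_causal_mask(n_frames: int, window: int) -> list[list[bool]]:
    """mask[q][k] is True when query frame q may attend to key frame k.

    Causal: k <= q (no looking at future frames, so output can stream).
    Sliding window: q - k < window (only the most recent `window` frames,
    so attention memory stays bounded however long the video runs).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(k <= q) and (q - k < window) for k in range(n_frames)]
        for q in range(n_frames)
    ]


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to make the pattern visible.
    for row in sliding_window_causal_mask(n_frames=6, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

With full causal attention, frame q would see all q + 1 frames up to itself; with the window, each row has at most `window` allowed entries, which is what caps the memory cost.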

And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.
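For readers wondering what "RoPE from complex to real" could look like: rotary position embeddings are often written as a complex multiplication, but the same rotation can be computed with real sines and cosines only. The sketch below is plain Python and assumes the common RoPE frequency schedule (base 10000) with adjacent-pair layout. It illustrates the general equivalence and is not LemonSlice's kernel.

```python
import cmath
import math


def rope_complex(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate adjacent pairs (x[2i], x[2i+1]) by treating each pair as a complex number."""
    d = len(x)
    if d % 2:
        raise ValueError("vector length must be even")
    out: list[float] = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * theta)
        out += [z.real, z.imag]
    return out


def rope_real(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Same rotation using only real arithmetic (a 2x2 rotation per pair)."""
    d = len(x)
    if d % 2:
        raise ValueError("vector length must be even")
    out: list[float] = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        # (a + bi)(c + si) = (ac - bs) + (as + bc)i
        out += [a * c - b * s, a * s + b * c]
    return out


if __name__ == "__main__":
    vec = [0.5, -1.0, 2.0, 0.25]  # hypothetical 4-dim query vector
    print(rope_complex(vec, pos=7))
    print(rope_real(vec, pos=7))
```

The real-valued form avoids complex dtypes, which tends to be friendlier to fused kernels and reduced precision; whether that was the team's motivation is a guess.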

We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.

Looking forward to your feedback!

EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)

*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.

Comments

zvonimirs•1w ago

We're launching a new AI assistant and I wanted to make it come alive, so I started to play around with LemonSlice and I loved it!! I wanted our assistant to be like a coworker, with the ability to create Loom-style videos. Here's what I created - https://drive.google.com/file/d/1nIpEvNkuXA0jeZVjHC8OjuJlT-3...

Anyway, big thumbs up for the LemonSlice team, I'm excited to see it progress. I can definitely see products start coming alive with tools like this.

sid-the-kid•1w ago

Very cool! Thanks for sharing. I love your use-case of turning an AI coding agent into more of an AI employee. Will be interesting to see if users can connect better with the product this way.

bzmrgonz•1w ago

How did your token spend add up? I'm wary of malicious customers racking up AI charges just for shits and giggles. Even competitors might sponsor some runaway charges.

sid-the-kid•1w ago

hey HN! one of the founders here. as of today, we are seeing informational avatars + roleplaying for training as the most common use cases. The roleplaying use-case was surprising to us. Think a nurse training to triage with AI patients. Or, SDRs practicing lead qualification with different kinds of clients.

buddycorp•1w ago

I'm curious if I can plug in my own OpenAI realtime voice agents into this.

lcolucci•1w ago

Good question! Yes and to do this you'd want to use our "Self-Managed Pipeline": https://lemonslice.com/docs/self-managed/overview. You can combine any TTS, LLM and STT combination with LemonSlice as the avatar layer.

jfaat•1w ago

I'm using an OpenAI realtime voice with LiveKit, and they said they have a LiveKit integration so it would probably be doable that way. I haven't used video in LiveKit though and I don't know how the plugins are set up for it.

lcolucci•1w ago

Yes this is exactly right. Using the LiveKit integration you can add LemonSlice as an avatar layer on top of any voice provider

tmshapland•1w ago

Here's the link to the LiveKit LemonSlice plugin. It's very easy to get started. https://docs.livekit.io/agents/models/avatar/plugins/lemonsl...

sid-the-kid•1w ago

Good question. When using the API, you can bring any voice agent (or LLM). Our API takes in what the agent will say, and then streams back the video of the agent saying it.

For the fully hosted version, we are currently partnered with ElevenLabs.

ed_mercer•1w ago

This looks super awesome!

sid-the-kid•1w ago

thank you! it's by far the thing I have worked on that I am most proud of.

dreamdeadline•1w ago

Cool! Do you plan to expose controls over the avatar’s movement, facial expressions, or emotional reactions so users can fine-tune interactions?

lcolucci•1w ago

Yes we do! Within the web app, there's an "action text prompt" section that allows you to control the overall actions of the character (e.g. "a fox talking with lots of arm motions"). We'll soon expose this in the API so you can control the character's movements dynamically (e.g. "now wave your hand")

sid-the-kid•1w ago

Our text control is good, especially for emotions. For example, you can add the text prompt: "a person talking. they are angry", and the agent will have an angry expression.

You can also control background motions (like ocean waves, or a waterfall or car driving).

We are actively training a model that has better text control over hand motions.

marieschneegans•1w ago

This is next-level!

lcolucci•1w ago

Thanks so much! We're super proud of it

bennyp101•1w ago

Heads up, your privacy policy[0] does not work in dark mode - I was going to comment saying it made no sense, then I highlighted the page and more text appeared :)

[0] https://lemonslice.com/privacy

sid-the-kid•1w ago

Good catch! Working on a fix now.

sid-the-kid•1w ago

Fix deployed! This is why it's good to launch on Hacker News. Thanks for the tip.

bennyp101•1w ago

Nice one - thanks :)

benswerd•1w ago

The last year vs this year is crazy

sid-the-kid•1w ago

thanks! it just barely worked last year, and not much else. this year it's actually good. we got lucky: it's both new tech and turned out to be good quality.

lcolucci•1w ago

Agreed. We were so excited about the results last year and they are SO BAD now by comparison. Hopefully we'll say the same thing again in a couple of months

bluedel•1w ago

Hopefully not. I'm impressed with the engineering, it is a technological achievement, but my only hope right now is that this tech plateaus pretty much immediately. I can't think of a single use-case that wouldn't be at best value-neutral, and at worst extremely harmful to the people interacting with it.

r0fl•1w ago

Wow this is the most impressive thing I’ve seen on hacker news in years!!!!!

Take my money!!!!!!

lcolucci•1w ago

Wow thank you so much :) We're so proud of it!!

skandan•1w ago

Wow this team is non-stop!!! Wild that this small crew is dropping hit after hit. Is there an open polymarket on who acquires them?

lcolucci•1w ago

haha thank you so much! The team is incredible - small but mighty

r0fl•1w ago

Where’s the hn playground to grab a free month?

I have so many websites that would do well with this!

lcolucci•1w ago

https://lemonslice.com/hn - There's a button for "Get 1st month free" in the Developer Quickstart

dang•1w ago

(We've replaced the link to their homepage (https://lemonslice.com/) with the HN playground at the start of the text above)

lcolucci•1w ago

Thanks Dan! The HN playground lets anyone try it out for free without login

r0fl•1w ago

Pricing is confusing

Video Agents: Unlimited agents, up to 3 concurrent calls

Creative Studio: 1min long videos, up to 3 concurrent generations

Does that mean I can have a total of 1 minute of video calls? Or video calls can only be 1 minute long? Or does it mean I can have unlimited calls, 3 calls at a time all month long?

Can I have different avatars or only the same avatar x 3?

Can I record the avatar and make videos and post on social media?

lcolucci•1w ago

Sorry about the confusion. Video Agents and Creative Studio are two entirely different products. Video Agents = interactive video. Creative Studio = make a video and download it. If you're interested in real-time video calls, then Video Agents is the only pricing and feature set you should look at.

thedangler•1w ago

What happens if I want to make the video on the fly and save it for reuse when the same question or topic comes up? No need to render a new video. Just play the existing one.

andrew-w•1w ago

This isn't natively supported -- we are continuously streaming frames throughout the conversation session that are generated in real-time. If you were building your own conversational AI pipeline (e.g. using our LiveKit integration), I suppose it would be possible to route things like this with your own logic. But it would probably include jump cuts and not look as good.

r0fl•1w ago

Wow I can’t get enough of this site! This is literally all I’ve been playing with for like half an hour. Even moved a meeting!

My mind is blown! It feels like the first time I used my microphone to chat with ai

sid-the-kid•1w ago

glad we found somebody who likes it as much as us! BTW, biggest thing we are working to improve is speed of the response. I think we can make that much faster.

lcolucci•1w ago

This comment made my day! So happy you're liking it

koakuma-chan•1w ago

> You're probably thinking, how is this useful

I was thinking why the quality is so poor.

sid-the-kid•1w ago

curious what avatar you think is poor quality? Or, what you think is poor quality. i want to know :)

koakuma-chan•1w ago

Low res and low fps. Not sure if lipsync is poor, or if low fps makes it look poor. Voice sounds low quality, as if recorded on a bad mic, and doesn't feel like it matches the avatar.

sid-the-kid•1w ago

thanks for the feedback. that's helpful. Ya, some avatars have worse lip sync than others. It depends a little on how zoomed in you are.

I am double checking now to make 100% sure we return the original audio (and not the encoded/decoded audio).

We are working on high-res.

koakuma-chan•1w ago

Good luck.

wumms•1w ago

You could add a Max Headroom to the hn link. You might reach real time by interspersing freeze frames, duplicates, or static.

sid-the-kid•1w ago

1) yes on Max Headroom. we are on it. 2) it already is real time...?

wumms•1w ago

Whoops! Mistook the "You're about to speak with an AI."-progress bar for processing delay.

sid-the-kid•1w ago

Makes sense. The init should be about 10s. But, after that, it should be real time. TBH, this is probably a common confusion. So thanks for calling it out.

lcolucci•1w ago

I wonder if we should make the UI a more common interface (e.g. "the call is ringing") to avoid this confusion?

It's a normal mp4 video that's looping initially (the "welcome message") and then as soon as you send the bot a message, we connect you to a GPU and the call becomes interactive. Connecting to the GPU takes about 10s.

sid-the-kid•1w ago

And, just like that, Max Headroom is back: https://lemonslice.com/try/agent_ccb102bdfc1fcb30

sbarre•1w ago

That.. is not Max Headroom.

andrew-w•1w ago

I wonder how it would come across with the right voice. We're focused on building out the video layer tech, but at the end of the day, the voice is also pretty important for a positive experience.

lcolucci•1w ago

Can you help us make him? What's the right voice? https://lemonslice.com/hn

sbarre•1w ago

https://www.youtube.com/watch?v=cYdpOjletnc

shj2105•1w ago

Not working on mobile iOS

lcolucci•1w ago

what's not working for you?

convivialdingo•1w ago

That's super impressive! Definitely one of the best quality conversational agents I've tried syncing A/V and response times.

The text processing is running Qwen / Alibaba?

sid-the-kid•1w ago

Thank you! Yes, right now we are using Qwen for the LLM. They also released a TTS model that we have not tried yet, which is supposed to be very fast.

lcolucci•1w ago

Qwen is the default but you can pick any LLM in the web app (though not the HN playground)

pickleballcourt•1w ago

One thing I've learnt from movie production is that what actually separates professional from amateur quality is the audio itself. Have you thought about implementing PersonaPlex from NVIDIA or other voice models that can both talk and listen at the same time?

Currently the conversation still feels too STT-LLM-TTS, which I think a lot of voice agents suffer from (seems like only Sesame and NVIDIA so far have nailed the natural conversation flow). Still, crazy good work training your own diffusion models. I remember taking a look at the latest literature on diffusion and was mind-blown by the advances in the last few years since the U-Net architecture days.

EDIT: I see that the primary focus is on video generation not audio.

lcolucci•1w ago

This is a good point on audio. Our main priority so far has been reducing latency. In service of that, we were deep in the process of integrating Hume's two-way S2S voice model instead of ElevenLabs. But then we realized that ElevenLabs had made their STT-LLM-TTS pipeline way faster in the past month and left it at that. See our measurements here (they're super interesting): https://docs.google.com/presentation/d/18kq2JKAsSahJ6yn5IJ9g...

But, to your point, there are many benefits of two-way S2S voice beyond just speed.

Using our LiveKit integration you can use LemonSlice with any voice provider you like. The current S2S providers LiveKit offers include OpenAI, Gemini, and Grok and I'm sure they'll add Personaplex soon.

pickleballcourt•1w ago

Thanks for sharing! Makes sense to go with latency first.

echelon•1w ago

I'm a filmmaker. While what OP said is 100% true, your instincts are right.

Not only is perfect the enemy of good enough, you're only looking for PMF signal at this point. If you chase quality right now, you'll miss validation and growth.

The early "Will Smith eating spaghetti" companies didn't need perfect visuals. They needed excited early adopter customers. Now look where we're at.

In the fullness of time, all of these are just engineering problems and they'll all be sorted out. Focus on your customer.

korneelf1•1w ago

Wow this is really cool, haven't seen real-time video generation that is this impressive yet!

lcolucci•1w ago

Thank you so much! It's been a lot of fun to build

ProjectBarks•1w ago

Removing - Realized I made a mistake

dang•1w ago

I don't see any evidence that r0fl's comments are astroturfing. Sometimes people are just enthusiastic.

I appreciate your concern for the quality of the site - the fact that the community here cares so much about protecting it is the main reason why it continues to survive. Still, it's against HN's rules to post like you did here. Could you please review https://news.ycombinator.com/newsguidelines.html? Note this part:

"Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data."

sid-the-kid•1w ago

it's a fair concern. but, we don't know r0fl. and we are not astroturfing.

even I am surprised with how many openly positive comments we are getting. it's not been our experience in the past.

jedwhite•1w ago

That's an interesting insight about "stacking tricks" together. I'm curious where you found that approach hit limits. And what gives you an advantage if anything against others copying it. Getting real-time streaming with a 20B parameter diffusion model and 20fps on a single GPU seems objectively impressive. It's hard to resist just saying "wow" looking at the demo, but I know that's not helpful here. It is clearly a substantial technical achievement and I'm sure lots of other folks here would be interested in the limits with the approach and how generalizable it is.

sid-the-kid•1w ago

Good question! Software gets democratized so fast that I am sure others will implement similar approaches soon. And, to be clear, some of our "speed upgrades" are pieced together from recent DiT papers. I do think getting everything running on a single GPU at this resolution and speed is totally new (as far as i have seen).

I think people will just copy it, and we just need to continue moving as fast as we can. I do think that a bit of a revolution is happening right now in real-time video diffusion models. There are so many great papers being published in that area in the last 6 months. My guess is that many DiT models will be real time within 1 year.

sid-the-kid•1w ago

One thing that is interesting: LLM pipelines have been highly optimized for speed (since speed is directly related to cost for companies). That is just not true for real-time DiTs. So, there is still lots of low-hanging fruit for how we (and others) can make things faster and better.

storystarling•1w ago

Curious about the memory bandwidth constraints here. 20B parameters at 20fps seems like it would saturate the bandwidth of a single GPU unless you are running int4. I assume this requires an H100?

andrew-w•1w ago

Yep, the model is running on Hopper architecture. Anything less was not sufficient in our experiments.
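A back-of-envelope version of the bandwidth question above, in Python. It assumes every forward pass reads all weights from GPU memory once and ignores activations and the KV cache. The number of passes per second is left as a free parameter, since the thread does not say how many frames each pass produces or how many denoising steps are used.

```python
def weight_traffic_gb_per_s(params: float, bytes_per_param: float,
                            passes_per_second: float) -> float:
    """GB/s of weight reads if each forward pass reads every parameter once."""
    return params * bytes_per_param * passes_per_second / 1e9


if __name__ == "__main__":
    PARAMS = 20e9  # 20B parameters, from the post
    # Hypothetical worst case: one full forward pass per output frame at 20fps.
    for label, nbytes in [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
        print(label, weight_traffic_gb_per_s(PARAMS, nbytes, 20), "GB/s")
```

Under these assumptions, one pass per frame in bf16 needs about 800 GB/s for weights alone, and several denoising steps per frame would multiply that unless each pass yields several frames. For comparison, NVIDIA's published memory bandwidth for the H100 SXM is about 3.35 TB/s.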

jedwhite•1w ago

> I do think getting everything running on a single GPU at this resolution and speed is totally new

Thanks, it seemed to be the case that this was really something new, but HN tends to be circumspect so wanted to check. It's an interesting space and I try to stay current but everything is moving so fast. But I was pretty sure I hadn't seen anyone do that. It's a huge achievement to do it first and make it work for real like this! So well done!

peddling-brink•1w ago

I got really excited when I saw that you were releasing your model.

> Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.

But after digging around for a while, searching for a huggingface link, I’m guessing this was just an unfortunate turn of phrase, and you are not, in fact, releasing an open weights model that people can run themselves?

Oh well, this looks very cool regardless and congratulations on the release.

sid-the-kid•1w ago

Thank you! We are considering releasing an open-source version of the model. Somebody will do it soon. Might as well be us. We are mostly concerned with the additional overhead of releasing and then supporting it. So, TBD.

js4ever•1w ago

Overhead? None. Your real concern is: will potential customers run the model themselves and skip us?

The answer is no, because you will eventually release a subpar model, not your SOTA model.

Also, people don't have the infrastructure to run this at scale (100-500 concurrent users); at best they can run it for 1-2 concurrent users.

This could be a good way for people to test it and then use your infra.

Ah but you do have an online demo, so you might think this is enough, WRONG.

andrew-w•1w ago

Thanks! And sorry! I can see how our wording there could be misconstrued. With a real-time model, the streaming infrastructure matters almost as much as the weights themselves. It will be interesting to see how easily they can be commoditized in the future.

swyx•1w ago

this is like Tavus but it doesn't suck. congrats!

lcolucci•1w ago

Thank you! And the cool thing is it's actually a full video world model. We'll expose more of those capabilities soon

FatalLogic•1w ago

Your demo video defaults to play at 1.5x speed

You probably didn't intend to do that

lcolucci•1w ago

whoops I actually did set that on purpose. I guess I like watching things sped up and assumed others did too :) But you can change it.

ripped_britches•1w ago

Very freaking impressive!

lcolucci•1w ago

Thank you so much!

snowmaker•1w ago

I made a golden retriever you can talk to using Lemon Slice: https://lemonslice.com/hn/agent_5af522f5042ff0a8

Having a real-time video conversation with an AI is a trippy feeling. Talk about a "feel the AGI" moment, it really does feel like the computer has come alive.

knowitnone3•1w ago

great. you'll never have to talk to another human being ever again

lcolucci•1w ago

So cool! I love how he sometimes looks down and over his glasses

givinguflac•1w ago

While the tech is impressive, from an end user interacting with this perspective, I want nothing to do with it, and I can’t support it. Neat as a one off but destructive imho.

It’s bad enough some companies are doing AI-only interviews. I could see this used to train employees, interview people, replace people at call centers… it’s the next step towards an absolute nightmare. Automated phone trees are bad enough.

There will likely be little human interaction in those and many other situations, and hallucinations will definitely disqualify some people from some jobs.

I’m not anti AI, I’m anti destructive innovation in AI leading to personal health and societal issues, just like modern social media has. I’m not saying this tool is that, I’m saying it’s a foundation for that.

People can choose to not work on things that lead to eventual negative outcomes, and that’s a personal choice for everyone. Of course hindsight is 20/20 but some things can certainly be foreseen.

Apologies for the seemingly negative rant, but this positivity echo chamber in this thread is crazy and I wanted to provide an alternative feedback view.

bbor•1w ago

Don't be naive -- if I don't make the Torment Nexus, someone else will ;)

canada_dry•1w ago

> AI-only interviews

Lord. I can see this quickly extending even further into HR, e.g. performance reviews: the employee must 'speak' to an HR avatar about their performance in the last quarter. The AI will then summarize the discussion for the manager and give them coaching tips.

It sounds valuable and efficient but the slippery slope is all but certain.

jonsoft•1w ago

I asked the Spanish tutor if he/it was familiar with the terms seseo[0] and ceceo[1] and he said it wasn't, which surprised me. Ideally it would be possible to choose which Spanish dialect to practise as mainland Spain pronunciation is very different to Latin America. In general it didn't convince me it was really hearing how I was pronouncing words, an important part of learning a language. I would say the tutor is useful for intermediate and advanced speakers but not beginners due to this and the speed at which he speaks.

At one point subtitles written in pseudo Chinese characters were shown; I can send a screenshot if this is useful.

The latency was slightly distracting, and as others have commented the NVIDIA Personaplex demos [2] are very impressive in this regard.

In general, a very positive experience, thank you.

[0] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis...

[1] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis...

[2] https://research.nvidia.com/labs/adlr/personaplex/

andrew-w•1w ago

Thanks for the feedback. The current avatars use a STT-LLM-TTS pipeline (rather than true speech-to-speech), which limits nuanced understanding of pronunciations. Speech-to-speech models should solve this problem. (The ones we've tried so far have counterintuitively not been fast enough.)

sid-the-kid•1w ago

ooof. You saw the Chinese text. Yup, that's super annoying. We are trying to squash that hallucination.

Thanks for the feedback! That's helpful!

Terretta•1w ago

the chinese text happened last night in your main chat agent widget, the cartoon woman professing to be in a town in brazil with a lemon tree on her cupboard. she claimed it was a test of subtitling then admitted it wasn't.

btw, she gives helpful instructions like "/imagine" whatever but the instructions only seem to work about 50% of the time. meaning, try the same command or variants a few times, and it works about half of them. she never did shift out of aussie accent though.

she came up with a remarkably fanciful explanation why as a brazilian she sounded aussie and why imagining native accent like she said would work didn't...

i was shocked when /imagine face left turn to the side did actually work, the agent was in side profile and precisely as natural as the original front facing avatar

all in all, by far the best agent experience i've played with!

andrew-w•1w ago

So glad you enjoyed it! We've been able to significantly reduce those text hallucinations with a few tricks, but it seems they haven't been fully squashed. The /imagine command only works with the image at the moment, but we'll think about ways to tie that into the personality and voice. Thanks for the feedback!

wahnfrieden•1w ago

Please add InWorld TTS integration

lcolucci•1w ago

That's a good one. I would suggest asking them to integrate with LiveKit. Then it'll be really easy to combine InWorld and LemonSlice.

davidz•1w ago

we gotchu: https://docs.livekit.io/agents/models/tts/inference/inworld/

sid-the-kid•1w ago

never heard of InWorld. Pretty impressive.

dang•1w ago

https://lemonslice.com/hn/agent_4d10f62632fd841b

(Update of https://news.ycombinator.com/item?id=43785494)

lcolucci•1w ago

The curve is accelerating!

pbhjpbhj•1w ago

Sounds like an innovative approach, any IP protection on your tech?

Have your early versions made any sort of profit?

Absolutely amazing stuff to me. A teenager I very briefly showed it to was nonplussed - 'it's a talking head, isn't that really easy to do' ...

andrew-w•1w ago

Haha, I kind of get that reaction. Convincing the world "this was hard to do" is generally not easy. Re: user uploads, we're operating in good faith at the moment (no built-in IP moderation). This hasn't been an issue so far. Current pricing reflects our operating costs. Each end-user gets a dedicated GPU for the duration of a call, which is expensive. Advancements on the model-side should eventually allow us to parallelize this.

anigbrowl•1w ago

Absolutely Do Not Want.

> EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)

Sure, that kind of thing is great fun. But photorealistic avatars are gonna be abused to hell and back and everyone knows it. I would rather talk to a robot that looks like a robot, i.e. C-3PO. I would even chat with a scary skeleton terminator. I do not want to talk with a convincingly-human-appearing terminator. Constantly checking whether any given human appearing on a screen is real or not is a huge energy drain on my primate brain. I already find it tedious with textual data; doing it on realtime video imagery consumes considerably more energy.

Very impressive tech, well done on your engineering achievement and all, but this is a Bad Thing.

echelon•1w ago

The dichotomy of AI haters and AI dreamers is wild.

OP, I think this is the coolest thing ever. Keep going.

Naysayers have some points, but nearly every major disruptive technology has had downsides that have been abused. (Cars can be used for armed robbery. Steak knives can be used to murder people. Computers can be used for hacking.)

The upsides of tech typically far outweigh the downsides. If a tech is all downsides, then the government just bans it. If computers were so bad, only government labs and facilities would have them.

I get the value in calling out potential dangers, but if we do this we'll wind up with the 70 years where we didn't build nuclear reactors because we were too afraid. As it turns out, the dangers are actually negligible. We spent too much time imagining what would go wrong, and the world is now worse for it.

The benefits of this are far more immense.

While the world needs people who look at the bad in things, we need far more people who dream of the good. Listen to the critiques, allow it to aid in your safety measures, but don't listen to anyone who says the tech is 100% bad and should be stopped. That's anti nuclear rhetoric, and it's just not true.

Keep going!

lcolucci•1w ago

Well put - and thanks, we'll keep building. Still chasing this level of magic: https://youtu.be/gL5PgvFvi8A?si=I__VSDqkXBdBTVvB&t=173 Not to mention language tutors, training experiences, and more.

anigbrowl•1w ago

I am not an AI hater, I use it every day. I made specific criticisms of why I think photorealistic realtime AI avatars are a problem; you've posted truisms. Please tell me what benefits you expect to reap from this.

zestyping•1w ago

The primary purpose of generating real-time video of realistic-looking talking people is deception. The explicit goal is to make people believe that they're talking to a real person when they aren't.

It's on you to identify the "immense" benefits that outweigh that explicit goal. What are they?

nashashmi•1w ago

> this is a Bad Thing.

"Your hackers were so preoccupied with whether or not they could, they didn't stop to think if they should."

mdrzn•1w ago

"we're releasing our new model" is it downloadable and runnable in local? Could I create a "vTuber" persona with this model?

andrew-w•1w ago

We have not released the weights, but it is fully available to use in your websites or applications. I can see how our wording there could be misconstrued -- sorry about that. You can absolutely create a vTuber persona. The link in the post is still live if you want to create one (as simple as uploading an image, selecting a voice, and defining the personality). We even have a prebuilt UI you can embed in a website, just like a youtube video.

bn-l•1w ago

I wish I could invest in this company. Really. This is the most exciting revenue opportunity I’ve seen during this recent AI hype cycle.

lcolucci•1w ago

That's super nice of you to say. Thank you!

armcat•1w ago

This is so awesome, well done LemonSlice team! Super interesting on the ASR->LLM->TTS pipeline, and I agree, you can make it super fast (I did something myself as a 2-hour hobby project: https://github.com/acatovic/ova). I've been following full-duplex models as well and so far couldn't get even PersonaPlex to run properly (without choppiness/latency), but have you peeps tried Sesame, e.g. https://app.sesame.com/?

I played around with your avatars and one thing that it lacks is that it's "not patient", it's rushing the user, so maybe something to try and finetune there? Great work overall!

andrew-w•1w ago

Thank you! Impressive demo with OVA. Still feels very snappy, even fully local. It will be interesting to see how video plays out in that regard. I think we're still at least a year away from the models being good enough and small enough that they can run on consumer hardware. We compared 6 of the major voice providers on TTFB, but didn't try Sesame -- we'll need to give that one a look. https://docs.google.com/presentation/d/18kq2JKAsSahJ6yn5IJ9g...

lcolucci•1w ago

This is good feedback thanks! The "not patient" feeling probably comes from our VAD being set to "eager mode" so that the latency is better. VAD (i.e. deciding when the human has actually stopped talking) is a tough problem in all of voice AI. It basically adds latency to whatever your pipeline's base latency is. Speech2Speech models are better at this.
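As a rough illustration of the latency tradeoff described above, here is a sketch of silence-based end-of-turn detection in Python. The per-frame speech flags and the `min_silence_frames` threshold are hypothetical; the comment does not say how LemonSlice's VAD or its "eager mode" is implemented.

```python
from typing import Optional, Sequence


def end_of_turn_frame(is_speech: Sequence[bool],
                      min_silence_frames: int) -> Optional[int]:
    """Return the index of the frame at which the turn is declared over.

    The turn ends once `min_silence_frames` consecutive non-speech frames
    follow some speech. Returns None if that never happens.
    """
    started = False
    silence_run = 0
    for i, speaking in enumerate(is_speech):
        if speaking:
            started = True
            silence_run = 0
        elif started:
            silence_run += 1
            if silence_run >= min_silence_frames:
                return i
    return None


if __name__ == "__main__":
    # Speech, a short mid-sentence pause, more speech, then real silence.
    frames = [True] * 5 + [False] * 3 + [True] * 2 + [False] * 10
    print("eager  :", end_of_turn_frame(frames, min_silence_frames=2))
    print("patient:", end_of_turn_frame(frames, min_silence_frames=5))
```

In this example the eager threshold ends the turn during the mid-sentence pause, while the patient one waits through it and only fires in the final silence. The extra frames it waits are added on top of the rest of the pipeline's latency before the reply can start.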

leetrout•1w ago

Quick feedback if you're still monitoring the thread:

I did /imagine cheeseburger and /imagine a fire extinguisher and both were correctly generated, but the agent has no context. When I ask what they are holding, in both cases they ramble about not holding anything and reference lemons and lemon trees.

I expected it to retain the context as the chat continues. If I ask it what it imagined it just tells me I can use /imagine.

andrew-w•1w ago

Not something we had thought to do tbh, but it would definitely enhance the experience. And it should be reasonable to do. Thanks!

jamesdelaneyie•1w ago

I didn't know /imagine could be followed by a prompt, but similarly I asked the avatar about its appearance and it stated it had none. You should probably give it the context of what its appearance is like. The same thing happened for questions like: Where are you? What are you holding? Who's that behind you? etc.

lcolucci•1w ago

This is so obvious now that you say it (* facepalm *). We definitely need to give the LLM context on the appearance (both from the initial image as well as any /imagine updates during the call). Thanks for pointing it out!

lcolucci•1w ago

Good idea. We need to do that. I'm also excited to push the /imagine stuff further and have B-roll interspersed with the talking (like a documentary) or even follow the character around as they move (like a video game)

Escapado•1w ago

This was interesting. Had a 5-minute chat with the Outsider from the Dishonored series. Just a one-sentence prompt and its phrasing was at least 60% there, but less cold and nicer in a sense than the video game counterpart. Still an interesting experiment. But I also know that maybe 12-24 months down the line, once this is available in real time on device, there will be an ungodly amount of smut coming from this.

dsrtslnd23•1w ago

where can I find the 20B model? it sounded like it would be open - but I am not sure with the phrasing...

andrew-w•1w ago

We have not released the weights, but it is fully available to use in your websites or applications. I can see how our wording there could be misconstrued -- sorry about that.

slake•1w ago

That's amazing. Feels like a major step ahead. No lag, very snappy. Outstanding work.

Feels like those sci-fi shows where you can talk to Hari Seldon even though he lived like 100 years ago.

My prediction: this will become really, really big.

zestyping•1w ago

When you generate real-time video of realistic-looking talking characters, the definition of success is fooling people into believing they are talking to a real person when they aren't.

If you pursue this, your explicit goal is deception, and it's a massively harmful kind of deception. I don't see how you can claim to be operating ethically here if that's your goal.

lcolucci•1w ago

Do you think the same about text that is indistinguishable from human-written text (LLM chatbots)? Or voice that is indistinguishable from a human talking?

Illegal things, like fraud and impersonation, are illegal. There's a difference between the tool and the actions people do with the tool.

There are tons of useful applications of interactive avatars - from corporate training to kids education to language learning and more. Plus, why would you want to stop this little guy from existing in the world? :) https://lemonslice.com/try/alien

beast200•1w ago

That's really impressive!

lcolucci•1w ago

Thank you!

Obertr•5d ago

Have just tried it. Impressive. We are definitely moving into this space.

Questions: what are the main differences between you and anam.ai? They also do real-time lip sync, plus it looks like they are cheaper. Do you optimise for price or quality? And do you focus on lip sync or full movement etc.?
