Launch HN: Leaping (YC W25) – Self-Improving Voice AI

49•akyshnik•8h ago

Hey HN, I'm Arkadiy from Leaping AI (https://leapingai.com). Leaping lets you build voice AI agents in a multi-stage, graph-like format that makes testing and improvement much easier. By evaluating each stage of a call, we can trace errors and regressions to a particular stage. Then we autonomously vary the prompt for that stage and A/B test it, allowing agents to self-improve over time.

You can talk to one of our bots directly at https://leapingai.com, and there’s a demo video at https://www.youtube.com/watch?v=xSajXYJmxW4.

Large companies are understandably reluctant to have AI start picking up their phone calls—the technology kind of works, but often not very well. If they do take the plunge, they often end up spending months tuning the prompts for just one use-case, and sometimes never even end up releasing the voice bot.

The problem is two-sided: it's non-trivial to specify the exact way a bot should behave using plain language, and it's tedious to ensure the LLM always follows your instructions the way you intended them.

Existing voice AI solutions are a pain to set up for complex use cases. They require months of prompting for all the edge cases before going live, and then months of monitoring and prompt improvement afterwards. We do that better than human prompters, and much faster, by running a continuous analysis + testing loop.

Our tech is roughly divided into three subcomponents: core library, voice server, and self-improvement logic. The core library models and executes the multi-stage (think n8n-style) voice agents. For the voice server we are using the ol’ reliable cascading way of STT->LLM->TTS. We tried out the voice-to-voice models, and although they felt really great to talk to, function-calling performance was, as expected, much worse, so we are still waiting for them to get better.
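To make the multi-stage idea concrete, here is a minimal sketch of a stage graph driven by a cascaded loop. This is not our actual core library: `Stage`, `run_turns`, and `toy_llm` are hypothetical names, and the LLM step is replaced by a keyword-matching placeholder so the example runs on its own. The point it illustrates is that every turn is recorded against the stage that handled it, which is what makes per-stage evaluation possible.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Stage:
    """One node of the agent graph: a prompt plus named outgoing edges."""
    name: str
    prompt: str
    # Maps an outcome label (chosen by the LLM step) to the next stage name.
    transitions: Dict[str, str] = field(default_factory=dict)


def run_turns(
    stages: Dict[str, Stage],
    start: str,
    utterances: List[str],
    llm: Callable[[str, str], Tuple[str, Optional[str]]],
) -> List[Tuple[str, str, str]]:
    """Walk the graph one caller utterance at a time.

    `llm(prompt, text)` stands in for the LLM step of the cascade and returns
    (reply, outcome); an outcome of None means "stay in the current stage".
    Returns a trace of (stage_name, caller_text, reply) tuples.
    """
    trace = []
    current = stages[start]
    for text in utterances:  # in a real call, `text` would come from STT
        reply, outcome = llm(current.prompt, text)
        trace.append((current.name, text, reply))  # `reply` would go to TTS
        if outcome is not None and outcome in current.transitions:
            current = stages[current.transitions[outcome]]
    return trace


def toy_llm(prompt: str, text: str) -> Tuple[str, Optional[str]]:
    """Keyword-matching placeholder so the sketch runs without a model."""
    if "cancel" in text.lower():
        return "Sure, I can help you cancel.", "cancel"
    return "How can I help you today?", None


STAGES = {
    "greeting": Stage("greeting", "Greet the caller.", {"cancel": "cancellation"}),
    "cancellation": Stage("cancellation", "Collect the booking number."),
}

trace = run_turns(
    STAGES, "greeting", ["Hi", "I want to cancel my trip", "It is 123"], toy_llm
)
for stage_name, caller_text, reply in trace:
    print(f"[{stage_name}] caller: {caller_text!r} -> agent: {reply!r}")
```

Because the trace is keyed by stage, an eval can later be scoped to just the turns a given stage handled.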

The self-improvement works by first taking conversation metrics and evaluation results to produce ‘feedback’, i.e. specific ideas for how the voice agent setup could be improved. After enough feedback is collected, we trigger a run of a specialized self-improvement agent. It is a Cursor-style AI with access to various tools that let it change the main voice agent. It can rewrite prompts, configure a stage to use a summarized conversation instead of the full one, and more. Each iteration produces a new snapshot of the agent, which lets us route a small part of the traffic to it and promote it to production if things look ok. This loop can be set to run without any human involvement, which is what makes agents self-improve.
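The snapshot, route, and promote step can be sketched roughly as follows. This is only an illustration under stated assumptions: `Snapshot`, `pick_snapshot`, and `should_promote` are hypothetical names, "success" is whatever metric the agent is optimized for, and the promotion rule is a naive threshold (minimum sample size plus minimum lift) rather than a proper significance test.

```python
import random
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One version of the agent configuration plus its observed results."""
    version: int
    calls: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.calls if self.calls else 0.0


def pick_snapshot(
    prod: Snapshot, candidate: Snapshot, canary_share: float, rng: random.Random
) -> Snapshot:
    """Send roughly `canary_share` of incoming calls to the candidate."""
    return candidate if rng.random() < canary_share else prod


def should_promote(
    prod: Snapshot, candidate: Snapshot, min_calls: int = 200, min_lift: float = 0.02
) -> bool:
    """Promote only after enough calls and a clear improvement over prod.

    The thresholds are made-up defaults for illustration.
    """
    if candidate.calls < min_calls:
        return False
    return candidate.success_rate - prod.success_rate >= min_lift


prod = Snapshot(version=1, calls=1000, successes=700)
candidate = Snapshot(version=2, calls=300, successes=225)
rng = random.Random(0)

chosen = pick_snapshot(prod, candidate, canary_share=0.05, rng=rng)
print("this call goes to version", chosen.version)
print("promote candidate:", should_promote(prod, candidate))
```

In practice you would want a statistical test here instead of a fixed lift, since small canary samples are noisy.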

Leaping is use-case agnostic, but we currently focus on inbound customer support (travel, retail, real estate, etc.) and lead pre-qualification (Medicare, home services, performance marketing) since we have a lot of success stories there.

We started out in Germany since that’s where we were in university, but initially growth was challenging. We decided to target enterprise customers right away and they showed reluctance to adopt voice AI as the front-door ‘face’ of their company. Additionally, for an enterprise with thousands of calls daily, it is infeasible to monitor all the calls and tune agents manually. To address their very valid concerns, we put all effort into reliability—and still haven’t gotten around to offering self-serve access, which is one reason we don’t have fixed pricing yet. (Also, with some clients we have outcome-based pricing, i.e. you pay nothing for calls that didn't convert a lead, only the ones that did.)

Things picked up momentum ever since we got into YC and moved to the US, but the cautious sentiment is also present here if you try to sell to big enterprises. We believe that doing evals, simulation, and A/B testing really really well is our competitive edge and what will enable us to solve large, sensitive use cases.

We’d love to hear your thoughts and feedback!

Comments

shahahmed•8h ago

congrats on launching! how are y'all managing evals?

kevinwu2981•7h ago

Thanks! We provide eval templates that can be applied to specific stages or to the whole conversation. Users can specify their own evals that can be as granular as they'd like. We're also working on a conversation simulation feature that lets users quickly iterate on evals by simulating previous real conversations and seeing if the eval output aligns with human judgement.
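As a rough illustration of stage-scoped vs. whole-conversation evals (a sketch only, not our eval API: `run_eval`, `asked_for_booking_number`, and the turn format are made up for this example), an eval can be a plain predicate over turns, with the scope decided by a filter:

```python
from typing import Callable, Dict, List, Optional

Turn = Dict[str, str]  # keys: "stage", "speaker", "text"
Check = Callable[[List[Turn]], bool]


def run_eval(turns: List[Turn], check: Check, stage: Optional[str] = None) -> bool:
    """Apply `check` to one stage's turns, or to the whole call if stage is None."""
    scoped = [t for t in turns if stage is None or t["stage"] == stage]
    return check(scoped)


def asked_for_booking_number(turns: List[Turn]) -> bool:
    """Example eval: did the agent ask for a booking number at some point?"""
    return any(
        t["speaker"] == "agent" and "booking number" in t["text"].lower()
        for t in turns
    )


CALL = [
    {"stage": "greeting", "speaker": "agent", "text": "Hi, how can I help?"},
    {"stage": "greeting", "speaker": "caller", "text": "I need to cancel my flight."},
    {"stage": "cancellation", "speaker": "agent", "text": "What is your booking number?"},
]

print(run_eval(CALL, asked_for_booking_number, stage="cancellation"))
print(run_eval(CALL, asked_for_booking_number, stage="greeting"))
print(run_eval(CALL, asked_for_booking_number))
```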

P.S. Arkadiy is locked out of his HN account due to the anti-procrastination settings. HN team, can you plz help? :)

koakuma-chan•8h ago

How do you compare to livekit? I don't see any docs on your website.

kevinwu2981•7h ago

Livekit started as infra for real-time audio/video applications. We are actually using them for WebRTC. They recently started growing into the voice AI space, but are still more of an infra solution, while we are an end-to-end platform.

What sets us apart is multi-stage conversation modeling, out-of-the-box evals, and self-improvement!

triyambakam•7h ago

Is this comparable to VAPI?

kevinwu2981•7h ago

Comparable in some aspects. Their focus is dev tooling, while we are a mid-market and enterprise solution. Our tools are geared towards letting non-technical users at those companies easily create voice AI agents for customer service, lead qualification, and ops use cases.

hajrice•7h ago

Impressive demo, just wish I didn't have to request a demo and could just sign up.

Request a demo button also does nothing other than change the text on success - not sure if it even went through...

kevinwu2981•7h ago

I got the demo request:) Let me reply to you

isabelleilyia•7h ago

congrats! Some time ago we were giving client intake in legal a try with a voice AI product, but we never were able to get the success rate higher than really low numbers (especially with sensitive use cases like legal where people will reject the call instantly if it's a bot). Have you guys seen use cases like this? What ranges of success rates/engagement times have you seen?

kevinwu2981•6h ago

Why do you think it didn't work out in legal? We currently don't focus on that domain.

In general, we currently have really high success rates with relatively constrained use cases, such as lead qualification and well scoped customer service use cases (e.g., appointment booking, travel cancellation).

In general, voice AI is hard because it is WYSIWYG: there is no human in the loop between what the bot says and what the person on the other end hears. Not sure about legal, but for more complex use cases (e.g., product refunds in retail), there are many permutations in how two different customers might frame the same issue, so it is harder to instruct the AI agent accurately enough to guarantee high automation rates (given the multitude of edge cases).

We therefore believe that voice AI works best when the bot is leading the conversation and it is always very clear what the next steps are...

isabelleilyia•5h ago

I think the problem relates to the core value proposition of automating an intake department with voice AI. The best voice AI customer is in an industry where there is a clear increase in value that comes with the ability to handle a larger volume of calls. This was not the case in the legal world, where one missed client might be a loss of millions (and many firms would live off of < 10 successful cases a year).

Therefore I think the verticals of customer service and lead pre-qualification make a lot more sense. Since you guys have the numbers, I am curious to learn more about the way you define constraints for the bot and how often calls in these verticals deviate from these constraints.

I'm also curious about your opinions/if you've seen any successful use cases where the bot has to be a bit more "creative" to either string together information given to it or make reasonable extrapolations beyond the information it has.

lostmsu•6h ago

Have you tried your solution in noisy environments? Like a call to a person in a restaurant.

akyshnik•4h ago

Noisy is ok, but it doesn't work that well when there are multiple clear speakers and not much noise. We are planning to add speaker diarization to address this.

joelthelion•5h ago

Your demo is nice, but why don't you show a call? That would be a lot more convincing...

akyshnik•4h ago

Only for data privacy reasons.

xp84•3h ago

Weird, because it seems like the demo video is pretend data anyway ("Mr. Smith", etc). I agree, I would like to see a more fully-baked demo where you connect it to a testing CRM and a toy order api and get it to answer several customer queries using live information.

ajeet•5h ago

Congrats on the launch! I work in this space, and fwiw I strongly agree with the idea of A/B testing + continuous improvement. I have found that it is relatively easy to set up A/B tests, but much harder for stakeholders to draw the right conclusions.

vinodhkps•5h ago

what does the feedback loop look like to your agents - wonder how hard it will be to generalize metrics across these agents!

akyshnik•4h ago

feedback is generated based on evals. example: eval: function foo wasn't triggered even though [...]

feedback (exaggerated): 1. change stage prompt 2. change function description 3. add extra instructions to the end of the context

metrics are easy to generalize (e.g. call transfer rate), but baseline is different for each agent, so we're interpreting only the changes, not the absolute values (in the context of self-improvement).
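a tiny sketch of what "interpreting only the changes" can look like (illustrative only; `relative_change` and the numbers are made up): two agents with very different baselines for call transfer rate show the same relative improvement.

```python
def relative_change(baseline: float, current: float) -> float:
    """Change of a metric relative to the agent's own baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a relative change")
    return (current - baseline) / baseline


# Hypothetical call transfer rates (lower is better) before and after a change.
agent_a = relative_change(baseline=0.40, current=0.30)
agent_b = relative_change(baseline=0.10, current=0.075)

# Absolute values differ a lot (0.30 vs 0.075), but both improved by about 25%.
print(f"agent A: {agent_a:+.0%}, agent B: {agent_b:+.0%}")
```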

costcopizza•4h ago

Very impressive! How many jobs do you estimate this could displace?

akyshnik•3h ago

It's a huge industry, so a lot. The job is really stressful and has a lot of employee churn, so it's not really something I feel bad about. Pressing elevator buttons used to be a job too.

lostmsu•3h ago

What framework did you use for flow building?

ljclifford•3h ago

Super awesome demo! The contact center market, including inbound customer support, is incredibly ripe for disruption, and I'm sure you guys will be on the forefront of that.

Kinda funny how many amazing CX companies start in Germany!

I’m the CEO & founder of Rime, so I’ve been following your progress with real interest. Feel free to reach out and I’d love to explore ways we might collaborate. Until then, wishing you tons of success on this big milestone!

cootsnuck•3h ago

How well does this scale? Like how many simultaneous calls can a single voice agent handle through your platform?

AndrewKemendo•3h ago

I want this as an option to handle all my personal calls

I built a skeleton of an iOS app that managed my calls such that I could choose to answer, decline or send to my chat bot

So it gets real data from all my regular calls, and in my state (1 party consent) I don’t need anyone’s permission to record every call. That data kicks off a fine-tuning run that can run locally overnight to improve my personal model

My plan was to use whisper and a local model with my voice clone, and it would talk with everyone I didn’t want to talk to, eventually to the point where I don’t ever have to talk with any person I don’t want to

I would pay you for a local way to do that, however I’d NEVER give you that data - but I’m sure plenty of people would
