Launch HN: RunRL (YC X25) – Reinforcement learning as a service

71•ag8•4mo ago

Hey HN, we’re Andrew and Derik at RunRL (https://runrl.com/). We've built a platform to improve models and agents with reinforcement learning. If you can define a metric, we'll make your model or agent better, without you having to think about managing GPU clusters.

Here's a demo video: https://youtu.be/EtiBjs4jfCg

I (Andrew) was doing a PhD in reinforcement learning on language models, and everyone kept...not using RL because it was too hard to get running. At some point I realized that someone's got to sit down and actually write a good platform for running RL experiments.

Once this happened, people started using it for antiviral design, formal verification, browser agents, and a bunch of other cool applications, so we decided to make a startup out of it.

How it works:

- Choose an open-weight base model (weights are necessary for RL updates; Qwen3-4B-Instruct-2507 is a good starting point)

- Upload a set of initial prompts ("Generate an antiviral targeting SARS-CoV-2 protease", "Prove this theorem", "What's the average summer high in Windhoek?")

- Define a reward function, using Python, an LLM-as-a-judge, or both

- For complex settings, you can define an entire multi-turn environment

- Watch the reward go up!
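To make the reward-function step above more concrete, here is a minimal sketch of a Python check combined with an optional LLM-as-a-judge score. Everything in it is an assumption for illustration: the function names, the signatures, the `judge` callable, and the placeholder expected value are hypothetical and are not RunRL's documented API.

```python
import re
from typing import Callable, Optional


def python_reward(completion: str, expected: float, tol: float = 1.0) -> float:
    """Rule-based check: 1.0 if the last number in the completion is within
    `tol` of `expected`, otherwise 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - expected) <= tol else 0.0


def combined_reward(
    completion: str,
    expected: float,
    judge: Optional[Callable[[str], float]] = None,  # hypothetical LLM judge
    judge_weight: float = 0.5,
) -> float:
    """Blend the rule-based score with a judge score clamped to [0, 1]."""
    rule = python_reward(completion, expected)
    if judge is None:
        return rule
    judged = min(1.0, max(0.0, judge(completion)))
    return (1.0 - judge_weight) * rule + judge_weight * judged


# Usage with a stub judge; expected=30.0 is a placeholder, not a checked figure.
answer = "The average summer high in Windhoek is about 30 C."
print(combined_reward(answer, expected=30.0, judge=lambda text: 0.5))
```

Keeping the rule-based part separate from the judge makes it easy to start with a cheap deterministic reward and add the judge only where the rule cannot capture quality.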

For most well-defined problems, a small open model + RunRL outperforms frontier models. (For instance, we've seen Qwen-3B do better than Claude 4.1 Opus on antiviral design.) This is because LLM intelligence is notoriously "spiky"; often models are decent-but-not-great at common-sense knowledge, are randomly good at a few domains, but make mistakes on lots of other tasks. RunRL creates spikes precisely on the tasks where you need them.

Pricing: $80/node-hour. Most models up to 14B parameters fit on one node (0.6-1.2 TB of VRAM). We do full fine-tuning, at the cost of parameter-efficiency (with RL, people seem to care a lot about the last few percent gains in e.g. agent reliability).
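As a back-of-the-envelope illustration of the pricing above, the sketch below multiplies node-hours by the quoted rate. It assumes billing is simply linear in nodes and hours, with no minimums or rounding, which the post does not specify; `estimate_cost` is a hypothetical helper.

```python
NODE_HOUR_USD = 80.0  # rate quoted in the post


def estimate_cost(hours: float, nodes: int = 1, rate: float = NODE_HOUR_USD) -> float:
    """Estimated run cost in USD, assuming linear billing (an assumption)."""
    if hours < 0 or nodes < 1:
        raise ValueError("hours must be >= 0 and nodes must be >= 1")
    return hours * nodes * rate


print(estimate_cost(2.5))         # one node for 2.5 hours
print(estimate_cost(3, nodes=2))  # two nodes for 3 hours
```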

Next up: continuous learning; tool use. Tool use is currently in private beta, which you can join here: https://forms.gle/D2mSmeQDVCDraPQg8

We'd love to hear any thoughts, questions, or positive or negative reinforcement!

Comments

nextworddev•4mo ago

Is there any credence to the view that these startups are basically dspy wrappers

-_-•4mo ago

DSPy is great for prompt optimization but not so much for RL fine-tuning (their support is "extremely EXPERIMENTAL"). The nice thing about RL is that the exact prompts don't matter so much. You don't need to spell out every edge case, since the model will get an intuition for how to do its job well via the training process.

nextworddev•4mo ago

Isn’t the latest trend in RL mostly about prompt optimization as opposed to full fine tuning

ag8•4mo ago

prompt optimization is very cool, and we use it for certain problems! The main goal with this launch is to democratize access to "the real thing"; in many cases, full RL allows you to get the last few percent in reliability for things like complex agentic workflows where prompt optimization doesn't quite get you far enough.

There are also lots of interesting possibilities, such as RLing a model on a bunch of environments and then prompt optimizing it on each specific one, which seems way better than, like, training and hot-swapping many LoRAs. In any case, _someone_ ought to provide a full RL API, and we're here to do that well!

nextworddev•4mo ago

Thanks. Is this mainly for verifiable tasks or any general task

-_-•4mo ago

There needs to be some way of automatically assessing performance on the task, though this could be with a Python function or another LLM as a judge (or a combination!)

ag8•4mo ago

It's for any task that has an "eval", which is often verifiable tasks or ones that can be judged by LLMs (e.g. see [0]). There's also been recent work such as BRPO [1] and similar approaches to make more and more "non-verifiable" tasks have verifiable rewards!

[0]: https://runrl.com/blog/funniest-joke

[1]: https://arxiv.org/abs/2506.00103

omneity•4mo ago

Perhaps less about DSPy, and rather about this: https://github.com/OpenPipe/ART

-_-•4mo ago

ART is also great, though since it's built on top of Unsloth it's geared towards single GPU QLoRA training. We use 8 H100s as a standard, so we can handle larger models and full-parameter fine-tunes.

omneity•4mo ago

Interesting, do you have benchmarks on FFT vs QLoRA for RL?

ag8•4mo ago

we should publish some; the high-order effect seems to be that LoRAs significantly hurt small model performance vs FFT, with less of an effect for large models. This is maybe because large models have more built-in skills and thus a LoRA suffices to elicit the existing skill, whereas for small models you need to do more actual learning (holding # parameter updates constant). In general I think it's better to get a performant small model with FFT than a performant large model with a large LoRA, which is why we default to FFT, but I agree that we should publish more details here.

omneity•4mo ago

Thanks! Personally I found FFT is not necessarily a strict improvement over (Q)LoRA as it can sometimes more easily lead to instability in the model, hence the bit of extra scrutiny.

Curious to see your thoughts and results whenever you get something out.

ripbozo•4mo ago

Was excited to see something about reinforcement learning as I'm working on training an agent to play a game, but apparently all reinforcement learning nowadays is for LLMs.

ag8•4mo ago

Yeah, for better or worse, the way the median startup interfaces with AI these days is through an LLM API, and that's what all the workflows are built around, so that's what we're targeting. Though, depending on what you're trying to do, I wouldn't discount the use of starting with a pretrained model—there was that famous result from 2022 that showed that pretraining a model on _Wikipedia_ made training on Atari games more than twice as efficient [0]; these days, LLMs have huge amounts of priors about the real world that make them great starting points for a surprisingly diverse set of tasks (e.g. see the chemistry example in our video!)

[0]: https://arxiv.org/abs/2201.12122

-_-•4mo ago

Have you heard of https://puffer.ai? Might fit your use case

3s•4mo ago

This is really neat! Didn’t realize it could be this simple to run RL on models. Quick question: How would I specify the reward function for tool use? or is this something you automatically do for me when I specify the available tools and their uses?

ag8•4mo ago

Thanks! Our goal is to make RL "just work" with completely automated GPU provisioning, algorithm selection, and SFT warm-up, while giving people the ability to switch away from the defaults if they want to.

The way tools currently work in the beta is you add tools via MCP to the configuration, and they get passed in as additional context for the model; the model might then choose to use a tool during inference; the tool is then automatically called and the output is returned as a tool message. If you really want to you could parse the tool output as part of reward calculation, but I expect you'd usually base the reward just on the model's completion. I could give more details if there's a specific tool setup you're envisioning!

-_-•4mo ago

To add to this, you can currently manually parse tool calls in your environment's step function, but we'll be rolling out a UI that makes this easier soon.
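For readers wondering what manually parsing tool calls in a step function could look like, here is a rough sketch. It assumes tool calls arrive as JSON inside `<tool_call>` tags and that the environment exposes a `step` method returning an observation, a reward, and a done flag; the tag format, the `ToolEnv` class, and the `score` callable are all hypothetical and not taken from RunRL's interface.

```python
import json
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Assumed wire format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> List[dict]:
    """Extract well-formed tool calls, skipping anything that is not valid JSON."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


@dataclass
class ToolEnv:
    tools: Dict[str, Callable[..., Any]]  # tool name -> Python callable
    score: Callable[[str], float]         # reward on the final completion
    max_turns: int = 4
    turns: int = 0

    def step(self, model_output: str) -> Tuple[str, float, bool]:
        """Run any requested tools; end the episode on a plain completion."""
        self.turns += 1
        calls = parse_tool_calls(model_output)
        if not calls or self.turns >= self.max_turns:
            return "", self.score(model_output), True
        results = []
        for call in calls:
            fn = self.tools.get(call["name"])
            if fn is None:
                results.append(f"error: unknown tool {call['name']}")
            else:
                results.append(str(fn(**call.get("arguments", {}))))
        return "\n".join(results), 0.0, False


env = ToolEnv(tools={"add": lambda a, b: a + b},
              score=lambda text: 1.0 if "5" in text else 0.0)
print(env.step('<tool_call>{"name": "add", "arguments": {"a": 2, "b": 3}}</tool_call>'))
print(env.step("The answer is 5"))
```

Here the reward is computed only on the final completion, matching the suggestion above that the reward would usually be based on the model's completion and not on the tool output.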

namibj•4mo ago

I'd love to see something that can RL an agent (of sorts) that interacts with an interactive theorem prover (like Lean4, Coq, or Isabelle/HOL), probably via a harness instead of plain shell-like interaction, and that actively exploits the fact that discovery itself is not harmful beyond the inference and oracle cost of investigating an abandoned branch.

I.e., it's not at all like a typical game, because at no point is "success rate without relying on rollback/savestate-reloading" something that actually matters. An agent that spends evenly on abandoned (exploratory) branches, and on the path that becomes part of the solution that the formal verifier checks to confirm, while having a near-100% solve rate for problems fed to the agent, is a VERY GOOD agent.

That's because this task, unlike most RL tasks, is one where the agent shall utilize discovery to log an interaction trace that can be trivially mechanically trimmed to a verifiable proof for the provided problem. I.e., the hard part is finding ANY path that solves, without spending exponential amounts of compute to brute force the problem over the bounded state size of practical relevance. Because that would be something that takes longer than the heat death of the universe: i.e., it's theoretically impractical.

Most RL tasks want an agent that is particularly good at its task; and while effort spent to find a proof is certainly something that matters (if only because lower cost means the agent can train on more instances with the same training budget), it's much less relevant than the solve rate itself (fraction of problems for which any verifiably-correct proof sequence can be found at some definable level of effort, expressed as e.g. number of shots, total compute budget for the instance, ratio of exploration nodes to those nodes that become part of the final proof sequence, etc.).
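The two quantities described above, solve rate and the ratio of exploration nodes to proof nodes, can be sketched in a few lines. The `Attempt` record and both helper functions are hypothetical names made up for this illustration; a real harness would log whatever its verifier reports.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Attempt:
    problem_id: str
    verified: bool       # did the formal verifier accept the trimmed proof?
    explored_nodes: int  # nodes visited, abandoned branches included
    proof_nodes: int     # nodes kept in the final proof (0 if unverified)


def solve_rate(attempts: List[Attempt]) -> float:
    """Fraction of problems with at least one verified attempt."""
    solved: Dict[str, bool] = {}
    for a in attempts:
        solved[a.problem_id] = solved.get(a.problem_id, False) or a.verified
    return sum(solved.values()) / len(solved) if solved else 0.0


def exploration_ratio(attempts: List[Attempt]) -> float:
    """Explored nodes per proof node, over verified attempts only."""
    verified = [a for a in attempts if a.verified and a.proof_nodes > 0]
    if not verified:
        return float("inf")
    return sum(a.explored_nodes for a in verified) / sum(a.proof_nodes for a in verified)


attempts = [
    Attempt("p1", False, 40, 0),
    Attempt("p1", True, 25, 5),
    Attempt("p2", False, 60, 0),
]
print(solve_rate(attempts), exploration_ratio(attempts))
```

Under this framing the first number is the one to optimize, and the second is mainly a cost signal.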

Considering that non-benchmark usage would mostly entail semi-synthetic crowd-sourced datasets that are open sub-instances from practical applications of formal verification, as well as more-synthetic instances from very coarse high-level questions (that get mechanically broken down into more-manageable chunks before the RL agent gets to work) like "given these more-specific rules of what is _actually_ UB and what is only UB in ANSI but actually defined in the specific toolchain that we use: does that C program over there contain ANY UB?" or "is there ANY way that input at that file/network-socket over there to that program over here could execute arbitrary code", there'd not be economic incentive to solve any given instance more than once, beyond what is necessary to make the RL training process itself stable.

That task also lends itself to semi-online learning as every supplied instance essentially pays once for a verified solution and the overall process should deliver solid ROI. Running a single GPU cluster/pile for both training and inference would allow higher utilization at the cost of running with some variable amount of latency between rolling out an episode and training on the completed episode's oracle-verified rewards.

ag8•4mo ago

Having an RL agent that's really good at search across some space sounds very powerful in general; "proofs-as-search" make this an appealing target. Back in the day, when I did more fundamental RL research, we worked on an extension of SoRB [0] where an additional meta-level target was learning improved heuristics to explore the search space faster; would be exciting to figure out what a good setup for doing things like this in LLM-policy-gradient world is these days!

[0]: https://arxiv.org/abs/1906.05253

papadiamantis9•4mo ago

Very neat! A) If I want to have a different grading rubric per example (and grade with an LLM as a judge), do I do this through the reward function? B) What's the pricing on the deployed API? (Is it per token?)

ag8•4mo ago

A) You could have an additional field in the jsonl file which says which rubric to use; then, your reward function could access this via `kwargs["rubric"]` and return a reward based on that example's preferred rubric;

B) currently, pricing on the deployed API is free, but the startup time is a few minutes and it's run on a small GPU node and is therefore not awfully fast. If you would like more production-level inference, email us at founders@runrl.com and we could set you up with something much faster (where we'd charge per token depending on model size)
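To illustrate answer A, here is a minimal sketch of a reward function that picks a per-example rubric from `kwargs["rubric"]`. The rubric names, the `RUBRICS` table, the `make_reward` factory, and the `judge` callable are assumptions for illustration; only the `kwargs["rubric"]` access comes from the reply above. Each JSONL line would carry an extra field, e.g. `{"prompt": "...", "rubric": "concise"}`.

```python
from typing import Callable

# Hypothetical rubric texts, keyed by the value of the "rubric" field.
RUBRICS = {
    "concise": "Reward brevity: one short, direct answer.",
    "rigorous": "Reward complete, step-by-step justification.",
}


def make_reward(judge: Callable[[str, str], float]) -> Callable[..., float]:
    """Build a reward function around a judge(rubric_text, completion) callable."""

    def reward(completion: str, **kwargs) -> float:
        # Fall back to "concise" when the example has no rubric field.
        rubric_text = RUBRICS[kwargs.get("rubric", "concise")]
        score = float(judge(rubric_text, completion))
        return max(0.0, min(1.0, score))  # clamp to [0, 1]

    return reward


# Stub standing in for an LLM judge call.
def stub_judge(rubric_text: str, completion: str) -> float:
    return 1.0 if "brevity" in rubric_text and len(completion.split()) <= 5 else 0.3


reward = make_reward(stub_judge)
print(reward("Yes.", rubric="concise"))
print(reward("Yes.", rubric="rigorous"))
```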
