Launch HN: BlankBio (YC S25) – Making RNA Programmable

63•antichronology•5mo ago

Hey HN, we're Phil, Ian and Jonny, and we're building BlankBio (https://blank.bio). We're training RNA foundation models to power a computational toolkit for therapeutics. The first application is in mRNA design where our vision is for any biologist to design an effective therapeutic sequence (https://www.youtube.com/watch?v=ZgI7WJ1SygI).

BlankBio started from our PhD work in this area, which is open-sourced. There’s a model [2] and a benchmark with API access [0].

mRNA has the potential to encode vaccines, gene therapies, and cancer treatments. Yet designing effective mRNA remains a bottleneck. Today, scientists design mRNA by manually editing sequences AUGCGUAC... and testing the results through trial and error. It's like writing assembly code and managing individual memory addresses. The field is flooded with capital aimed at therapeutics companies: Strand ($153M), Orna ($221M), Sail Biomedicines ($440M) but the tooling to approach these problems remains low-level. That’s what we’re aiming to solve.

The big problem is that mRNA sequences are incomprehensible. They encode properties like half-life (how long RNA survives in cells) and translation efficiency (protein output), but we don't know how to optimize them. To get effective treatments, we need more precision. Scientists need sequences that target specific cell types to reduce dosage and side effects.

We envision a future where RNA designers operate at a higher level of abstraction. Imagine code like this:

  seq = "AUGCAUGCAUGC..."
  seq = BB.half_life(seq, target="6 hours")
  seq = BB.cell_type(seq, target="hepatocytes")
  seq = BB.expression(seq, level="high")

To get there we need generalizable RNA embeddings from pre-trained models. During our PhDs, Ian and I worked on self-supervised learning (SSL) objectives for RNA. This approach allows us to train on unlabeled data and has advantages: (1) we don't require noisy experimental data, and (2) the amount of unlabeled data is significantly greater than labeled. However the challenge is that standard NLP approaches don't work well on genomic sequences.

Using joint embedding architecture approaches (contrastive learning), we trained a model to recognize functionally similar sequences rather than predict every nucleotide. This worked remarkably well. Our 10M parameter model, Orthrus, trained on 4 GPUs for 14 hours, beats Evo2, a 40B parameter model trained on 1000 GPUs for a month [0]. On mRNA half-life prediction, just by fitting a linear regression on our embeddings, we outperform supervised models. This work, done during our academic days, is the foundation for what we're building. We're improving training algorithms, growing the pre-training dataset, and making use of parameter scaling with the goal of designing effective mRNA therapeutics.
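
To make the contrastive idea concrete, here is a minimal sketch of an InfoNCE-style loss over paired embeddings. It is a generic illustration in plain Python, not necessarily the objective used to train Orthrus; the function names and the random toy vectors are assumptions made up for the example.

```python
import math
import random

def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE loss: row i of `anchors` should match row i of `positives`.

    Every other row in the batch acts as a negative example.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature for other in p]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)

# Toy data: each "view" is a slightly perturbed copy of its base vector,
# standing in for two functionally similar sequences.
random.seed(0)
base = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
views = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in base]

loss_matched = info_nce_loss(base, views)
loss_mismatched = info_nce_loss(base, views[::-1])  # pairs deliberately scrambled
print(loss_matched, loss_mismatched)
```

The loss is low when each embedding sits closest to its own partner and high when the pairing is scrambled, which is the signal that pushes functionally similar sequences together without predicting individual nucleotides.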

We have a lot to say about why other SSL approaches work better than next-token prediction and masked language modeling, some of which you can check out in Ian's blog post [1] and our paper [2]. The big takeaway is that the current approaches of applying NLP to scaling models for biological sequences won't get us all the way there. 90% of the genome can mutate without affecting fitness, so training models to predict this noisy sequence results in suboptimal embeddings [3].

We think there are strong parallels between the digital and RNA revolutions. In the early days of computing, programmers wrote assembly code, managing registers and memory addresses directly. Today's RNA designers are manually tweaking sequences, improving stability or reducing immunogenicity through trial and error. As compilers freed programmers from low-level details, we're building the abstraction layer for RNA.

We currently have pilots with a few early-stage biotechs proving out the utility of our embeddings, and our open source model is used by folks at Sanofi & GSK. We're looking for: (1) partners working on RNA-adjacent modalities, (2) feedback from anyone who's tried to design RNA sequences (what were your pain points?), and (3) ideas for other applications! We chatted with some biomarker-providing companies, and some preliminary analyses demonstrate improved stratification.

Thanks for reading. Happy to answer questions about the technical approach, why genomics is different from language, or anything else.

- Phil, Ian, and Jonny

founders@blankbio.com

[0] mRNABench: https://www.biorxiv.org/content/10.1101/2025.07.05.662870v1

[1] Ian’s Blog on Scaling: https://quietflamingo.substack.com/p/scaling-is-dead-long-li...

[2] Orthrus: https://www.biorxiv.org/content/10.1101/2024.10.10.617658v3

[3] Zoonomia: https://www.science.org/doi/10.1126/science.abn3943

Comments

anyg•5mo ago

How are the RNA sequences used? Are there any clinical trials running?

antichronology•5mo ago

There are a number of different technologies. Some of the big ones are:

- mRNA therapies: These therapies deliver a synthetically created messenger RNA (mRNA) molecule, typically protected within a lipid nanoparticle (LNP), to a patient's cells. The cell's own machinery then uses this mRNA as a temporary blueprint to produce a specific protein.

The big example here is CAR-T therapy from Capstan, which just got acquired for $2.1B. Their asset, CPTX2309, is currently in Phase 1. Previously, to do CAR-T therapy you had to extract a patient's T cells and genetically engineer them in a special facility. Now the mRNA gets delivered directly to the patient's T cells, which significantly lowers the cost and technical hurdles.

- RNA interference (RNAi): Used for gene expression knockdown through natural cellular mechanisms for viral detection. The big example here is Alnylam with 5 approved therapies and a number in clinical trials.

- Antisense Oligonucleotides (ASOs): Short single-stranded RNA molecules that get delivered directly to the cell and target an existing mRNA. The big win here is Spinraza, the first approved treatment for Spinal Muscular Atrophy (SMA), which previously didn't have a treatment. The Spinraza clinical trial (ENDEAR) was so effective that they deemed it unethical to continue it because the control arm wasn't receiving the treatment. Prior to Spinraza, most patients would pass away before two years of age.

tennysont•5mo ago

Fun to see talk of "a compiler for DNA"---I've been hoping for that for a long time.

I have to admit, at a _glance_ this feels like a promising idea with few results and lots of marketing. I'll try to be clear about my confusion, feel free to explain if I'm off base.

- There's not a lot of talk of your "ground truth" for evaluations. Are you using mRNABench?

- Has your mRNABench paper been peer reviewed? You linked a preprint. (I know paper submission can be tough or stressful, and it's a superficial metric to be judged on!)

- Do any of your results suggest that this foundation model might be any good on out of sequence mRNA sequences? If not, then is the (current) model supposed to predict properties of natural mRNA sequences rather than of synthetic mRNA sequences?

- Did a lot of mRNA sequences have experimental verification of their predicted properties? At a quick glance, I see this 66 number in the paper---but I truly have no idea.

I'm super happy to praise both incremental progress and putting forth a vision, I just also want to have a clear understanding of the current state-of-the-art as well!

antichronology•5mo ago

> ground truth

Hey yes, the ground truth for our evaluations is measured experimental data. Our models are benchmarked using mRNABench, which aggregates results from high-throughput wet lab experiments.

Our goal, however, is to move beyond predicting existing experimental outcomes. We intend to design novel sequences and validate their function in our own lab. At that stage, the functional success of the RNA we design will become the ground truth.

> peer reviewed?

Both mRNABench and Orthrus are in submission (at a big ML conference and a big name journal) - unfortunately the academic systems move slowly but we're working on getting them out there.

> synthetic mRNA sequences

I think you're asking about generalizing out of distribution to unnatural sequences. There are two ways that we do this: (1) There are these screens called Massively Parallel Reporter Assays (MPRAs) and we eval, for example, on https://pubmed.ncbi.nlm.nih.gov/31267113/

Here all the sequences are synthetic and randomly designed and we do observe generalization. Ultimately it depends on the problem that we're tackling: some tasks like gene therapy design require endogenous sequences.

(2) The other angle is variant effect prediction (VEP). It can be thought of as a counterfactual prediction problem where you ask the model whether a small change in the input predicts a large change in the output. This is a good example of the study (https://www.biorxiv.org/content/10.1101/2025.02.11.637758v2)
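
As a rough illustration of the counterfactual framing, the sketch below scores every single-nucleotide variant of a reference sequence as the change in a predictor's output. The predictor here is a deliberately trivial stand-in (GC fraction) and all names are hypothetical; a real setup would swap in a trained model's prediction.

```python
def toy_score(sequence):
    """Stand-in predictor (assumption): fraction of G/C bases in the sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)

def variant_effect(predict, reference, position, alt_base):
    """Predicted change in output when one base of the reference is swapped."""
    mutant = reference[:position] + alt_base + reference[position + 1:]
    return predict(mutant) - predict(reference)

def scan_single_nucleotide_variants(predict, reference, alphabet="ACGU"):
    """Score every possible single-base substitution of the reference."""
    effects = {}
    for position, ref_base in enumerate(reference):
        for alt_base in alphabet:
            if alt_base != ref_base:
                key = (position, ref_base, alt_base)
                effects[key] = variant_effect(predict, reference, position, alt_base)
    return effects

reference = "AUGCGUAC"
effects = scan_single_nucleotide_variants(toy_score, reference)

# Rank variants by the size of their predicted effect, largest first.
ranked = sorted(effects.items(), key=lambda item: abs(item[1]), reverse=True)
for (position, ref_base, alt_base), delta in ranked[:3]:
    print(f"{ref_base}{position + 1}{alt_base}: {delta:+.3f}")
```

A benchmark then compares these predicted deltas against measured effects of the same variants.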

> experimental verification of their predicted properties

All our model evaluations are predictions of experimental results! The datasets we use are collections of wet lab measurements, so the model is constantly benchmarked against ground-truth biology.

The evaluation method involves fitting a linear probe on the model's learned embeddings to predict the experimental signal. This directly tests whether the model's learned representation of an RNA sequence contains a linear combination of features that can predict its measured biological properties.
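
For readers unfamiliar with linear probing, here is a minimal sketch in plain Python: the embeddings are treated as fixed features and only a linear map (plus bias) is fit to the measured signal. The embeddings and "half-life" values below are synthetic, and every function name is an assumption for illustration, not the mRNABench API.

```python
import random

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_linear_probe(embeddings, targets, ridge=1e-6):
    """Least-squares fit on frozen embeddings; the last weight is the bias."""
    X = [row + [1.0] for row in embeddings]
    d = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    Xty = [sum(r[i] * t for r, t in zip(X, targets)) for i in range(d)]
    return solve_linear_system(XtX, Xty)

def predict(weights, embedding):
    return sum(w * x for w, x in zip(weights, embedding + [1.0]))

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins: 4-dim "embeddings" and a noisy linear "half-life" signal.
random.seed(0)
true_weights = [0.5, -1.0, 2.0, 0.0]

def fake_embedding():
    return [random.gauss(0, 1) for _ in range(4)]

def fake_half_life(embedding):
    signal = sum(w * x for w, x in zip(true_weights, embedding))
    return signal + 3.0 + random.gauss(0, 0.1)

train_x = [fake_embedding() for _ in range(200)]
train_y = [fake_half_life(e) for e in train_x]
test_x = [fake_embedding() for _ in range(50)]
test_y = [fake_half_life(e) for e in test_x]

weights = fit_linear_probe(train_x, train_y)
test_r2 = r_squared(test_y, [predict(weights, e) for e in test_x])
print(f"held-out R^2: {test_r2:.3f}")
```

Because the probe has so little capacity, a high held-out score mostly reflects what the embeddings already encode rather than what the probe learned.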

Thanks for the feedback. I understand the caution around pre-prints. We believe a self-supervised learning approach is well-suited for this problem because it allows the model to first learn patterns from millions of unlabeled sequences before being fine-tuned on specific, and often smaller, experimental datasets.

andy99•5mo ago

> mRNABench

Just curious, in other areas of ML, I think it's widely acknowledged that benchmarks have pretty limited real world value, just end up getting saturated, and (my view) are all pretty correlated, regardless of their ostensible speciality and don't really tell you that much.

Do you think mRNABench is different, or where do you see the limitations? Do you imagine this or any benchmark will be useful for anything beyond comparing how different models do on the benchmark?

antichronology•5mo ago

I watched an interview with one of the co-founders of Anthropic where his point is that although benchmarks saturate they're still an important signal for model development.

We think the situation is similar here - one of the challenges is aligning the benchmark with the function of the models. Genomic benchmarks for gLMs and RNA foundation models have been very resistant to saturation.

I think in NLP the problem is that they are victims of their own success where the models can be overfit to particular benchmarks really fast.

In genomics we're a bit behind. A good paper on this is DartEval where they provide levels of complexity https://arxiv.org/abs/2412.05430

In RNA the models work much better than DNA prediction, but it's key to have benchmarks to measure progress.

antichronology•5mo ago

Here is the link for benchmarks and their utility: https://youtu.be/JdT78t1Offo?t=1444

"We have internal benchmarks. Yeah. But we don't we don't publish them."

"we have internal benchmarks that the team focuses on and improving and then we also have a bunch of tasks like I think that accelerating our own engineers is like a top top priority for us"

The equivalent for us would be ultimately looking to improve experimental results. Benchmarks are a good intermediate point but not the ultimate goal.

pjsample•5mo ago

Hi, I'm the lead author of the human 5' UTR paper. It was a nice surprise seeing it linked on HN and I'm happy to see that it's providing value for you all. Looking forward to watching your team's progress!

antichronology•5mo ago

Huge fan of the work! I'm a big fan of papers from Seelig lab :)

tennysont•5mo ago

Thank you for writing out these answers! Your patience was noticed and appreciated.

It feels like things are further ahead in synthetic biology than I realized and that's so so so exciting!

(yes, I meant "out of distribution"---but in today's day 'n age typos are proof of human creation :p )

mfld•5mo ago

Maybe another application could be the ranking of candidate variants for cancer immunotherapy? As far as I know, lncRNAs are sometimes assessed.

antichronology•5mo ago

We haven't looked into this deeply yet, but it sounds interesting. Do you have any resources on where to start looking at this? Feel free to reach out to us.

founders@blankbio.com

carlsborg•5mo ago

Cool. Could we train a "potential oncoprotein" classifier on Orthrus embeddings? IMO self serve diagnosis and detection is a far larger market than synthesis.

antichronology•5mo ago

This is a really interesting direction. There is this big field of cell-free RNA (cfRNA) cancer detection. We talked to a few people in the field and think that embedding sequences for this direction could be really valuable. One challenge here is that it's hard to set up evaluation tasks since the public data is scarce.

carlsborg•5mo ago

Maybe we can crowd source data. My platform, currently in beta, has AI assistants for compute infrastructure and biology and will soon let people do self serve research on their own omics data using models like yours. So there could be a monetization path too if enough people start looking at their own cell data (which they might once they fully understand the risks of engineered pathogens, and certainly will when the risks materialize and start hitting home). Email in bio if you want to brainstorm.

antichronology•5mo ago

That would be really cool. Navigating SRA and mining out reasonable $ relevant tasks is a huge bottleneck.

I find it takes a large amount of effort to parse what the authors are doing, whether the data is high quality, and how to pre-process it in a way that makes sense for the task at hand.

Would love to chat more about how you're thinking of evaluating quality of these agents.

westurner•5mo ago

The other day I paired an article on pyroptosis caused by marine spongiibacter exopolysaccharide and an mRNA Cancer vaccine article. I started to just forward the article on bacterially-induced pyroptosis to the cancer vaccine researchers but stopped to ask an LLM whether the approaches shared common pathways or mechanisms of action and - fish my wish - they are somehow similar and I had asked a very important question that broaches a very active area of research.

How would your AI solution help with finding natural analogs of or alternatives to or foils of mRNA procedures?

westurner•5mo ago

Can EPS3.9 cause pyroptosis cause IFN-I cause epitope spreading for cancer treatment?

Re: "Sensitization of tumours to immunotherapy by boosting early type-I interferon responses enables epitope spreading" (2025) https://www.nature.com/articles/s41551-025-01380-1

How is this relevant to mRNA vaccines?:

"Ocean Sugar Makes Cancer Cells Explode" (2025) https://scitechdaily.com/ocean-sugar-makes-cancer-cells-expl... ... “A Novel Exopolysaccharide, Highly Prevalent in Marine Spongiibacter, Triggers Pyroptosis to Exhibit Potent Anticancer Effects” (2025) DOI: 10.1096/fj.202500412R https://faseb.onlinelibrary.wiley.com/doi/10.1096/fj.2025004...

antichronology•5mo ago

This is really interesting - I'm going to be honest I'm not an immunologist so this is my (LLM assisted) understanding of your comment:

The immune system recognizes a sugar as a PAMP, or Pathogen-Associated Molecular Pattern, which is a signature of a potential microbial threat.

This initiates pyroptosis, an inflammatory form of programmed cell death, causing the cell to burst. This rupture releases tumor antigens and DAMPs (Damage-Associated Molecular Patterns), which are "danger signals" from the dying cell.

The release of DAMPs shifts the Tumor Microenvironment (TME) from an immunologically "cold" to a "hot" state, promoting a potent Type I Interferon (IFN-I) response.

This response recruits Antigen Presenting Cells (APCs), which engulf the newly released tumor antigens.

---

mRNA vaccines are somewhat of a parallel approach where the antigen selection and delivery happens manually. An mRNA vaccine delivers the encoding sequence for specific tumor antigens to drive production and presentation, training the immune system. One of the big challenges of this space is optimal antigen selection from the patient's tumor.

One thing I'm not fully clear on is why only tumor cell react to PAMP instead of healthy cells. Could be a promising approach but molecular biology is pretty tricky and the devil is always in the details.

JPLeRouzic•5mo ago

> "why only tumor cell react to PAMP instead of healthy cells"

I am not a scientist, but I believe that "normal" cells do not seek long-chain alien sugars like those produced by ocean bacteria. Conversely, "cancerous" cells may find these uncommon sugars appealing, and they consume sugar eagerly (Warburg effect).

After the alien sugars are metabolized, fragments migrate to the cell membrane and might be recognized by the immune system as foreign.

The fact that large molecules trigger Pyroptosis may be helpful.

forgotpwagain•5mo ago

I am totally onboard with the premise (as a TechBio-adjacent person), and some of the approaches you're taking (focused domain-specific models like Orthrus, rather than massive foundation models like Evo2).

I'm curious about what your strategy is for data collection to fuel improved algorithmic design. Are you building out experimental capacity to generate datasets in house, or is that largely farmed out to partners?

antichronology•5mo ago

We think that Orthrus can be applied in a bunch of ways to non-coding and coding RNA sequences but it's definitely fair we're a bit more focused on RNA sequences currently instead of non-coding parts of the genome like promoters and intergenic sequences.

For the data - Orthrus is trained on non experimentally collected data so our pre-training dataset is large by biological standards. It adds up to about 45 million unique sequences and assuming 1k tokens per sequence it's about 50b tokens.
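
The back-of-the-envelope token count works out as follows (the 1k tokens per sequence is the rough average assumed above, not a measured figure):

```python
unique_sequences = 45_000_000   # about 45 million unique sequences
tokens_per_sequence = 1_000     # assumed average length in tokens

total_tokens = unique_sequences * tokens_per_sequence
print(f"{total_tokens / 1e9:.0f}B tokens")  # 45B, i.e. roughly 50B
```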

We're thinking about this as a large pre-training run on a bunch of annotation data from RefSeq and GENCODE in conjunction with more specialized orthology datasets that pool data across 100s of species.

Then for specific applications we are fine tuning or doing linear probing for experimental prediction. For example we can predict half life using publicly available data collected by the awesome paper from: https://genomebiology.biomedcentral.com/articles/10.1186/s13...

Or translation efficiency: https://pubmed.ncbi.nlm.nih.gov/39149337/

Eventually, as we ramp up our wet lab data generation, we're thinking about what post-training looks like. There is an RL analog here that we can use on these generalizable embeddings to demonstrate "high quality samples".

There are some early attempts at post-training in bio and I think it's a really exciting direction

forgotpwagain•5mo ago

Thanks for the response! This is very cool and sounds like a reasonable plan. Best of luck!

simianparrot•5mo ago

Literally the stuff of nightmares. Why are we doing this?

> As compilers freed programmers from low-level details, we're building the abstraction layer for RNA.

That’s all fun and games when it’s literally fun and games. When it’s mRNA injected into living beings it’s the stuff of nightmares.

Will technologists ever _ever_ stop and think for a second?

antichronology•5mo ago

Thanks for engaging. Do you mind elaborating on your stance?

From where we sit - there are people with diseases and mRNA is an effective way to revert them to a healthy state.

I'd be interested to hear more where you're coming from

gus_massa•5mo ago

I don't expect the new mRNA to be injected in humans directly. I guess it will be tried in mice, then in other animals, then in a few humans, then in a bigger group, and then made available to the general public. (Or something like that, I don't remember the details, IANAMD.)

It's the usual process for all new medicines. We already had a lot of bad cases with other potential medicines, so all new candidates must pass a lot of tests.

jryb•5mo ago

mRNA is a natural molecule produced by every living organism. All it does is cause a protein to be made. Where's the nightmare?

kylehotchkiss•5mo ago

Fascinating platform. I'm fairly new in my bio education but are you effectively finding sequences on NIH and then intelligently chaining them together?

I had some fun one evening asking Claude how I could string together sequences for an imaginary therapeutic and it gave me enough to put into alphafold and get a render :) (Worst therapeutic ever: deliver mRNA into macrophages to target those pesky bacteria who happily just choose to reside there)

Also: How do you plan to navigate the unfortunate part of our country trying to write mRNA out of the American vocabulary?

antichronology•5mo ago

> finding sequences on NIH

Almost! Yes most of the data is on NIH sub-institutes. For us we take most of the data from NCBI and intelligently pair it together. The training objective of our model takes pairs of sequences (thus the Joint Embedding Architecture) and trains the model to recognize that they are semantically similar but differ in appearance. This is conceptually similar to a lot of the contrastive learning literature from computer vision.

Sounds like a fun side project :)

There are some great tools out there for putting together plasmids for gene therapies where you can plug in different "elements": promoters, UTRs, payloads. Check out SnapGene; I believe they have a free version.

I personally am hopeful that the political headwinds will blow over. When it comes to cancer vaccines it's one of the most exciting new modalities for treating cancer.

1 in 2 Americans are going to get cancer in their lifetime so no matter political affiliation, the need for health will ultimately drive people to invest in the modality.
