Show HN: Cancer diagnosis makes for an interesting RL environment for LLMs

46•dchu17•2mo ago

Hey HN, this is David from Aluna (YC S24). We work with diagnostic labs to build datasets and evals for oncology tasks.

I wanted to share a simple RL environment I built that gives frontier LLMs a set of tools for zooming and panning across a digitized pathology slide to find the regions relevant to a diagnosis. Here are some videos of an LLM performing diagnosis on a few slides:

(https://www.youtube.com/watch?v=k7ixTWswT5c): traces of an LLM choosing different regions to view before making a diagnosis on a case of small-cell carcinoma of the lung

(https://youtube.com/watch?v=0cMbqLnKkGU): traces of an LLM choosing different regions to view before making a diagnosis on a case of benign fibroadenoma of the breast

Why I built this:

Pathology slides are the backbone of modern cancer diagnosis. Tissue from a biopsy is sliced, stained, and mounted on glass for a pathologist to examine for abnormalities.

Today, many of these slides are digitized into whole-slide images (WSIs) in TIF or SVS format, and a single file can be several gigabytes in size.

While there exist several pathology-focused AI models, I was curious to test whether frontier LLMs can perform well on pathology-based tasks. The main challenge is that WSIs are too large to fit into an LLM’s context window. The standard workaround, splitting them into thousands of smaller tiles, is inefficient for large frontier LLMs.
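To make the scale concrete, here is a back-of-the-envelope sketch of how many tiles a single slide turns into. The slide dimensions and tile sizes are hypothetical round numbers chosen for illustration, not measurements from the slides in this post; real WSIs vary widely.

```python
import math


def tile_count(width_px: int, height_px: int, tile_px: int) -> int:
    """Number of non-overlapping square tiles needed to cover a slide."""
    return math.ceil(width_px / tile_px) * math.ceil(height_px / tile_px)


# Hypothetical full-resolution slide of 100,000 x 80,000 pixels.
print(tile_count(100_000, 80_000, tile_px=1024))  # 98 * 79 = 7742 tiles
print(tile_count(100_000, 80_000, tile_px=512))   # 196 * 157 = 30772 tiles
```

Even before filtering out blank background, sending each tile to a large model as a separate image adds up quickly.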

Inspired by how pathologists zoom and pan under a microscope, I built a set of tools that let LLMs control magnification and coordinates, viewing small regions at a time and deciding where to look next.
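As a rough illustration of what such zoom and pan tools could look like, here is a minimal sketch of the coordinate bookkeeping only. It is not the implementation from this post: the `Viewport`, `pan`, and `zoom` names, the power-of-two style downsample factor, and the 1024-pixel view size are all assumptions. Reading the actual pixels (for example with a slide-reading library) is left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Viewport:
    """A square view onto a slide, tracked in full-resolution pixel coordinates."""
    cx: int              # center x in full-resolution pixels
    cy: int              # center y in full-resolution pixels
    downsample: int      # 1 = full resolution; larger values = zoomed out
    view_px: int = 1024  # side length of the image handed to the model

    def region(self, slide_w: int, slide_h: int) -> tuple[int, int, int, int]:
        """Return (left, top, width, height) in full-resolution pixels, clamped to the slide."""
        span = self.view_px * self.downsample
        w, h = min(span, slide_w), min(span, slide_h)
        left = min(max(self.cx - w // 2, 0), slide_w - w)
        top = min(max(self.cy - h // 2, 0), slide_h - h)
        return left, top, w, h


def pan(v: Viewport, dx: float, dy: float, slide_w: int, slide_h: int) -> Viewport:
    """Move the center by a fraction of the current field of view, staying on the slide."""
    span = v.view_px * v.downsample
    cx = min(max(v.cx + int(dx * span), 0), slide_w - 1)
    cy = min(max(v.cy + int(dy * span), 0), slide_h - 1)
    return replace(v, cx=cx, cy=cy)


def zoom(v: Viewport, factor: float) -> Viewport:
    """factor > 1 zooms in, factor < 1 zooms out; never goes past full resolution."""
    return replace(v, downsample=max(1, round(v.downsample / factor)))


# Example: start zoomed out on a hypothetical 100,000 x 80,000 slide, then zoom in 4x.
overview = Viewport(cx=50_000, cy=40_000, downsample=64)
closer = zoom(overview, 4)
print(closer.region(100_000, 80_000))
```

Each function would be exposed to the model as a tool call, with the cropped region returned as the next image in the conversation.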

This resulted in some interesting behaviors and, with prompt engineering, seemed to yield pretty good results:

- GPT-5: explored up to ~30 regions before deciding (concurred with an expert pathologist on 4 out of 6 cancer subtyping tasks and 3 out of 5 IHC scoring tasks)

- Claude 4.5: typically used 10–15 views, with accuracy similar to GPT-5 (concurred with the pathologist on 3 out of 6 cancer subtyping tasks and 4 out of 5 IHC scoring tasks)

- Smaller models (GPT-4o, Claude 3.5 Haiku): examined ~8 frames and were less accurate overall (1 out of 6 cancer subtyping tasks and 1 out of 5 IHC scoring tasks)

Obviously, this was a small sample set, so we are working on creating a larger benchmark suite with more cases and types of tasks, but I thought it was cool that this even worked, so I wanted to share it with HN!
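To put a number on how small the sample is, here is a sketch that computes a 95% Wilson score interval for an agreement rate such as 4 out of 6. The choice of the Wilson interval is an assumption made for illustration; the post does not report any intervals.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


low, high = wilson_interval(4, 6)
print(f"4/6 agreement: 95% interval {low:.2f} to {high:.2f}")
```

For 4 out of 6 the interval runs from roughly 0.30 to 0.90, so differences between models at this sample size should be read as anecdotes rather than rankings.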

Comments

n2d4•2mo ago

How would a human classify the cancers? I assume the LLM training data does not include a whole bunch of cancer samples, so presumably there are some rules that it follows?

> While there exist several pathology-focused AI models

Would also be curious how the LLM compares to this and other approaches. What's the performance of the models trained specifically on this task, and random guessing, compared to the expert pathologist? Correct me if I'm wrong but this seems like the sort of task where being right 90% of the time is not good enough, so even if the LLM beats other approaches, it still needs to close the gap to human performance.

dchu17•2mo ago

> What's the performance of the models trained specifically on this task, and random guessing, compared to the expert pathologist?

I should probably first clarify here: the disease classification tasks are about subtyping the cancer (e.g., classifying a case as invasive ductal carcinoma of the breast) rather than just binary malignant/benign classification, so random guessing is much less likely to succeed, which makes the model performance more impressive.

> Would also be curious how the LLM compares to this and other approaches.

There aren't a lot of public general-purpose pathology benchmarks. There are some, like (https://github.com/sinai-computational-pathology/SSL_tile_be...), but they focus on just binary benign/malignant classification tasks and binary biomarker detection tasks.

I am currently working on self-hosting the available open-source models.

> this seems like the sort of task where being right 90% of the time is not good enough, so even if the LLM beats other approaches, it still needs to close the gap to human performance

Yep, your intuition is right here, and actually the expectation is probably closer to the mid-to-high 90s, especially for FDA approval (and most AI tools are positioned as co-pilots at the moment). There is obviously a long way to go, but what I find interesting about this approach is that it allows LLMs to generalize across (1) a variety of tissue types and (2) pathology tasks such as IHC H-scoring.
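For readers unfamiliar with the H-score mentioned above: it is conventionally computed by weighting the percentage of cells at each staining intensity (weak = 1, moderate = 2, strong = 3), giving a value between 0 and 300. The sketch below shows only that arithmetic; how the percentages are estimated from a slide, and the `h_score` function name, are assumptions and not part of this post.

```python
def h_score(pct_weak: float, pct_moderate: float, pct_strong: float) -> float:
    """H-score = 1 * %weak + 2 * %moderate + 3 * %strong, on a 0-300 scale."""
    parts = (pct_weak, pct_moderate, pct_strong)
    if min(parts) < 0 or sum(parts) > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    return 1 * pct_weak + 2 * pct_moderate + 3 * pct_strong


# Hypothetical case: 20% weak, 30% moderate, 10% strong, remaining 40% unstained.
print(h_score(20, 30, 10))  # 20 + 60 + 30 = 110
```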

ytrt54e•2mo ago

You should reach out to Eric Topol...

0xbeebs•2mo ago

very cool, have you tried some of the newer segmenting models to see if they make a difference? I've seen some in the past two weeks that look really effective...I wonder if it could help out the RL environment

dchu17•2mo ago

Nope I haven't, I can take a look and see if I can fit it in

Utkarsh_Mood•2mo ago

Do you think fine-tuning these LLMs would bring about results comparable to those of models trained specifically for this?

dchu17•2mo ago

I think so. It feels like there is more to be squeezed from just better prompts, but I was going to play around with fine-tuning Qwen3.

Utkarsh_Mood•2mo ago

fair enough. I wonder if fine-tuning over different modalities like IMC, H&E etc would help it generalize better across all

dchu17•2mo ago

Yeah I think one of the things that would be interesting is to see how well it generalizes across tasks. It seems like the existence of pathology foundation models means there is certainly a degree of generalizability (at least across tissues) but I am not too sure yet about generalizability across different modalities (there are some cool biomarker-prediction models though)

areoform•2mo ago

Did you fine-tune GPT 5, Sonnet 4.5, or any of the other models? Or were the models able to do this "out of the box"?

Utkarsh_Mood•2mo ago

none of the models you mentioned are open source...

Antitoxic6185•2mo ago

Doesn't mean they can't be fine-tuned.

https://learn.microsoft.com/en-us/azure/ai-foundry/openai/ho...

dchu17•2mo ago

Nope, I just did some prompt engineering on out-of-the-box models. I thought about fine-tuning something like Qwen, but I think there is still more performance to be squeezed out with just prompts here.

austinwang115•2mo ago

Wow this is pretty interesting. Excited to see the benchmark!

xrd•2mo ago

Fascinating stuff.

For some reason, this reminds me of the way video encoders compress video:

https://en.wikipedia.org/wiki/Video_compression_picture_type...

It makes me wonder if you could use a similar technique (I-frames, B-frames, or P-frames) to get the diff against a "normal" WSI and then train on recognizing patterns in those diffs.

These different frames are used to reduce network transmission costs, but it feels similar to the context window if you squint at it as a throughput problem rather than a context window size problem.

It feels like there would be a lot of tools and codecs you could leverage here.

dchu17•2mo ago

I've been thinking a bit more about better ways to build the tooling around it. To be fully transparent, I don't know much about video compression, but I will read up on it.

I have been running into some problems with memory management here, as each later frame needs some context from the previous frames (currently I just do something simple: pass the previous frame and the first reference frame into context). Maybe I can look into video compression and see if there is any inspiration there.
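The simple strategy described above (keep the first reference frame and the previous frame) could be sketched roughly as follows. The `Frame` type and `build_context` function are hypothetical names, and the text log of visited coordinates is an added assumption, included as one cheap way to remind the model where it has already looked.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    step: int
    x: int
    y: int
    downsample: int
    image: bytes  # encoded image bytes; empty placeholder in this sketch


def build_context(history: list[Frame]) -> dict:
    """Keep images for the first (reference) and most recent frames; summarize all frames as text."""
    if not history:
        return {"images": [], "visited": []}
    images = [history[0]]
    if len(history) > 1:
        images.append(history[-1])
    # Assumption: a short coordinate log stands in for the images that were dropped.
    visited = [
        f"step {f.step}: x={f.x}, y={f.y}, downsample={f.downsample}"
        for f in history
    ]
    return {"images": images, "visited": visited}


frames = [Frame(step=i, x=i * 1000, y=i * 500, downsample=16, image=b"") for i in range(5)]
context = build_context(frames)
print([f.step for f in context["images"]])  # [0, 4]
```

This keeps the image count per request constant regardless of how many regions have been explored, at the cost of losing the visual content of the intermediate frames.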

lawrencechen•2mo ago

I wonder if navigation plays a significant role in performance. If you just randomly select 15 frames (presumably with interesting pixels), will the model perform similarly well?

dchu17•2mo ago

Thought about this too. I think there are two broad LLM capabilities here that are kind of currently tangled up in this eval:

1. Can an LLM navigate a slide effectively (i.e., find all relevant regions of interest)?

2. Given a region of interest, can an LLM make the correct assessment?

I need to come up with a better test here in general but yep I'm thinking about this
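One way to start untangling the two capabilities is the random-frame baseline suggested in the question above: give the model the same number of views, but choose them at random from regions that contain tissue. The sketch below covers only the sampling step. The coarse boolean tissue mask and the `random_frames` name are assumptions; producing the mask (for example by thresholding a low-resolution thumbnail) is not shown.

```python
import random


def random_frames(tissue_mask: list[list[bool]], k: int = 15, seed: int = 0) -> list[tuple[int, int]]:
    """Pick up to k distinct (row, col) grid cells that contain tissue, reproducibly."""
    candidates = [
        (r, c)
        for r, row in enumerate(tissue_mask)
        for c, has_tissue in enumerate(row)
        if has_tissue
    ]
    rng = random.Random(seed)  # fixed seed so the baseline can be rerun on the same frames
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical 10 x 10 mask where a checkerboard of cells contains tissue.
mask = [[(r + c) % 2 == 0 for c in range(10)] for r in range(10)]
print(random_frames(mask, k=15))
```

If the model does about as well on random tissue frames as on frames it chose itself, navigation is probably not contributing much; a large gap would suggest it is.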

gwerbret•2mo ago

Interesting work. Some thoughts:

First, your business model isn't really clear, as what you've described so far sounds more like a research project than a go-to-market premise. Computational pathology is a crowded market, and the main players all have two things in common: access to huge numbers of labeled whole-slide images, and workflows designed to handle such images. Without the former, your project sounds like a non-starter, and given the latter, the idea you've pitched doesn't seem like an advantage. Notably, some of the existing models even have open weights (e.g. Prov-GigaPath, CTransPath).

Second, you've talked about using this approach to make diagnoses, but it's not clear exactly how this would be pitched as a market solution. The range of possible diagnoses is almost unlimited, so a useful model would need training data for everything (not possible). My understanding is that foundation models solve this problem by focusing on one or a few diagnoses in a restricted scope, e.g. prostate cancer in prostate core biopsies. The other approach is to screen for normal in clearly-defined settings, e.g. Pap smears, so that anything that isn't "normal" is flagged for manual review. Either approach, as you can see, demands a very different training and market positioning strategy.

Finally, do you have pathologists advising you, and have you done any sort of market analysis? Unless you're already a pathologist (and probably even if you were), I suspect that having both would be of immense value in deciding a go-forward plan.

All the best!

dchu17•2mo ago

Hi, thanks for the comment! Just wanted to respond to some of the comments here:

>> First, your business model isn't really clear, as what you've described so far sounds more like a research project than a go-to-market premise.

This is not really a core component of our business; it was more just something cool that I built and wanted to share!

>> Computational pathology is a crowded market, and the main players all have two things in common: access to huge numbers of labeled whole-slide images, and workflows designed to handle such images. Without the former, your project sounds like a non-starter, and given the latter, the idea you've pitched doesn't seem like an advantage. Notably, some of the existing models even have open weights (e.g. Prov-GigaPath, CTransPath).

We have partnerships with a few labs to get access to a large amount of WSIs, both H&E and IHC, but our core business really isn't building workflow tools for pathologists at the moment.

>> Second, you've talked about using this approach to make diagnoses, but it's not clear exactly how this would be pitched as a market solution. The range of possible diagnoses is almost unlimited, so a useful model would need training data for everything (not possible). My understanding is that foundation models solve this problem by focusing on one or a few diagnoses in a restricted scope, e.g. prostate cancer in prostate core biopsies.

I agree with you in that I don’t necessarily think this is really a market solution in its current state (it isn't even close to accurate enough), but I think the beauty of this solution is its general-purpose nature: it can work not only across tissue types but also across different pathology tasks, like IHC scoring along with cancer subtyping. The value of foundation models is in the fact that tasks can generalize. For example, part of what made this super interesting to me was the fact that general-purpose foundation models like GPT-5 are able to perform this super niche task at all! Obviously there are path-specific foundation models too that have their own ViT backbones, but it is pretty incredible that GPT-5 and Claude 4.5 can perform at this level already.

Yes, to the best of my knowledge, most FDA-approved solutions are point solutions, but I am not yet convinced this is the best way to deploy solutions in the long term. For example, there will always be rare diseases where there isn't enough of a market for a specialized solution, and in those cases, general-purpose models that can generalize to some degree may be crucial.
