Inspired by Karpathy's nanochat, I built a minimal, from-scratch implementation of a vision-language model for generating reports from medical images. Healthcare AI models are rarely fully open-source, which makes it hard for researchers to adapt them to their own data and raises the barrier to entry for newcomers to the field. echovlm is fully open-source, and it comes with a complete pipeline to train the model end-to-end on 120k publicly available imaging reports and matched synthetic videos, for just $5 in 2 hours. To demonstrate echovlm in practice, I've included an inference example that uses a real video of my own heart. I'm looking forward to researchers using this in their experiments and students using it to learn about VLMs for medicine, and I welcome pull requests from contributors.
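
For anyone curious what the core pattern looks like, here is a minimal PyTorch sketch of the usual VLM recipe: encode the video frames, project the features into the language model's embedding width, and generate text conditioned on that visual prefix. To be clear, this is an illustrative toy written for this post, not echovlm's actual architecture or API; every class and variable name below is invented for the example.

    import torch
    import torch.nn as nn

    class TinyVLM(nn.Module):
        # Toy stand-in for a video-conditioned report generator; not echovlm's code.
        def __init__(self, vocab_size=8192, d_model=256):
            super().__init__()
            # Per-frame encoder: a tiny conv stack standing in for a real video encoder.
            self.frame_encoder = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=4, stride=4),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
            )
            # Project visual features into the language model's embedding width.
            self.proj = nn.Linear(32, d_model)
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=2)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, video, tokens):
            # video: (B, T, 1, H, W) grayscale frames; tokens: (B, L) report token ids
            B, T = video.shape[:2]
            feats = self.frame_encoder(video.flatten(0, 1))       # (B*T, 32)
            prefix = self.proj(feats).view(B, T, -1)              # (B, T, d_model)
            x = torch.cat([prefix, self.tok_emb(tokens)], dim=1)  # visual prefix + text
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
            h = self.blocks(x, mask=mask)                         # causal self-attention
            return self.lm_head(h[:, T:])                         # logits over text positions only

    # Smoke test on random data.
    model = TinyVLM()
    video = torch.randn(2, 16, 1, 64, 64)     # 2 clips, 16 frames each
    tokens = torch.randint(0, 8192, (2, 32))  # 2 dummy report prefixes
    print(model(video, tokens).shape)         # torch.Size([2, 32, 8192])

The real repo has its own encoder, tokenizer, and training loop; the sketch only shows the data flow that makes a decoder-only LM "see" the video.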
vukadinovic•1h ago