Show HN: Gemma 4 Multimodal Fine-Tuner for Apple Silicon

https://github.com/mattmireles/gemma-tuner-multimodal

95•MediaSquirrel•3h ago

About six months ago, I started working on a project to fine-tune Whisper locally on my M2 Ultra Mac Studio with a limited compute budget. I got into it. The problem at the time was that I had 15,000 hours of audio data in Google Cloud Storage, and there was no way I could fit all of it onto my local machine, so I built a system to stream data from my GCS bucket to my machine during training.

Gemma 3n came out, so I added that. Kinda went nuts, tbh.

Then I put it on the shelf.

When Gemma 4 came out a few days ago, I dusted it off, cleaned it up, broke out the Gemma part from the Whisper fine-tuning and added support for Gemma 4.

I'm presenting it for you here today to play with, fork and improve upon.

One thing I have learned so far: It's very easy to OOM when you fine-tune on longer sequences! My local Mac Studio has 64GB RAM, so I run out of memory constantly.

Anywho, the reason this exists (in addition to my personal interest) is how much interest there is in Gemma 4 and, frankly, the fact that you can't really do audio fine-tuning with MLX. I would have preferred to use MLX and not have had to make this, but here we are. Welcome to my little side quest.

And so I made this. I hope you have as much fun using it as I had fun making it.

-Matt

Comments

dsabanin•3h ago

Thanks for doing this. Looks interesting, I'm going to check it out soon.

MediaSquirrel•2h ago

you are welcome! It was a fun side quest

craze3•3h ago

Nice! I've been wanting to try local audio fine-tuning. Hopefully it works with music vocals too

LuxBennu•2h ago

I run Whisper large-v3 on an M2 Max with 96GB, and even with just inference the memory gets tight on longer audio, so I can only imagine what fine-tuning looks like. Does 64GB vs 96GB make a meaningful difference for Gemma 4 fine-tuning, or does it just push the OOM wall back a bit? I've been wanting to try local fine-tuning on Apple Silicon, but the tooling gap has kept me on inference only so far.

MediaSquirrel•2h ago

With standard attention, memory usage increases roughly quadratically with sequence length, so using shorter sequences during fine-tuning keeps memory from exploding. On my 64GB RAM machine, I'm limited to input sequences of about 2,000 tokens, given that my average output for the fine-tuning task is around 1,000 tokens (~3k tokens total).
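
To make the scaling concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the project. It assumes a toy model where memory = fixed overhead + k * seq_len^exponent, and it extrapolates from the one data point above (about 3k tokens on 64GB) to the 96GB machine mentioned in the question. The function name, the fixed_gb parameter and the 30GB overhead figure are made up for illustration.

```python
def max_seq_len(known_len, known_mem_gb, new_mem_gb, fixed_gb=0.0, exponent=2.0):
    """Scale a known-good sequence length to a different memory budget.

    Toy model (an assumption, not a measurement):
        memory = fixed_gb + k * seq_len ** exponent
    fixed_gb stands in for weights, optimizer state and other costs that
    do not grow with sequence length.
    """
    if known_mem_gb <= fixed_gb or new_mem_gb <= fixed_gb:
        raise ValueError("memory budget must exceed the fixed overhead")
    # Ratio of the memory left over for sequence-dependent activations.
    ratio = (new_mem_gb - fixed_gb) / (known_mem_gb - fixed_gb)
    return int(known_len * ratio ** (1.0 / exponent))


# ~3k tokens fit on 64GB; what might fit on 96GB?
quadratic = max_seq_len(3000, 64, 96)               # pure quadratic growth
linear = max_seq_len(3000, 64, 96, exponent=1.0)    # pure linear growth
with_overhead = max_seq_len(3000, 64, 96, fixed_gb=30.0)  # hypothetical 30GB fixed cost

print(quadratic, linear, with_overhead)
```

Under pure quadratic growth, 1.5x the memory buys only about 1.22x the sequence length. Any fixed overhead pushes the estimate up, because all of the extra memory goes to activations, so the real number depends on the model size and training setup.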

LuxBennu•26m ago

Ah, that makes sense, quadratic scaling is brutal. So with 96GB I'd probably get somewhere around 4-5k total sequence length before hitting the wall, which is still pretty limiting for anything multimodal. Do you do any gradient checkpointing, or is that not worth the speed tradeoff at these sizes?

yousifa•2h ago

This is super cool, will definitely try it out! Nice work

pivoshenko•2h ago

nice!
