Show HN: Gemma 4 Multimodal Fine-Tuner for Apple Silicon

https://github.com/mattmireles/gemma-tuner-multimodal

73•MediaSquirrel•2h ago

About six months ago, I started working on a project to fine-tune Whisper locally on my M2 Ultra Mac Studio with a limited compute budget. I got into it. The problem at the time was that I had 15,000 hours of audio data in Google Cloud Storage and no way to fit all of it onto my local machine, so I built a system to stream data from GCS to my machine during training.
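For readers curious what streaming training data from a bucket can look like, here is a minimal sketch, not the code from the repo. It assumes a hypothetical `fetch(key)` callable that returns the bytes of one remote object (with GCS this could wrap a client library download call), and it downloads a few items ahead on a background thread so the training loop is not left waiting on the network. The names `stream_samples`, `fetch`, and `prefetch` are made up for illustration.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator


def stream_samples(
    keys: Iterable[str],
    fetch: Callable[[str], bytes],
    prefetch: int = 4,
) -> Iterator[bytes]:
    """Yield fetch(key) for each key in order, downloading up to `prefetch` items ahead.

    `fetch` is a hypothetical stand-in for whatever pulls one object from
    remote storage; nothing here is specific to GCS.
    """
    buf: queue.Queue = queue.Queue(maxsize=prefetch)  # bounded, so local memory stays capped
    done = object()  # sentinel marking the end of the stream

    def worker() -> None:
        try:
            for key in keys:
                buf.put(fetch(key))  # blocks while the buffer is full
        except Exception as exc:
            buf.put(exc)  # hand the error to the consumer instead of dying silently
        finally:
            buf.put(done)

    # Daemon thread: if the consumer stops early, the worker will not keep the process alive.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buf.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

A single worker keeps samples in order; the bounded queue is what stops the whole dataset from piling up in RAM or on disk.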

Gemma 3n came out, so I added that. Kinda went nuts, tbh.

Then I put it on the shelf.

When Gemma 4 came out a few days ago, I dusted it off, cleaned it up, broke out the Gemma part from the Whisper fine-tuning and added support for Gemma 4.

I'm presenting it for you here today to play with, fork and improve upon.

One thing I have learned so far: It's very easy to OOM when you fine-tune on longer sequences! My local Mac Studio has 64GB RAM, so I run out of memory constantly.

Anywho, there's a lot of interest in Gemma 4, and frankly, you can't really do audio fine-tuning with MLX. That's really the reason this exists (in addition to my personal interest). I would have preferred to use MLX and not have had to make this, but here we are. Welcome to my little side quest.

And so I made this. I hope you have as much fun using it as I had fun making it.

-Matt

Comments

dsabanin•2h ago

Thanks for doing this. Looks interesting, I'm going to check it out soon.

MediaSquirrel•1h ago

you are welcome! It was a fun side quest

craze3•1h ago

Nice! I've been wanting to try local audio fine-tuning. Hopefully it works with music vocals too

LuxBennu•1h ago

I run whisper large-v3 on an m2 max 96gb and even with just inference the memory gets tight on longer audio, can only imagine what fine-tuning looks like. Does the 64gb vs 96gb make a meaningful difference for gemma 4 fine-tuning or does it just push the oom wall back a bit? Been wanting to try local fine-tuning on apple silicon but the tooling gap has kept me on inference only so far.

MediaSquirrel•1h ago

It mostly just pushes the OOM wall back. Memory usage increases quadratically with sequence length, so using shorter sequences during fine-tuning is what prevents memory explosions. On my 64GB RAM machine, I'm limited to input sequences of about 2,000 tokens, given that my average output for the fine-tuning task is around 1,000 tokens (~3k tokens total).
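To make the quadratic growth concrete, here is a back-of-the-envelope sketch. It assumes naive attention that materializes a full `seq_len x seq_len` score matrix per head, and it counts only that one term for a single layer (weights, optimizer state, and other activations are ignored). The head count and 2-byte element size are illustrative assumptions, not Gemma 4's actual configuration.

```python
def attention_score_bytes(
    seq_len: int,
    num_heads: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # assumes 16-bit floats
) -> int:
    """Bytes needed for one layer's attention score matrices under naive attention."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_elem


if __name__ == "__main__":
    # num_heads=8 is a made-up value for illustration only.
    for n in (1_000, 3_000, 6_000):
        mib = attention_score_bytes(n, num_heads=8) / 2**20
        print(f"{n:>6} tokens -> {mib:,.0f} MiB per layer for attention scores")
```

Doubling the sequence length quadruples this term, which is why trimming sequences helps more than a modest bump in RAM. Attention kernels that avoid materializing the full score matrix change this picture.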

yousifa•1h ago

This is super cool, will definitely try it out! Nice work

pivoshenko•56m ago

nice!
