I previously experimented with Grounding DINO and SAM3. While they are amazing for generic objects, I found they struggle with specific semantic requests (e.g. specific manufacturing parts, game characters, or distinguishing "a worker" from "a worker without a helmet").
I discovered that Gemini 3 Pro is surprisingly underrated for bounding box tasks if you prompt it with detailed visual descriptions. It handles semantic understanding significantly better than standard zero-shot detectors.
url: yoloforge.com
The Workflow:
1. Upload a zip of raw images (stored in Cloudflare R2).
2. Describe your class or classes in plain English.
3. The system generates a .jsonl batch file and sends it to the Gemini Batch API, which lets us process thousands of images in parallel at 50% of the standard cost.
4. You review/correct boxes in the UI and export a YOLO train/val/test dataset.
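To make the batch step concrete, here is a minimal sketch of generating the .jsonl file. The request shape follows the Gemini Batch API's JSONL format as I understand it; the prompt wording, the `image-{i}` keys, and the assumption that images are already uploaded and addressable by URI are all illustrative, not the exact production code:

```python
import json

def build_batch_jsonl(image_uris, class_description):
    """Build one JSONL request line per image for the Gemini Batch API.

    Assumes each `file_uri` points at an image that has already been
    uploaded (e.g. via the Files API); the prompt template is a sketch.
    """
    prompt = (
        "Detect every instance of the following class(es): "
        f"{class_description}. Return a JSON array of objects with "
        '"label" and "box_2d" ([ymin, xmin, ymax, xmax], normalized 0-1000).'
    )
    lines = []
    for i, uri in enumerate(image_uris):
        record = {
            "key": f"image-{i}",  # used to match responses back to images
            "request": {
                "contents": [{
                    "parts": [
                        {"file_data": {"file_uri": uri,
                                       "mime_type": "image/jpeg"}},
                        {"text": prompt},
                    ]
                }]
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Because each line is an independent request keyed by image, a Celery worker can upload the whole file once and later join results back to images without extra bookkeeping.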
Technical Challenges:
One hard part was getting valid JSON out of the LLM consistently. I ended up writing a parser with regex fallback strategies that salvages valid bounding boxes from malformed responses.
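The idea can be sketched like this (a simplified stand-in for the production parser, assuming the response format is a JSON array of `{"label": ..., "box_2d": [ymin, xmin, ymax, xmax]}` objects and that keys appear in that order):

```python
import json
import re

# Matches one well-formed box object even when the surrounding JSON
# is truncated or otherwise malformed. Assumes "label" precedes "box_2d".
BOX_RE = re.compile(
    r'"label"\s*:\s*"([^"]+)"\s*,\s*'
    r'"box_2d"\s*:\s*\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]'
)

def salvage_boxes(raw: str):
    """Parse model output, falling back to regex extraction on bad JSON."""
    # Strip markdown fences the model sometimes wraps around its JSON.
    cleaned = re.sub(r"^```(?:json)?|```$", "", raw.strip(), flags=re.MULTILINE)
    try:
        data = json.loads(cleaned)
        if isinstance(data, list):
            return [(d["label"], d["box_2d"]) for d in data
                    if "label" in d and "box_2d" in d]
    except (json.JSONDecodeError, TypeError):
        pass
    # Fallback: salvage whatever complete box objects exist in the raw text.
    return [(m.group(1), [int(m.group(i)) for i in range(2, 6)])
            for m in BOX_RE.finditer(raw)]
```

The key property is that a response truncated mid-array still yields every complete box before the cut, instead of failing the whole image.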
The Stack:
- Frontend: Next.js
- Backend: FastAPI, Celery (for async zip processing and polling the Batch API), Redis
- Storage: Supabase (Auth/DB), Cloudflare R2 (image storage)
- Model: Google Gemini 3 Pro via Batch API
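For the export step, the conversion from Gemini-style boxes to YOLO label lines is simple enough to show in full. This sketch assumes boxes arrive as `[ymin, xmin, ymax, xmax]` normalized to 0-1000 (Gemini's documented convention), which conveniently means no image dimensions are needed:

```python
def box_2d_to_yolo(box_2d, class_id):
    """Convert a [ymin, xmin, ymax, xmax] box on a 0-1000 scale to a
    YOLO label line: "<class_id> <x_center> <y_center> <width> <height>",
    all coordinates normalized to 0-1."""
    ymin, xmin, ymax, xmax = (v / 1000.0 for v in box_2d)
    x_center = (xmin + xmax) / 2
    y_center = (ymin + ymax) / 2
    width = xmax - xmin
    height = ymax - ymin
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
```

One such line per box goes into the per-image .txt file alongside the train/val/test split.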
There is a live demo on the landing page (no sign-up required) where you can upload a single image to test the detection logic, though the tool really shines on datasets with thousands of images and multiple classes.
If you have any technical questions, please ask!