Tell HN: Google increased existing finetuned model latency by 5x

13•deaux•2mo ago

For the past five days, the latency of our finetuned 2.5 Flash models has been 5x what it was before. For those less familiar, such finetuned models are often used to get close to the performance of a big model at one specific task with much less latency and cost. This means they're usually used for realtime production use cases that see heavy traffic and where you want to respond to the user quickly. Otherwise, finetuning generally isn't worth it. Many spend a few thousand dollars (at a minimum) on finetuning a model for one such task.

Five days ago, Google released Nano Banana Pro (Gemini 3.0 Image Preview) to the world. Ever since, the latency of our existing finetuned models has been quintupled. We've talked with other startups that also use finetuned 2.5 Flash models, and they're seeing the exact same thing, even those in different regions. Obviously this has a big impact on all of our products.

From Google's side, nothing but silence, and this is with paid support. The reply to our initial support ticket was a request for basic information that had already been provided in that ticket or is trivially obvious. Since then, it's been more than 48 hours of nothing.

Of course the timing could be pure coincidence - though we've never seen any such latency instability before - but we can all see what's most likely here: Nano Banana Pro and Gemini 3 Preview are consuming a huge amount of compute, and Google is simply sacrificing finetuned model latency for them. It's impossible to take them seriously for business use after this; who knows what they'll do next time? For all their faults, OpenAI have been a bastion of stability, despite being the most B2C-focused of all the frontier model providers. Google, with Vertex, claims to be all about enterprise and then breaks its business customers' products to get consumers their Ghibli images 1% faster. They've surely gotten plenty of tickets about this, and given Google's engineering, they must have automated monitoring that catches such a huge latency increase immediately. Temporary outages are understandable and happen everywhere (see AWS and Cloudflare recently), but 5+ days of 5x latency - if they even fix it - is effectively a 5+ day outage of the service.

I'm posting this mostly as a warning to other startups here to not rely on Google Vertex for user-facing model needs going forward.

Comments

jpau•2mo ago

Hey, we're also a Vertex tuning customer in a similar spot. We're seeing other capacity issues, although not a leap in latency. Can you DM me? I'd love to trade notes. https://x.com/hellofromjames

deaux•2mo ago

Not a verified X user, but happy to exchange here or elsewhere. The latency leap is still the same for us. We're on us-west1 but reports are that it's similar on at least us-central1 if not elsewhere. We simply can't use the finetuned models in prod any more due to this, but whenever we run our automated tests with them, including today, the latency is still there. We haven't seen issues on non-finetuned models.
