Imagine you have a very small, weak model, and you have to wait 20 seconds for your first request after it has scaled to zero while it wasn't in use. For a lot of use cases, that sounds great.
Counterintuitively (again, not joking): gen2 suffers from really bad startup speeds, because it's more like a full-on Linux VM/container than whatever weird shim environment gen1 runs in. My gen2 containers basically never start up faster than 3 seconds. Gen1 is much faster.
Note that gen1 and gen2 Cloud Run execution environments are an entirely different concept than first generation and second generation Cloud Functions. First gen Cloud Functions are their own thing. Second generation Cloud Functions can be either first generation or second generation Cloud Run workloads, because they default to the default execution environment. Believe it or not, humans made this.
With Cloud Run, AFAIK, spending can effectively be capped by: limiting concurrency, plus limiting the max number of instances it can scale to. (But this is not as good as GCP having a proper cap.)
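Back of the envelope: with max instances capped at N, worst-case spend is bounded by roughly N x (per-instance hourly price) x 730 hours/month. Capping a GPU service at 3 instances of the L4 rate quoted downthread (~$0.71/hr) bounds you at about 3 x 0.71 x 730 ≈ $1,555/month, plus request and egress charges. Concurrency doesn't change that ceiling, it just determines how much traffic fits under it. A ceiling, not a budget, but it's something.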
I get why it is a business strategy to not have limits... but I wonder if providers would get more usage if people had more trust in cost predictability.
He looked completely surprised when I asked about runaway billing and why there weren't any simple options to cap a given resource to prevent those cases.
His response was that they didn't build that because none of their customers wanted anything like that, as far as he was aware.
I think the reason this doesn’t get prioritized is that large customers don’t actually want a “stop serving if I pass this limit” option. If there’s a spike in traffic, they would probably rather pay the money to serve it. The customers who would want this feature are small-dollar customers, and from an economic perspective it makes less sense to prioritize it, since they’re not spending very much relative to the customers who wouldn’t want it.
Maybe if there weren’t more feature requests to get prioritized this might happen, but the reality is that there are always more feature requests than time to implement them, and a feature request used almost exclusively by the smallest dollar customers will always lose to a feature for big-dollar customers.
Removing a major concern that prevents individuals / small customers from using GCP in the first place, so more of them do use it.
That could then lead to value in two ways:
- They make small projects that go on to become large projects later (e.g. a small app that grows, becomes successful, and turns into a moneymaker)
- Or, they might then be more inclined to get their big corp to use GCP later on, if they've already been using it as an individual
But that's long term, and hard to measure / put a number on
Having implemented this on behalf of others several times, I'll share the common pain points:
- There's a long lead time. You need to enable Cost Explorer (24-48 hours). If you're trying for fine distinctions, activating tags as cost allocation tags is another 24 hours.
- AWS cost data is a lagging indicator, so you need to be able to absorb a day of charges.
- Automation support is poor, especially for organizations.
- Organization budgets configured at the account level are misleading if you don't understand how they're configured.
What's really needed here is for AWS to commit to more timely cost data delivery, so that you can create an hourly budget with an associated action.
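For what it's worth, here's a minimal boto3 sketch of what you can set up today: a daily cost budget with an SNS notification. The account ID and topic ARN are placeholders, an actual enforcement step still needs a separate budget action, and the data-lag caveats above still apply.

    import boto3

    budgets = boto3.client("budgets")
    budgets.create_budget(
        AccountId="123456789012",  # placeholder account id
        Budget={
            "BudgetName": "daily-cost-guardrail",
            "BudgetType": "COST",
            "TimeUnit": "DAILY",
            "BudgetLimit": {"Amount": "25", "Unit": "USD"},
        },
        NotificationsWithSubscribers=[{
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 100.0,  # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{
                "SubscriptionType": "SNS",
                "Address": "arn:aws:sns:us-east-1:123456789012:cost-alerts",  # placeholder topic
            }],
        }],
    )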
Followed by a list of caveats that make it wholly irrelevant for an individual who is afraid of a surprise charge, since none of it covers a window of less than several days.
... right up until it's their own bottom line that is at risk, and then like magic spending limits become a critical feature.
For example, Azure has no stop-loss feature for paid customers, but it does for the "free" Visual Studio subscriber credits. Because if some random dev with a VS subscription blows through $100K of GPU time due to a missing spending constraint, that's Microsoft's problem, not their own.
It's as simple as that.
What is the strategy? Is it purely market segmentation? (As in: "If you need to worry about spending too much, you're not the big-money kind of enterprise customer we want"?)
But, looking from the outside, the lack of protection is effectively a win for them. They don't need to invest in building that out, and their revenue is increased by not having it (if you ignore the effect of throttling adoption). So I have always assumed that there is simply no business case for that, so why bother?
It's coarse because it's daily and not hourly. However, you could also self-serve some of this with CloudWatch metrics mapped to a cost, and then have an alarm action.
https://aws.amazon.com/blogs/mt/manage-cost-overruns-part-1/
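If you just want the blunt version, the built-in AWS/Billing metric works for a simple alarm (it only updates a few times a day, which is part of why this is coarse). A rough boto3 sketch, with the SNS topic ARN as a placeholder:

    import boto3

    # EstimatedCharges only exists in us-east-1 and requires
    # "Receive Billing Alerts" to be enabled on the account.
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    cw.put_metric_alarm(
        AlarmName="monthly-spend-over-50-usd",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,              # 6 hours; the metric updates a few times a day
        EvaluationPeriods=1,
        Threshold=50.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:cost-alerts"],  # placeholder
    )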
Feature for Google's profits.
[0] https://cloud.google.com/billing/docs/how-to/disable-billing...
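If I'm reading [0] right, the workaround is to wire a budget's Pub/Sub notification to a small Cloud Function that detaches the billing account, which hard-stops everything. A rough sketch of the core of it (project ID is a placeholder, and note this kills every service in the project, not just the runaway one):

    import base64
    import json

    from googleapiclient import discovery

    PROJECT_NAME = "projects/my-project-id"  # placeholder

    def stop_billing(event, context):
        """Pub/Sub-triggered: detach billing once the budget is exceeded."""
        data = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        if data["costAmount"] <= data["budgetAmount"]:
            return  # still under budget, do nothing

        billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
        billing.projects().updateBillingInfo(
            name=PROJECT_NAME,
            body={"billingAccountName": ""},  # empty string = detach billing
        ).execute()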
Cap billing, and you have created an outage waiting to happen, one that will be triggered if they ever have sudden success growth.
Don't cap billing, and you have created a bankruptcy waiting to happen.
Also, hard dollar caps are rarely if ever the right choice. App Engine used to have these, and the practical effect was that your website would completely stop working exactly when you least want it to (posted on HN etc).
It's better to set billing alerts and make the call yourself if they go off.
Depends on if you're a big business or an individual. There is absolutely no reason I would ever pay $100k for a traffic burst on my personal site or side project (like the $100k Netlify case a few months ago).
> It's better to set billing alerts and make the call yourself if they go off.
Billing alerts are not instant, and nobody is online 24x7 monitoring the alerts either.
I think it is.
1) They make money for services they provide instead of looking into what the customer actually wanted.
2) Small-time customers move away, so they can concentrate their energy on big enterprise sales.
Not justifying anything here, but it just kind of makes business sense for them.
This made me laugh out loud, thank you for this!
Performance is okay; Ada Lovelace has compute capability 8.9 (sm_89) support, which brings native FP8. IMO the best aspect is the speed of spinning up new containers and the overall easiness of the service. The live demo at Google Next '25 was quite something: https://www.youtube.com/watch?v=PWPvX25R6dM&t=2140s
Interesting to see a big provider entering this space. Originally swapped to Modal because big providers weren’t offering this (e.g. AWS lambdas can’t run on GPU instances). Assuming all providers are going to start moving towards offering this?
Modal has the fastest cold-start I’ve seen for 10GB+ models.
Coiled is another option worth looking at if you're a Python developer. Not nearly as fast on cold start as Modal, but similarly easy to use and great for spinning up GPU-backed VMs for bursty workloads. Everything runs in your cloud account. The built-in package sync is also pretty nice, it auto-installs CUDA drivers and Python dependencies from your local dev context.
(Disclaimer: I work with Coiled, but genuinely think it's a good option for GPU serverless-ish workflows. )
Why bother when you can get pay-as-you-go API access to popular open-weights models like Llama on Vertex AI Model Garden or at the edge on Cloudflare?
We use this, pretty convenient and less hassle than managing our autoscaling GPU pools.
1x L4 24GB: google: $0.71; runpod.io: $0.43, spot: $0.22
4x L4 24GB: google: $4.00; runpod.io: $1.72, spot: $0.88
1x A100 80GB: google: $5.07; runpod.io: $1.64, spot: $0.82; vast.ai $0.880, spot: $0.501
1x H100 80GB: google: $11.06; runpod.io: $2.79, spot: $1.65; vast.ai $1.535, spot: $0.473
8x H200 141GB: google: $88.08; runpod.io: $31.92; vast.ai $15.470, spot: $14.563
Google's pricing also assumes you're running it 24/7 for an entire month, whereas this is just the hourly price for runpod.io or vast.ai, which both bill per second. I wasn't able to find Google's spot pricing for GPUs. When you need under 1 hour you can go with RunPod's spot pricing, which is ~4-7x cheaper than Google; even 20 min of Google would cost more than 1 hr on RunPod.
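To put numbers on that last point: at the H100 rates above, 20 minutes on Google is about $11.06 x (20/60) ≈ $3.69, which is already more than a full hour on RunPod at $2.79, and more than two hours at the $1.65 spot rate.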
What makes you think that?
The Cloud Run [pricing page](https://cloud.google.com/run/pricing) explicitly says: "charge you only for the resources you use, rounded up to the nearest 100 millisecond".
Also, Cloud Run's [autoscaling](https://cloud.google.com/run/docs/about-instance-autoscaling) is in effect, scaling down idle instances after a maximum of 15 minutes.
(Cloud Run PM)
E.g. the GCP price for a spot 1x H100 is $2.55/hr, lower with sustained-use discounts. But only hobbyists pay these prices; any company is going to ask for a discount and will get it.
Right now nothing is consumer-friendly. I can't get a packaged deal of some locally running ChatGPT-quality UI or voice command system in an all-in-one package. Like what Macs did for PCs, I want the same for AI.
I want an Amazon echo agent running my home with a locally running LLM.
Maxsun is releasing a 48GB dual Intel Arc Pro B60 GPU card. It's expected to cost ~$1000.
So for around $4k you should be able to build an 8 core 192GB local AI system, which would allow you to locally run some decent models.
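(The math: 4 cards x ~$1,000 = ~$4k and 4 x 48GB = 192GB of VRAM, so realistically a bit more once you add a board with enough PCIe lanes, CPU, RAM and a decent PSU.)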
This also assumes the community builds an Intel workflow, but given how greedy Nvidia is with VRAM, it seems poised to be a hit.
Kind of a nitpick, but I'd call that an 8-GPU system: each BMG-G21 die has 20 Xe2 cores, and even though it would be 4 PCIe cards, it's probably best to think of it as 8 GPUs (that's how it will show up in stuff like PyTorch), especially because there is no high-speed interconnect between the GPU dies co-located on the card. Also, if you're going to do this, make sure you get a motherboard with good PCIe bifurcation support.
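If you want to sanity-check how it enumerates, something like this (assuming a PyTorch build with XPU support, e.g. 2.5+ or intel-extension-for-pytorch) should list eight separate devices:

    import torch

    # Each dual-die B60 card shows up as two independent devices.
    if torch.xpu.is_available():
        for i in range(torch.xpu.device_count()):
            print(i, torch.xpu.get_device_name(i))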
This is my biggest pet-peeve with serverless GPU. 19 seconds is a horrible latency from the user’s perspective and that’s a best case scenario.
If this is the best one of the most experienced teams in the world can do, with a small 4B model, then it feels like serverless is really restricted to non-interactive use cases.
For scaling N --> N+1: if you configure the correct concurrency value (the number of parallel requests one instance can handle), Cloud Run will scale up to additional instances when an instance reaches X% of that concurrency (I think it's 70%). That happens before the instance is fully exhausted, so your users should not experience the 19-second cold start.
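Concretely: with concurrency set to, say, 10 and a ~70% threshold, Cloud Run starts provisioning instance N+1 once roughly 7 requests are in flight on instance N, so the 19-second cold start is (ideally) absorbed while the existing instance is still serving rather than being paid by a user. The first request after a scale-from-zero is still the exception.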
At that point, 19 seconds looks great, as lower latency startup times allow for much more efficient autoscaling.
1x A100 80GB 1.37€/hour
1x H100 80GB 2.19€/hour
Sign up for new GPU types at https://docs.google.com/forms/d/e/1FAIpQLSdZk5sCsDUjAoYQX-sq...
Once you compare the numbers, it is better to use a VM + GPU if your service is utilized for even only 30% of the day.
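Back of the envelope: if Cloud Run's effective GPU rate is roughly 3x an equivalent GCE VM, as others in the thread estimate, the always-on VM wins once you're busy more than about 24/3 = 8 hours a day, which is right around that 30% mark.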
1 - https://ashishb.net/programming/free-deployment-of-side-proj...
Other products (App Engine standard, Cloud Functions gen1, Cloud Run, Cloud Run Functions) share much of the same underlying infrastructure.
The problem is continuous product churn. This was discussed at length at https://news.ycombinator.com/item?id=41614795
I'd love to see the numbers for Cloud Run. It's nice for toy projects, but it's a money sink for anything serious, at least in my experience. On one project, we had a long-standing issue with G regarding autoscaling: scaling to zero sounds nice on paper, but they won't mention the warmup phases, where CR can spin up multiple containers for a single request and keep them around for a while. And good luck hunting for inexplicably running containers when there is no apparent CPU or network use (G will happily charge you for them).
Additionally, startup is often abysmal with Java and Python projects (although it might perform better with Go/C++/Rust projects, but I don't have experience running those on CR).
This is really not my experience with Cloud Run at all. We've found it to actually be quite cost effective for a lot of different types of systems. For example, we ended up helping a customer migrate a ~$5B/year ecommerce platform onto it (mostly Java/Spring and Typescript services). We originally told them they should target GKE but they were adamant about serverless and it ended up being a perfect fit. They were paying like $5k/mo which is absurdly cheap for a platform generating that kind of revenue.
I guess it depends on the nature of each workload, but for businesses that tend to "follow the sun" I've found it to be a great solution, especially when you consider how little operations overhead there is with it.
We're now investigating moving to Kubernetes where we will have more control over our destiny. Thankfully a couple people on the team have experience with this.
Something like this never happened with Fargate in the years my previous team had used that.
The authorization and auditing features are designed for internal tools; otherwise, any app can be deployed.
Does Cloud Run give you root?
And it looks like Cloud Run can do something Lambda can't: https://cloud.google.com/run/docs/create-jobs . "Unlike a Cloud Run service, which listens for and serves requests, a Cloud Run job only runs its tasks and exits when finished. A job does not listen for or serve requests."
Google built crosvm which was the initial inspiration for firecracker, but Cloud Run runs on top of Borg (this fact is publicly documented). Borg is closed source, so it's possible the specific hypervisor they're using is as well.
I’m fairly experienced with GCP, but even then, the billing model here caught me off guard. When you’re dealing with machines that can run up to $64K/month, small missteps get expensive quickly. Predictability is key, and I’d love to see more safeguards or clearer cost modeling tooling around these types of workloads.
Indeed. IIRC, if you get a single request every 15 mins (~100 requests a day), you will pay for Cloud Run GPU for the full day.
So if you get all your requests in a 2-hour window then that's great. It will scale to zero for the remaining 22 hours.
However, if you get at least one request every 15 minutes, then you will pay for all 24 hours, and that is ~3x more expensive than an equivalent VM on Google Cloud.
All the major clouds are suffering from this. On AWS you can't ever get an 80GB GPU without a long-term reservation, and even then it's wildly expensive. On GCP you sometimes can, but it's also insanely expensive.
These companies claim to be "startup friendly", but they are anything but. All the neo-clouds somehow manage to do this well (RunPod, Nebius, Lambda), while the big clouds are just milking enterprise customers who won't leave and, in the process, screwing over the startups.
This is a massive mistake they are making, which will hurt their long term growth significantly.
If anyone is curious about these neo-clouds, a YC startup called Shadeform has their availability and pricing in a live database here: https://www.shadeform.ai/instances
They have a platform where you can deploy VMs and bare metal from 20 or so popular ones like Lambda, Nebius, Scaleway, etc.
All the while saying they are "startup friendly".
$ sky launch --gpus H100
will fall back across GCP regions, AWS, your clusters, etc. There are options to say try either H100 or H200 or A100 or <insert>.
Essentially the way you deal with it is to increase the infra search space.
You go there because you are already there, or because you have contracts, etc.
Once this bubble pops we are going to have some serious, albeit high-latency, hardware.
I’m not sure that word means what you think it means. There is a pretty severe shortage of GPU capacity in the industry right now.
We're actively defining our roadmap, and understanding your use case would be incredibly valuable. If you're open to it, please email me at <my HN username>@google.com. I'd love to learn more about how you'd use worker pools and what kind of workloads you need to scale.
The main issue is that despite there being a 60-minute timeout available, the API will just straight up not return a response code if your request takes more than ~5 minutes in most cases, so you have to make sure you can poll wherever the data is being stored and let the client time out.
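The workaround ends up looking something like this (the service URL, result location and job ID are all hypothetical, just to show the shape of it): fire the request with a short client timeout, treat the dropped connection as expected, and poll the output location yourself.

    import time
    import requests

    JOB_URL = "https://my-service-abc123-uc.a.run.app/generate"  # hypothetical service
    RESULT_URL = "https://storage.googleapis.com/my-bucket/results/job-42.json"  # hypothetical output

    try:
        # Kick off the long-running request; we fully expect this to time out client-side.
        requests.post(JOB_URL, json={"job_id": "job-42"}, timeout=10)
    except requests.exceptions.Timeout:
        pass  # expected: the server keeps working even though we stopped waiting

    # Poll wherever the service writes its output until it shows up.
    while True:
        resp = requests.get(RESULT_URL)
        if resp.status_code == 200:
            print(resp.json())
            break
        time.sleep(15)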