Imagine you have a very small, weak model, and you have to wait 20 seconds for your first request after it has scaled to zero while it wasn't in use. For a lot of use cases, that sounds great.
Counterintuitively (again, not joking): gen2 suffers from really bad startup speeds, because it's more like a full-on Linux VM/container than whatever weird shim environment gen1 runs in. My gen2 containers basically never start up faster than 3 seconds. Gen1 is much faster.
Note that gen1 and gen2 Cloud Run execution environments are an entirely different concept than first generation and second generation Cloud Functions. First gen Cloud Functions are their own thing. Second generation Cloud Functions can be either first generation or second generation Cloud Run workloads, because they default to the default execution environment. Believe it or not, humans made this.
With Cloud Run, AFAIK, spending can effectively be capped by: limiting concurrency, plus limiting the max number of instances it can scale to. (But this is not as good as GCP having a proper cap.)
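Back of the envelope: with max instances capped at N, worst-case spend is bounded by roughly N x (per-instance hourly price) x 730 hours/month. Capping a GPU service at 3 instances of the L4 rate quoted downthread (~$0.71/hr) bounds you at about 3 x 0.71 x 730 ≈ $1,555/month, plus request and egress charges. Concurrency doesn't change that ceiling, it just determines how much traffic fits under it. A ceiling, not a budget, but it's something.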
I get why it is a business strategy to not have limits... but I wonder if providers would get more usage if people had more trust in cost predictability.
He looked completely surprised when I asked about runaway billing and why there weren't any simple options to cap a given resource to prevent those cases.
His response was that they didn't build that because none of their customers wanted anything like that, as far as he was aware.
I think the reason this doesn’t get prioritized is that large customers don’t actually want a “stop serving if I pass this limit” option. If there’s a spike in traffic, they would probably rather pay the money to serve it. The customers who would want this feature are small-dollar customers, and from an economic perspective it makes less sense to prioritize it, since they’re not spending very much relative to the customers who wouldn’t want it.
Maybe if there weren’t more feature requests to get prioritized this might happen, but the reality is that there are always more feature requests than time to implement them, and a feature request used almost exclusively by the smallest dollar customers will always lose to a feature for big-dollar customers.
Removing a major concern that prevents individuals / small customers from using GCP in the first place, so more of them do use it.
That could then lead to value in two ways:
- They make small projects that go on to become large projects later (e.g. a small app that grows, becomes successful, and turns into a moneymaker)
- Or, they might then be more inclined to get their big corp to use GCP later on, if they've already been using it as an individual
But that's long term, and hard to measure / put a number on
Having implemented this on behalf of others several times, I'll share the common pain points:
- There's a long lead time. You need to enable Cost Explorer (24-48 hours). If you're trying for fine distinctions, activating tags as cost allocation tags is another 24 hours.
- AWS cost data is a lagging indicator, so you need to be able to absorb a day of charges.
- Automation support is poor, especially for organizations.
- Organization budgets configured at the account level are misleading if you don't understand how they're configured.
What's really needed here is for AWS to commit to more timely cost data delivery, so that you can create an hourly budget with an associated action.
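For what it's worth, here's a minimal boto3 sketch of what you can set up today: a daily cost budget with an SNS notification. The account ID and topic ARN are placeholders, an actual enforcement step still needs a separate budget action, and the data-lag caveats above still apply.

    import boto3

    budgets = boto3.client("budgets")
    budgets.create_budget(
        AccountId="123456789012",  # placeholder account id
        Budget={
            "BudgetName": "daily-cost-guardrail",
            "BudgetType": "COST",
            "TimeUnit": "DAILY",
            "BudgetLimit": {"Amount": "25", "Unit": "USD"},
        },
        NotificationsWithSubscribers=[{
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 100.0,  # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{
                "SubscriptionType": "SNS",
                "Address": "arn:aws:sns:us-east-1:123456789012:cost-alerts",  # placeholder topic
            }],
        }],
    )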
Followed by a list of caveats that make it wholly irrelevant for an individual who is afraid of a surprise charge, since none of it covers a window of less than several days.
... right up until it's their own bottom line that is at risk, and then like magic spending limits become a critical feature.
For example, Azure has no stop-loss feature for paid customers, but it does for the "free" Visual Studio subscriber credits. Because if some random dev with a VS subscription blows through $100K of GPU time due to a missing spending constraint, that's Microsoft's problem, not their own.
It's as simple as that.
What is the strategy? Is it purely market segmentation? (As in: "If you need to worry about spending too much, you're not the big-money kind of enterprise customer we want"?)
But, looking from the outside, the lack of protection is effectively a win for them. They don't need to invest in building that out, and their revenue is increased by not having it (if you ignore the effect of throttling adoption). So I have always assumed that there is simply no business case for that, so why bother?
It's coarse because it's daily and not hourly. However, you could also self-serve some of this with CloudWatch metrics mapped to a cost, and then have an alarm action.
https://aws.amazon.com/blogs/mt/manage-cost-overruns-part-1/
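If you just want the blunt version, the built-in AWS/Billing metric works for a simple alarm (it only updates a few times a day, which is part of why this is coarse). A rough boto3 sketch, with the SNS topic ARN as a placeholder:

    import boto3

    # EstimatedCharges only exists in us-east-1 and requires
    # "Receive Billing Alerts" to be enabled on the account.
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    cw.put_metric_alarm(
        AlarmName="monthly-spend-over-50-usd",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,              # 6 hours; the metric updates a few times a day
        EvaluationPeriods=1,
        Threshold=50.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:cost-alerts"],  # placeholder
    )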
Feature for Google's profits.
[0] https://cloud.google.com/billing/docs/how-to/disable-billing...
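If I'm reading [0] right, the workaround is to wire a budget's Pub/Sub notification to a small Cloud Function that detaches the billing account, which hard-stops everything. A rough sketch of the core of it (project ID is a placeholder, and note this kills every service in the project, not just the runaway one):

    import base64
    import json

    from googleapiclient import discovery

    PROJECT_NAME = "projects/my-project-id"  # placeholder

    def stop_billing(event, context):
        """Pub/Sub-triggered: detach billing once the budget is exceeded."""
        data = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        if data["costAmount"] <= data["budgetAmount"]:
            return  # still under budget, do nothing

        billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
        billing.projects().updateBillingInfo(
            name=PROJECT_NAME,
            body={"billingAccountName": ""},  # empty string = detach billing
        ).execute()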
Cap billing, and you have created an outage waiting to happen, one that will be triggered if they ever have sudden success growth.
Don't cap billing, and you have created a bankruptcy waiting to happen.
Also, hard dollar caps are rarely if ever the right choice. App Engine used to have these, and the practical effect was that your website would completely stop working exactly when you least want it to (posted on HN etc).
It's better to set billing alerts and make the call yourself if they go off.
Depends on if you're a big business or an individual. There is absolutely no reason I would ever pay $100k for a traffic burst on my personal site or side project (like the $100k Netlify case a few months ago).
> It's better to set billing alerts and make the call yourself if they go off.
Billing alerts are not instant, and nobody is online 24x7 monitoring the alerts either.
I think it is.
1) They make money for services they provide instead of looking into what the customer actually wanted.
2) Small-time customers move away, so they can concentrate their energy on big enterprise sales.
Not justifying anything here, but it just kind of makes business sense for them.
This made me laugh out loud, thank you for this!
Performance is okay; Ada Lovelace has compute capability 8.9 (sm_89) support, which brings native FP8. IMO the best aspect is the speed of spinning up new containers and the overall easiness of the service. The live demo at Google Next '25 was quite something: https://www.youtube.com/watch?v=PWPvX25R6dM&t=2140s
Interesting to see a big provider entering this space. Originally swapped to Modal because big providers weren’t offering this (e.g. AWS lambdas can’t run on GPU instances). Assuming all providers are going to start moving towards offering this?
Modal has the fastest cold-start I’ve seen for 10GB+ models.
Coiled is another option worth looking at if you're a Python developer. Not nearly as fast on cold start as Modal, but similarly easy to use and great for spinning up GPU-backed VMs for bursty workloads. Everything runs in your cloud account. The built-in package sync is also pretty nice, it auto-installs CUDA drivers and Python dependencies from your local dev context.
(Disclaimer: I work with Coiled, but genuinely think it's a good option for GPU serverless-ish workflows. )
Why bother when you can get pay-as-you-go API access to popular open-weights models like Llama on Vertex AI Model Garden or at the edge on Cloudflare?
We use this, pretty convenient and less hassle than managing our autoscaling GPU pools.
1x L4 24GB: google: $0.71; runpod.io: $0.43, spot: $0.22
4x L4 24GB: google: $4.00; runpod.io: $1.72, spot: $0.88
1x A100 80GB: google: $5.07; runpod.io: $1.64, spot: $0.82; vast.ai $0.880, spot: $0.501
1x H100 80GB: google: $11.06; runpod.io: $2.79, spot: $1.65; vast.ai $1.535, spot: $0.473
8x H200 141GB: google: $88.08; runpod.io: $31.92; vast.ai $15.470, spot: $14.563
Google's pricing also assumes you're running it 24/7 for an entire month, whereas this is just the hourly price for runpod.io or vast.ai, which both bill per second. I wasn't able to find Google's spot pricing for GPUs. When you need under 1 hour you can go with RunPod's spot pricing, which is ~4-7x cheaper than Google; even 20 min of Google would cost more than 1 hr on RunPod.
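To put numbers on that last point: at the H100 rates above, 20 minutes on Google is about $11.06 x (20/60) ≈ $3.69, which is already more than a full hour on RunPod at $2.79, and more than two hours at the $1.65 spot rate.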
What makes you think that?
The Cloud Run [pricing page](https://cloud.google.com/run/pricing) explicitly says: "charge you only for the resources you use, rounded up to the nearest 100 millisecond".
Also, Cloud Run's [autoscaling](https://cloud.google.com/run/docs/about-instance-autoscaling) is in effect, scaling down idle instances after a maximum of 15 minutes.
(Cloud Run PM)
E.g. the GCP price for a spot 1x H100 is $2.55/hr, lower with sustained-use discounts. But only hobbyists pay these prices; any company is going to ask for a discount and will get it.
Right now nothing is consumer-friendly. I can't get a packaged deal of some locally running ChatGPT-quality UI or voice command system in an all-in-one package. Like what Macs did for PCs, I want the same for AI.
I want an Amazon echo agent running my home with a locally running LLM.
Maxsun is releasing a 48GB dual Intel Arc Pro B60 GPU card. It's expected to cost ~$1000.
So for around $4k you should be able to build an 8 core 192GB local AI system, which would allow you to locally run some decent models.
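(The math: 4 cards x ~$1,000 = ~$4k and 4 x 48GB = 192GB of VRAM, so realistically a bit more once you add a board with enough PCIe lanes, CPU, RAM and a decent PSU.)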
This also assumes the community builds an Intel workflow, but given how greedy Nvidia is with VRAM, it seems poised to be a hit.
Kind of a nitpick, but I'd call that an 8-GPU system: each BMG-G21 die has 20 Xe2 cores, and even though it would be 4 PCIe cards, it's probably best to think of it as 8 GPUs (that's how it will show up in stuff like PyTorch), especially because there is no high-speed interconnect between the GPU dies co-located on the card. Also, if you're going to do this, make sure you get a motherboard with good PCIe bifurcation support.
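If you want to sanity-check how it enumerates, something like this (assuming a PyTorch build with XPU support, e.g. 2.5+ or intel-extension-for-pytorch) should list eight separate devices:

    import torch

    # Each dual-die B60 card shows up as two independent devices.
    if torch.xpu.is_available():
        for i in range(torch.xpu.device_count()):
            print(i, torch.xpu.get_device_name(i))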
This is my biggest pet-peeve with serverless GPU. 19 seconds is a horrible latency from the user’s perspective and that’s a best case scenario.
If this is the best one of the most experienced teams in the world can do, with a small 4B model, then it feels like serverless is really restricted to non-interactive use cases.
For scaling N --> N+1: if you configure the correct concurrency value (the number of parallel requests one instance can handle), Cloud Run will scale up to additional instances when an instance reaches X% of that concurrency (I think it's 70%). That happens before the instance is fully exhausted, so your users should not experience the 19-second cold start.
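Concretely: with concurrency set to, say, 10 and a ~70% threshold, Cloud Run starts provisioning instance N+1 once roughly 7 requests are in flight on instance N, so the 19-second cold start is (ideally) absorbed while the existing instance is still serving rather than being paid by a user. The first request after a scale-from-zero is still the exception.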
At that point, 19 seconds looks great, as lower latency startup times allow for much more efficient autoscaling.
1x A100 80GB 1.37€/hour
1x H100 80GB 2.19€/hour
Sign up for new GPU types at https://docs.google.com/forms/d/e/1FAIpQLSdZk5sCsDUjAoYQX-sq...
Once you compare the numbers, it is better to use a VM + GPU if your service is utilized for even only 30% of the day.
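Back of the envelope: if Cloud Run's effective GPU rate is roughly 3x an equivalent GCE VM, as others in the thread estimate, the always-on VM wins once you're busy more than about 24/3 = 8 hours a day, which is right around that 30% mark.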
1 - https://ashishb.net/programming/free-deployment-of-side-proj...
Other products (App Engine standard, Cloud Functions gen1, Cloud Run, Cloud Run Functions) share much of the same underlying infrastructure.
The problem is continuous product churn. This was discussed at length at https://news.ycombinator.com/item?id=41614795
I'd love to see the numbers for Cloud Run. It's nice for toy projects, but it's a money sink for anything serious, at least in my experience. On one project, we had a long-standing issue with G regarding autoscaling: scaling to zero sounds nice on paper, but they won't mention the warmup phases, where CR can spin up multiple containers for a single request and keep them around for a while. And good luck hunting for inexplicably running containers when there is no apparent CPU or network use (G will happily charge you for them).
Additionally, startup is often abysmal with Java and Python projects (although it might perform better with Go/C++/Rust projects, but I don't have experience running those on CR).
This is really not my experience with Cloud Run at all. We've found it to actually be quite cost effective for a lot of different types of systems. For example, we ended up helping a customer migrate a ~$5B/year ecommerce platform onto it (mostly Java/Spring and Typescript services). We originally told them they should target GKE but they were adamant about serverless and it ended up being a perfect fit. They were paying like $5k/mo which is absurdly cheap for a platform generating that kind of revenue.
I guess it depends on the nature of each workload, but for businesses that tend to "follow the sun" I've found it to be a great solution, especially when you consider how little operations overhead there is with it.
We're now investigating moving to Kubernetes where we will have more control over our destiny. Thankfully a couple people on the team have experience with this.
Something like this never happened with Fargate in the years my previous team had used that.
The authorization and auditing features are designed for internal tools; otherwise, any app can be deployed.
Does Cloud Run give you root?
And it looks like Cloud Run can do something Lambda can't: https://cloud.google.com/run/docs/create-jobs . "Unlike a Cloud Run service, which listens for and serves requests, a Cloud Run job only runs its tasks and exits when finished. A job does not listen for or serve requests."
Google built crosvm which was the initial inspiration for firecracker, but Cloud Run runs on top of Borg (this fact is publicly documented). Borg is closed source, so it's possible the specific hypervisor they're using is as well.
I’m fairly experienced with GCP, but even then, the billing model here caught me off guard. When you’re dealing with machines that can run up to $64K/month, small missteps get expensive quickly. Predictability is key, and I’d love to see more safeguards or clearer cost modeling tooling around these types of workloads.
Indeed. IIRC, if you get a single request every 15 mins (~100 requests a day), you will pay for Cloud Run GPU for the full day.
So if you get all your requests in a 2-hour window then that's great. It will scale to zero for the remaining 22 hours.
However, if you get at least one request every 15 minutes, then you will pay for all 24 hours, and that is ~3x more expensive than an equivalent VM on Google Cloud.
All the major clouds are suffering from this. On AWS you can't ever get an 80GB GPU without a long-term reservation, and even then it's wildly expensive. On GCP you sometimes can, but it's also insanely expensive.
These companies claim to be "startup friendly", but they are anything but. All the neo-clouds somehow manage to do this well (RunPod, Nebius, Lambda), while the big clouds are just milking enterprise customers who won't leave and, in the process, screwing over the startups.
This is a massive mistake they are making, which will hurt their long term growth significantly.
If anyone is curious about these neo-clouds, a YC startup called Shadeform has their availability and pricing in a live database here: https://www.shadeform.ai/instances
They have a platform where you can deploy VMs and bare metal from 20 or so popular ones like Lambda, Nebius, Scaleway, etc.
All the while saying they are "startup friendly".
$ sky launch --gpus H100
will fall back across GCP regions, AWS, your clusters, etc. There are options to say try either H100 or H200 or A100 or <insert>.
Essentially the way you deal with it is to increase the infra search space.
You go there because you are already there, or because you have contracts, etc.
Once this bubble pops we are going to have some serious, albeit high-latency, hardware.
I’m not sure that word means what you think it means. There is a pretty severe shortage of GPU capacity in the industry right now.
We're actively defining our roadmap, and understanding your use case would be incredibly valuable. If you're open to it, please email me at <my HN username>@google.com. I'd love to learn more about how you'd use worker pools and what kind of workloads you need to scale.
The main issue is that despite there being a 60-minute timeout available, the API will just straight up not return a response code if your request takes more than ~5 minutes in most cases, so you have to make sure you can poll wherever the data is being stored and let the client time out.
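The workaround ends up looking something like this (the service URL, result location and job ID are all hypothetical, just to show the shape of it): fire the request with a short client timeout, treat the dropped connection as expected, and poll the output location yourself.

    import time
    import requests

    JOB_URL = "https://my-service-abc123-uc.a.run.app/generate"  # hypothetical service
    RESULT_URL = "https://storage.googleapis.com/my-bucket/results/job-42.json"  # hypothetical output

    try:
        # Kick off the long-running request; we fully expect this to time out client-side.
        requests.post(JOB_URL, json={"job_id": "job-42"}, timeout=10)
    except requests.exceptions.Timeout:
        pass  # expected: the server keeps working even though we stopped waiting

    # Poll wherever the service writes its output until it shows up.
    while True:
        resp = requests.get(RESULT_URL)
        if resp.status_code == 200:
            print(resp.json())
            break
        time.sleep(15)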