Unlikely? They could just as well have deployed their single Go binary to a VM from day 1, and it would have been smooth sailing for their use case while they acquired customers.
The Cloudflare Workers they chose aren't really suited to the latency-critical, high-throughput APIs they were designing.
That said, as an example, an m8g.8xlarge gives you 32 vCPU / 128 GiB RAM for about $1,000/month in us-east-1 at current on-demand pricing, and that drops to just under $700 if you can do a 1-year RI. I’m guessing this application isn’t super memory-heavy, so you could save even more by switching to the c-family: same vCPU, half the RAM.
Stick two of those behind a load balancer, and you have more compute than a lot of places actually need.
Or, if you have anything resembling PMF, spend $10K or so on a few used servers and put them into some good colo providers. They’ll do hardware replacement for you (for a fee).
Source: I work somewhere where we easily get 1 ms cached relational DB reads from outside the service.
30ms makes me suspect it went cross region.
I’m assuming you’re an employee of the company based on your comments, so please don’t take this poorly - I applaud any and all public efforts to bring back sanity to modern architecture, especially with objective metrics.
And yeah, you’re right: in hindsight it was a terrible idea to begin with.
I thought it could work, but I didn’t benchmark it enough and didn’t plan enough. It all looked great in early POCs, and all of these issues cropped up as we built it.
"Serverless was fighting us" vs "We didn't understand serverless tradeoffs" - one is a learning experience, the other is misdirected criticism.
But I don't think they (or their defenders) are aware of the real lesson here yet.
There's literally zero information that's valuable here. It's like saying "we used an 18-wheeler as our family car and then we switched over to a regular Camry and solved all our problems." What is the lesson to be learned from that statement?
The real interesting post mortem would be if they go, "god in retrospect what a stupid decision we took; what were we thinking? Why did we not take a step back earlier and think, why are we doing it this way?" If they wrote a blog post that way, that would likely have amazing takeaways.
Not sure what the different takeaways would be though?
I'm genuinely curious, because this is not singling out your team or org; this is a very common occurrence among modern engineering teams, and I've often found myself on the losing end of such arguments. So I am all ears to hear at least one such team tell what goes on in their mind when they make terrible architecture decisions, and whether they learned anything philosophical that would prevent a repeat.
I was working on it on and off, moving one endpoint at a time, but it was very slow going until we hired someone who was able to focus on it.
It didn’t feel good at all. We knew the product had massive flaws due to the latency but couldn’t address it quickly, especially because we had to build more workarounds as time went on. Workarounds we knew would be made redundant by the reimplementation.
I think we had that “wtf are we doing here” discussion pretty early, but we didn’t act on it in the beginning; instead we tried different approaches to make it work within the serverless constraints, because that’s what we knew well.
Isn’t this the whole point of serverless edge?
It’s understood to be more complex, with more vendor lockin, and more expensive.
The trade-off is that it’s better supported and faster by being on the edge.
Why would anyone bother to learn a proprietary platform for a non-critical, latency-agnostic service?
The whole point of edge is NOT to make latency-critical APIs with heavy state requirements faster. It's to make stateless operations faster. Using it for the former is exactly the mismatch I'm describing.
Their 30ms+ cache reads vs sub-10ms target latency proves this. Edge proximity can't save you when your architecture adds 3x your latency budget per cache hit.
I wonder if there is anything other than good engineering getting in the way of this, and even of sub-microsecond intra-process pull-through caches for busy Lambda functions. After all, if my Lambda is getting called 1000x per second from the same point of presence, why wouldn't they keep the process in memory?
That's warm start vs. cold start.
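Warm starts do make an in-process pull-through cache viable: package-level state in a Go Lambda survives across invocations of the same execution environment. A minimal sketch, assuming the aws-lambda-go runtime and a hypothetical loadFromDB stand-in for the real backing read:

```go
package main

import (
	"context"
	"sync"

	"github.com/aws/aws-lambda-go/lambda"
)

// Package-level state lives as long as the execution environment stays warm,
// so repeated invocations routed to the same process hit this map instead of the DB.
var (
	mu    sync.RWMutex
	cache = map[string]string{}
)

// loadFromDB is a hypothetical stand-in for the real backing read.
func loadFromDB(ctx context.Context, key string) (string, error) {
	return "value-for-" + key, nil
}

func handler(ctx context.Context, key string) (string, error) {
	mu.RLock()
	v, ok := cache[key]
	mu.RUnlock()
	if ok {
		return v, nil // warm-invocation fast path: no network round trip
	}
	v, err := loadFromDB(ctx, key)
	if err != nil {
		return "", err
	}
	mu.Lock()
	cache[key] = v
	mu.Unlock()
	return v, nil
}

func main() {
	lambda.Start(handler)
}
```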
This may or may not matter to you depending on your application’s needs, but there is a significant performance difference between, say, an m4 family (Haswell / Broadwell) and an m7i family (Sapphire Rapids) - literally a decade of hardware improvements. Memory performance in particular can be a huge hit for latency-sensitive applications.
Most cloud pain people experience is from a misunderstanding / abuse of solutions architecture and could have been avoided with a more thoughtful design. It tends to be a people problem, not a tool problem.
However, in my experience cloud vendors sell the snot out of their offerings, and the documentation is closer to marketing than truthful technical documentation. Their products’ genuine performance is a closely guarded proprietary secret, and the only way to find out… e.g. whether Lambdas are fast enough for your use case, or whether AWS RDS cross-region replication is good enough for you… is to run your own performance testing.
I’ve been burned enough times by AWS making it difficult to figure out exactly how performant their services are, and I’ve learned to test everything myself for the workloads I’ll be running.
I know about Anycast but not how to make it operational for dynamic web products (not like CDN static assets). Any tips on this?
I participated in AWS training and certification given by AWS for a company to obtain a government contract and I can 100% say that the PAID TRAINING itself is also 100% marketing and developer evangelism.
AWS will hopefully be reduced to natural language soon enough with AI, and their product team can move on (most likely they moved on a long time ago, and the revolving door at the company meant it was going to remain a shittily thought-out platform in long-term maintenance).
I think they are shooting themselves in the foot with this approach. If you have to run a monte carlo simulation on every one of their services at your own time and expense just to understand performance and costs, people will naturally shy away from such black boxes.
I don't think this is true. In fact, it seems that in the industry, many developers don't proceed with caution and go straight into usage, only to find the problems later down the road. This is a result of intense marketing on the part of the cloud providers.
I think it's because connections can be reused more often. Cloudflare Workers are really prone to doing a lot of TLS handshakes because they spin up new ones constantly.
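For contrast, a long-lived server gets connection reuse almost for free. A minimal Go sketch, assuming a single package-level http.Client (the URL is just a placeholder): its transport pools keep-alive connections, so the TLS handshake is paid roughly once per upstream host rather than once per invocation.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// One client for the whole process: its Transport pools keep-alive connections,
// so repeated requests to the same host reuse an already-handshaken TLS connection.
var client = &http.Client{Timeout: 10 * time.Second}

func fetch(url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
	return err
}

func main() {
	for i := 0; i < 3; i++ {
		if err := fetch("https://example.com"); err != nil {
			fmt.Println(err)
		}
	}
}
```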
Right now we’re just using AWS Fargate for the Go servers, so there really isn’t much maintenance at all. We’ll be moving that into EKS soon though, because we’re starting to add more stuff and need k8s anyway.
Unfortunately too many comments here are quick to come to the wrong conclusion, based only on the title. Not a reason to change it though!
It’s totally fair criticism that the title and wording are a bit clickbaity.
But that’s ok.
- Eliminated complex caching workarounds and data pipeline overhead
- Simplified architecture from distributed system to straightforward application
We, as developers/engineers (put whatever title you want), tend to make things complex for no reason sometimes. Not all systems have to follow state-of-the-art best practices. Many times, secure, stable, durable systems outperform these fancy techs and inventions. Don't get me wrong, I love to use all of these technologies and fancy stuff, but sometimes that old, boring, monolithic API running on an EC2 solves 98% of your business problems, so no need to introduce ECS, K8S, Serverless, or whatever.
Anyway, I guess I'm getting old, or I understand the value of a resilient system, and I'm trying to find peace xD.
"Down with serverless! Long live serverless!"
While it "takes away" some work from you, it adds that work back elsewhere to solve the artificially induced problems.
Another example I hit was a hard upload limit. We ported an application to a serverless variant that had an import API for huge customer exports. Shouldn't be a problem, right? Just set up an ingest endpoint and some background workers to process the data.
Then I learned: I can't upload more than 100 MB at a time through the "API gateway" (basically their proxy that invokes your code), and when I asked if I could change it somehow, I was just told to tell our customers to upload smaller file chunks.
While from a "technical" perspective this sounds logical, our customers aren't going to start swapping out all their software just so we get a "nicer upload strategy".
For me this is comparable to "it works in a vacuum" type things. It's cool in theory, but as soon as it hits reality you realize quite quickly that the time and money you saved by moving from permanently running machines to serverless, you will spend in other ways working around the serverless peculiarities.
Have the users upload to S3 directly, and then they can either POST you what they uploaded or you can find some other means of correlating the input (e.g. files in S3 are prefixed with the request ID or something); a sketch of this follows below.
I agree this is annoying, and maybe I’ve been in the AWS ecosystem for too long.
However, having an API that accepts an unbounded amount of data is a good recipe for DoS attacks. I suppose the 100 MB limit is outdated as the internet has gotten faster, but eventually you do need some limit.
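For the S3 suggestion above, a minimal sketch using the AWS SDK for Go v2 presign client; the bucket and key are placeholders, and the customer then PUTs the file straight to S3, so the API Gateway body limit never comes into play:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	presigner := s3.NewPresignClient(s3.NewFromConfig(cfg))

	// Hand this URL to the customer; they PUT the file directly to S3,
	// and your import workers pick it up (e.g. keyed by a request-ID prefix).
	req, err := presigner.PresignPutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("example-import-bucket"),       // placeholder
		Key:    aws.String("imports/req-1234/export.csv"), // placeholder
	}, s3.WithPresignExpires(15*time.Minute))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(req.URL)
}
```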
I guess they never came out of MVP, which could warrant using serverless, but in the end it makes zero sense to use a slow solution like this for the service they are offering.
Why didn't they go with a self-hosted backend right away?
It's funny how nowadays most devs are too scared to roll their own and just go with the cloud offerings that cost them tech debt and actual money down the road.
We believed their docs/marketing without doing extensive benchmarks, which is on us.
The appeal was also to use the same typescript stack across everything, which was nice to work with
I wanted my app to be self-hostable as well, and Cloudflare Workers is a hard ecosystem lock-in to their platform, which makes it undesirable (imo).
Here is a link to my reasoning from back then: https://github.com/K0IN/Notify/pull/77#issuecomment-16776070...
Also the vendor lock-in doesn’t help, with Durable Objects and D1 instead of simply doing what Supabase and others are doing by providing Postgres or standard SQLite as a service.
This tooling fetish hurts both companies and developers.
And that is actually the advantage of serverless, in my mind. For some low-traffic workloads, you can host for next to nothing. Per invocation, it is expensive, but if you only have a few invocations of a workload that isn't very latency sensitive, you can run an entirely serverless architecture for pennies per month.
Where people get burned is moving high traffic volumes to serverless... then they look at their bill and go, "Oh my god, what have I done!?" Or they try to throw all sorts of duct tape at serverless to make it highly performant, which is a fool's errand.
The industry is creating learned helplessness.
"With standard Go servers, self-hosting becomes trivial:"
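Presumably the article follows that quote with something along these lines; a minimal sketch of the idea (the handler and port here are illustrative, not the article's actual code):

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	// One static binary, one port: `go build` it, copy it to a VM,
	// and run it behind a reverse proxy or load balancer.
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```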
A key point that I always make: serverless is good if you want a simple periodic task to run intermittently without worrying about a full-time server. The moment things get more complex than that (which in the real world they almost always do), you need a proper server.
pjmlp•1h ago
All major cloud vendors have serverless solutions based on containers, with longer managed lifetimes between requests, and naturally the ability to use proper AOT-compiled languages in the containers.
OvervCW•1h ago
It reminds me of the companies that start building their application using a NoSQL database and then start building their own implementation of SQL on top of it.
keyle•1h ago
Isn't serverless, at its base, the old model of shared VMs, except with a ton of people?
I'm old school I guess, bare metal for days...
pjmlp•1h ago
Usually it's a decision between more serverless or more DevOps salaries.
9rx•32m ago
Why's that? Serverless is just the generic name for CGI-like technologies, and CGI is exactly how classical web applications were typically deployed historically, until Rails became such a large beast that it was too slow to keep using CGI; running your application as a server to work around that problem in Rails pushed that approach to become the norm across the industry, at least until serverless became cool again.
Making your application the server is what is more complex with more moving parts. CGI was so much simpler, albeit with the performance tradeoff.
Perhaps certain implementations make things needlessly complex, but it is not clear why you think serverless must fundamentally be that way.
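To make the comparison concrete, here's a hedged Go sketch: the same handler can run CGI-style (one short-lived process per request via net/http/cgi, roughly the serverless model) or as the long-running server that became the Rails-era norm.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/cgi"
	"os"
)

func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "hello from %s\n", r.URL.Path)
}

func main() {
	if os.Getenv("GATEWAY_INTERFACE") != "" {
		// CGI style: the web server spawns this binary per request,
		// which is roughly the serverless / scale-to-zero model.
		if err := cgi.Serve(http.HandlerFunc(handler)); err != nil {
			log.Fatal(err)
		}
		return
	}
	// Server style: one long-lived process owns the socket.
	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(handler)))
}
```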
johannes1234321•1h ago
It can make sense if you have very uneven load with a few notable spikes, or if you're all in on managed services, where the serverless pieces are event collectors for other services ("new file in object store" triggers a function to update some index).
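A minimal sketch of that event-collector pattern, assuming the aws-lambda-go runtime and an S3 "object created" trigger; updateIndex is a hypothetical stub for whatever index you maintain:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// updateIndex is a hypothetical stand-in for whatever index/search update you run.
func updateIndex(ctx context.Context, bucket, key string) error {
	log.Printf("indexing s3://%s/%s", bucket, key)
	return nil
}

// handler fires once per "new file in object store" event and does nothing else:
// the function is purely an event collector glued to managed services.
func handler(ctx context.Context, e events.S3Event) error {
	for _, rec := range e.Records {
		if err := updateIndex(ctx, rec.S3.Bucket.Name, rec.S3.Object.Key); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	lambda.Start(handler)
}
```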
iainmerrick•21m ago
The nice thing about JS workers is that they can start really fast from cold. If you have low or irregular load, but latency is important, Cloudflare Workers or equivalent is a great solution (as the article says towards the end).
If you really need a full-featured container with AOT compiled code, won't that almost certainly have a longer cold startup time? In that scenario, surely you're better off with a dedicated server to minimise latency (assuming you care about latency). But then you lose the ability to scale down to zero, which is the key advantage of serverless.
pjmlp•13m ago
Serverless with containers is basically managed Kubernetes, where someone else has the headache of keeping the whole infrastructure running.