AWS Lambda Silent Crash – A Platform Failure, Not an Application Bug [pdf]

https://lyons-den.com/whitepapers/aws-lambda-silent-crash.pdf

58•nonfamous•5h ago

Comments

elzbardico•4h ago

Geez man, couldn't you just try rewriting this small piece of code in another language?

pensatoio•4h ago

While I have no empathy or patience for Amazon’s support model, you’re really asking for it when you write backend code in Node.

I get that some big businesses are using JavaScript on the backend, but I’ll bet almost every single one would be better off switching to a type-safe, compiled language.

malfist•4h ago

Maybe share the code so others can see?

ceejayoz•3h ago

I saw this come up on /r/aws a few days back.

This response seemed illuminating:

https://www.reddit.com/r/aws/comments/1lxfblk/comment/n2qww9...

> Looking at (this section)[https://will-o.co/gf4St6hrhY.png], it seems like you're trying to queue up an asyncronous task and then return a response. But when a Lambda handler returns a response, that's the end of execution. You can't return an HTTP response and then do more work after that in the same execution; it's just not a capability of the platform. This is documented behavior: "Your function runs until the handler returns a response, exits, or times out". After you return the object with the message, execution will immediately stop even if other tasks had been queued up.

mcflubbins•3h ago

If this is the case (it might very well be) I do at least find it odd that no one on AWS' side was able to explain this to them.

ceejayoz•3h ago

Some people have a lot of certainty coupled with a lack of accuracy.

semiquaver•3h ago

Based on the way the document is written, it seems very likely that several people did realize exactly what the misunderstanding was and try to explain it to them.

somethingAlex•3h ago

It may just be such a basic tenant of the platform that no one thought to. You stop getting billed after lambda returns a response so why would you expect computation to continue? This guy expected free lunch.

charcircuit•2h ago

I don't see that on the main AWS Lambda pages. It just says that you pay for what you use. It would make sense that the time billed would be until there is no more code to execute.

meepmorp•2h ago

Yeah, but if you actually read the developer documentation, they explain the execution model pretty extensively.

somethingAlex•2h ago

Fair enough, I guess this just seems like a bold assumption to make since an explicit handler function is a cornerstone of lambda, rather than being able to run module level code and having the end self detected.

m3sta•3h ago

AWS has a lot of employees. Many of them don't know AWS very well.

scarface_74•2h ago

No AWS employee knows “AWS well”. AWS has a huge surface area. If you work on one of the service teams - ie the team that maintains the different AWS services - you are very much unlikely to know the entire AWS surface area and be focused on your team and surrounding services.

It use to be the case if you were interviewing for an SDE position, you were specifically told not to mention specific AWS services in the system design rounds and speak of generic technologies.

jamesfinlayson•2h ago

I remember interviewing a guy who worked at Amazon - he commented that he started trusting AWS much less after getting to know some of the developers who worked on AWS.

eddythompson80•2h ago

They did (in a typical non-fault admitting way). They didn't escalate to Lambda engineering team, then said that this is a code issue, and that they should move to EC2 or Fargate, which is the polite way of saying "you can't do that on lambda. its your issue. no refunds. try fargate."

OP seems to be fixated on wanting MicroVM logs from AWS to help them correlate their "crash", but there likely no logs support can share with them. The microVM is suspended in a way you can't really replicate or test locally. "Just don't assume you can do background processing". Also to be clear, AWS used to allow the microVM to run for a bit after the response is completed to make sure anything small like that has been done.

It's an nondeterministic part of the platform. You usually don't run into it until some library start misbehaving for one reason or another. To be clear, it does break applications and it's a common support topic. The main support response is to "move to EC2 or fargate" if it doesn't work for you. Trying to debug and diagnose the lambda code with the customer is out of support scope.

xer0x•2h ago

There is a "post-invocation phase" for tidying up. It's not long, and it's enough for things like Datadog's metrics plugin to send off some numbers. Yet it will fail, if you have too many metrics. It's fuzzy.

Scaevolus•2h ago

You only get that when you register extensions: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-...

johnduhart•3h ago

Oh wow, a 23-page write up about how the author misunderstood AWS Lambda's execution model [1].

> It emits an event, then immediately returns a response — meaning it always reports success (201), regardless of whether the downstream email handler succeeds or fails.

It should be understood that after Lambda returns a response the MicroVM is suspending, interrupting your background HTTP request. There is zero guarantee that the request would succeed.

1: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-...

kevin_nisbet•3h ago

I don’t know about node but a fun abuse of this is background tasks can still sometimes run on a busy lambda as the same process will unsuspend and resuspend the same process. So you can abuse this sometimes for non essential background tasks and to keep things like caches in process. You just cant rely on this since the runtime instead might just cycle out the suspended lambda.

johnduhart•3h ago

Absolutely, I do this at $dayjob to update feature flags and refresh config. Your code just needs to understand that such execution is not guaranteed to happened, and in-flight requests may get interrupted and should be retried.

nemothekid•3h ago

If I'm reading this correctly, then AWS Support dropped the ball here but this isn't a bug in lambda. This is the documented behavior of the lambda runtime.

The document is long, and the examples seem contrived, so anyone is free to correct me but as I understand it the lambda didn't crash, after you returned 201, your lambda instance was put to sleep. You aren't guaranteed that any code will remain running after your lambda "ends". I am not sure why AWS Support was unable to communicate this OP.

If you are using Lambda with a function URL, you aren't guaranteed that anything after you return your http response remains running. I believe Lambda has some callbacks/signals you can listen to, to ensure your function properly cleans up before the Lambda is frozen, but if you want the lambda to return as fast as possible it seems you are better off having your service publish to an SQS queue instead.

Everdred2dx•3h ago

Yep. I've ran into this using Bugsnag for reporting on unhandled exceptions in Python-based lambda functions. The exception handler would get called, but because the library is async by default the HTTP request wouldn't make it out before the runtime was torn down.

I sympathize with OP because debugging this was painful, but I'm sorry to say this is sort of just a "you're holding it wrong" situation.

semiquaver•3h ago

This document is bizarre. The author is so confidently verbose about something they are clearly misunderstanding, and have been told as much dozens of times. It’s humbling, in a way, to think of times I’ve felt this strongly about something and to consider the possibility I could have been this wrong.

tough•3h ago

stubbornness is a very human thing

johnfn•2h ago

It's pretty clearly written by GPT.

x3n0ph3n3•2h ago

Can you share some of the tell-tale signs you pick up on?

Edit: I see he's the CTO of an AI company.

johnfn•2h ago

It's never a slam-dunk with AI-generated content, but some of the signs I notice are:

- Constant bulleted lists with snappy, non-punctuated items.

- Single word sentences to emphasize a point ("It was disciplined. Technical. Forensic.")

- The phrasing. e.g. the part about submitting to Reddit: "This response was disproportionate to the activity involved: a first-time technical post, written with precision, submitted to the relevant forum, and not yet visible to any other user." Who on earth says "written with precision" about their own writing?

To be clear, I don't think it's a fabricated account. I also don't think it was a one-shot. OP probably iterated on it with GPT for quite some time.

appreciatorBus•2h ago

He’s already added an entry about the whole incident to his resume:

https://lyons-den.com/CV/David_Lyon_CTO_CV_2025.pdf

EDIT: he has added three(!) separate mentions of the same incident to his résumé

tczMUFlmoNk•2h ago

At least it is helpfully located in the "Thought Leadership" section as an early flag for the reader.

mrlatinos•2h ago

This has GPT written all over it: "That’s not just a runtime bug. That’s a reliability risk baked into the ecosystem."

For whatever reason, it constantly uses rhetorical reclassification like “That’s not just X. That’s Y.” when it's trying to make a point.

In GPT's own words:

``` ### Why GPT uses it so often:

- It sounds insightful and persuasive with minimal complexity. - It gives an impression of depth by moving from the obvious to the interpretive. - It matches common patterns in blog posts, opinion writing, and analyst reports. ```

I think that last point is probably the most important.

happytoexplain•3h ago

Consider that this conversation is even being had - that you are even in the situation to be writing what you are writing.

"Failure" != "Bug"

encomiast•3h ago

> I am not sure why AWS Support was unable to communicate this OP.

Maybe we should be asking why the OP was not able to hear what AWS was telling them. I think there is a fairly troublesome cognitive bias that gets flipped on when people are telling you that you're wrong — especially when all your personal branding and identity seems to be about you being right.

cldcntrl•3h ago

> If I'm reading this correctly, then AWS Support dropped the ball here but this isn't a bug in lambda. This is the documented behavior of the lambda runtime.

AWS Support is generally ineffective unless you're stuck on something very simple at a higher level of the platform (e.g. misunderstanding an SDK API).

Even with their higher tier support - where you can summon a subject matter expert via Chime almost instantly - they're often clueless, and will confidently pass you misleading or incorrect information just to get you off the line. I've successfully used them as a very expensive rubber ducky, but that's about it.

sneilan1•3h ago

`This wasn’t a request for clarification. It was a reset — a request for artefacts that have already been delivered, multiple times, across prior case updates, call notes, and debug summaries. The very same package AWS previously reviewed. The same logs they had already quoted. The same failure they had already confirmed.`

While the author is probably correct in this being a platform level bug (I have not ran the code to confirm myself though), they should stop being so angry at a corporation doing what corporations do which is run a big bureaucracy and require extraordinary amounts of extra communication. One of job requirements of a principle engineer, often making hundreds of thousands if not millions of dollars a year, is to communicate across levels because it is so hard to do this effectively without being angry or losing your cool.

`As a solo engineer, I outpaced AWS’s own diagnostics, rebuilt our deployment stack from scratch, eliminated every confounding factor, and forced AWS to inadvertently reproduce a failure they still refused to acknowledge.`,

I do not think that a solo engineer would have the economic clout to get AWS to pay attention to the bug you found. Finding a bug in a system does not imply any ability to fix a bug in a bureaucracy.

Plus, given the tone of the article, I have a funny feeling that the author may have rubbed people in AWS the wrong way, preventing more progress from being made on their bug.

But, a wise fellow once said, "all progress in this world depends upon the unreasonable man".

donkey_brains•3h ago

I didn’t bother to read this incredibly long write up past the introduction. But other commenters are asserting that the author expected their Lambda function to continue executing an async task after returning an HTTP response, despite numerous attempts by AWS staff to explain why that wouldn’t work. If true, that is _hilarious_. Easily one of the best stories of misguided anger/hubris/Dunning-Kruger I’ve ever heard. That’s like expecting a local function to continue executing after a `return` statement. It’s such a simple, basic principle of the platform they were trying to use.

But one thing bothers me…wouldn’t the author have encountered the exact same behavior in Azure? I guess I really will have to read this paper to find out.

Everdred2dx•3h ago

They mentioned “rebuilding” in EC2. Probably just moved that over to Azure and no longer are bothering with serverless :)

pryelluw•2h ago

I mean, they should just have created another queue, that triggers another lambda that then continues the process after the http response is sent.

However, it is still funny when you think that in regular servers you still finish when you respond to the request.

My dog keeps getting angry at the door. It won’t open automatically.

scarface_74•2h ago

AWS has a clear policy of not providing support for user application code. I’ve never worked at a company that had enterprise level support with a dedicated TAM. It might be different there.

However if it is worth anything I did work at AWS for 3.5 years (AWS ProServe). I knew almost immediately what the issue was as soon as I read the article and knew it was documented and expected behavior.

Analemma_•2h ago

This post is a good lesson on humility, and why emotionally-heated blog posts might not be a great idea, no matter how satisfying, and you should possibly stick to dry technicalities instead. If OP had just written up their experience without all the sneering and potshots at Amazon support, people would've just gently corrected them. But as it is they look like a tremendous asshole, and certainly not fit for a CTO role, either temperamentally or skill-wise.

Quarrel•1h ago

The poor guy has obviously wound himself in knots over this for several months. He went through all this, obviously really bothered by the impersonal Corporation, "solved" his problem by moving to Azure, but then wrote up this massive rant.

Sounds like he needs a holiday at the least.

The mere idea that reddit has a special Amazon protection team that is the cause of his account suspension is ludicrous.

Hopefully this exposure will help him out.

reilly3000•1h ago

What a tremendous waste of resources. As a startup engineer and leader your job is to build value for customers and investors as quickly and effectively as possible, knowing the privilege of their participation with you is fleeting. The time spent in those meetings or writing this vitriolic pdf would have gone a long way towards shipping features, even if you had to survive with a suboptimal approach for a while. Time isn’t cheap and this cost a fortune.

crinkly•45m ago

Whilst this is a combination of user error and shitty support (not unexpected) I’ve seen Amazon directly not understand how their own lambda stuff works before. There is a pretty horrible bug in one of their higher level products that used lambda because they didn’t know that an execution environment can be reused between invocations. Their software assumed entirely incorrectly that it was stateless resulting in us having to, as the first line in the lambda, clean up the previous invocation’s data on the filesystem. After 3 years they haven’t fixed this as far as I am aware. Getting the actual engineers who wrote this on a call was unproductive as well suggesting it would take months to sort out. We built our own in the end.

Am I Becoming Irrelevant?

Rocket engine designed by generative AI just completed its first hot fire test

W3 Servers (1992)

My Office Chair Broke My PC for a Year

Cyberpunk and Politics: Neon Dystopias, Power, and Resistance

Come play Game take coin

React-map-gl OR Mapbox-gl

Human: Everything you need to know about the most extraordinary story of all

Show HN: Tap BPM

An Unholy Alliance

Why Every Startup Will Be AI Native – Or Die Trying

Gmail's new 'Manage subscriptions' tool will help declutter your inbox

AX: Agent Experience

Linux can run in a Latent Diffusion Model

Smart Tax and E-Invoicing Software

Nvidia to Resume H20 AI Chip Sales to China in US Reversal

Delegation is the AI Metric that Matters

Doing web things with CGIs is mostly no longer a good idea

Mayo project uses kites to generate electricity

A look at IBM's short-lived "butterfly" ThinkPad 701 of 1995

AI finds potential antibiotics in snake and spider venom

Treat the Internet Like Real Life

Godot Showcase: Dogwalk

Friendship promotes neural and behavioral similarity

The Trump Administration Is About to Incinerate 500 Tons of Emergency Food

Ask HN: What's your favorite book you've read?

Django – The Origin Story

Algolia for Seraching Hacker News

Trial Court Decides Case Based on AI-Hallucinated Caselaw

Belgian CVD is deeply broken