The origin, as far as I know it. I think it still holds and is insightful as a general case. "Let it heal" seems pretty close to what Joe was getting at.
We miss you Joe :)
Is this conceptually similar, but perhaps at code-level instead?
https://hexdocs.pm/elixir/1.18.4/Supervisor.html
BEAM apps run great on k8s.
For example, testing that kubernetes restarts work correctly is tricky and requires a complicated setup. Testing that an erlang process/actor behaves as expected is basically a unit test.
But that doesn't cover the behavior of your app, the specific configuration you ask kubernetes to use and how the app uses its health endpoints etc. - this is all purely about your own code/config, the kubernetes team can't test that.
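For illustration, here's roughly what "testing an actor is basically a unit test" can look like in Elixir with ExUnit (the Counter module is made up):

    defmodule Counter do
      use GenServer

      def start_link(initial), do: GenServer.start_link(__MODULE__, initial)
      def increment(pid), do: GenServer.call(pid, :increment)

      @impl true
      def init(initial), do: {:ok, initial}

      @impl true
      def handle_call(:increment, _from, n), do: {:reply, n + 1, n + 1}
    end

    defmodule CounterTest do
      use ExUnit.Case, async: true

      test "the process behaves as expected" do
        # Start the actor, send it messages, assert on the replies.
        {:ok, pid} = Counter.start_link(0)
        assert Counter.increment(pid) == 1
        assert Counter.increment(pid) == 2
      end
    end

No cluster, no restart policy, no health endpoints involved - just a process under test.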
Of course there are still errors that can't be recovered from, in which case the whole program may finally crash.
This may happen if you let it, but it's basically never the desired outcome. If you were handling a user request, it should stop by returning an HTTP 500 to the client, or if you were processing a background job of some sort, it should stop with a watchdog process marking the job as a failure, not with the entire system crashing.
The equivalent of "let it crash" outside of Erlang is a mountain of try-catch statements and hand-rolled retry wrappers with time delays, with none of the observability and tooling that you get in Erlang.
This is my favourite line, because it generalizes the underlying principle beyond the specific BEAM/OTP model in a way that carries over well to the more common sort of database-backed services that people tend to write.
Erlang is used in situations involving a zillion incoming requests. If an individual request fails… Maybe it was important. Maybe it wasn’t. If it was important, it’s expected they’ll try again. What’s most important is that the rest of the requests are not interrupted.
What makes Erlang different is that it is natural and trivial to be able to shut down an individual request on the event of an error without worrying about putting any other part of the system into a bad state.
You can pull this off in other languages via careful attention to the details of your request-handling code. But, the creators of the Erlang language and foundational frameworks have set their users up for success via careful attention to the design of the system as a whole.
That’s great in the contexts in which Erlang is used. But, in the context of a Java desktop app like Open Office, it’s more like saying “Let it throw”. “It” being some user action. And, the slogan being to have a language and framework with such robust exception handling built-in that error handling becomes trivial and nearly invisible.
Let it crash, because a relevant manager will detect it, report it, clean it up, and restart it, without you having to write a line of code for that.
Let it crash as soon as possible, so that any problem (like a crash loop) is readily visible. It's very easy to replace arbitrary bits of Erlang code in a running system, without affecting the rest of it. "Fix it in prod" is better than "miss it in prod", especially when you cannot stop the prod ever.
Erlang works by message passing and duck typing, so, as long as your interfaces are compatible (backwards or forwards), you can alter the implementation, and evolve the interfaces. Think microservices, but when every function can be a microservice, at an absolutely trivial cost.
+10. So many people miss this very important point. If you have lots of mutable shared state, or can accidentally leak such into your actor code then the whole actor/supervision tree thing falls over very easily... because you can't just restart any actor without worrying about the rest of the system.
I think this is a large (but not the only[0]) part of why actors/supervisors haven't really caught on anywhere outside of Erlang, even for problem spaces where they would be suitable.
[0] I personally feel the model is very hard to reason about compared to threaded/blocking straight-line code using e.g. structured concurrency, but that may just be a me thing.
There was a joke article parodying "GOTO considered harmful" by suggesting a "COME FROM" command. But in a lot of ways, that's exactly what many modern frameworks and languages aim for.
However, in a world where you have to do concurrent blocking/locking code without the help of rigorous compiler-enforced ownership semantics, Elixir/Erlang is like water in the desert.
Of course not, but usually that's not what happens. Instead, a process crashes because some condition was not considered, the corresponding request is aborted, and a supervisor restarts the process (or doesn't, because the acceptor spawns a process per request / client).
Or a long-running worker got into an incorrect state and crashed, and a supervisor will restart it in a known good state (that's a pretty common thing to do in hardware, BEAM makes that idiomatic in software).
If there aren't any good states then the program straight up doesn't work in the first place, which gets diagnosed pretty quickly before it hits the field.
> your work needs to be correct more than it needs to be available.
"correctness over availability" tends to not be a thing, if you assume you can reach perfect and full correctness then either you never release or reality quickly proves you wrong in the field. So maximally resilient and safe systems generally plan for errors happening and how to recover from them instead of assuming they don't. There are very few fully proven non-trivial programs, and there were even less 40 years ago.
And Erlang / BEAM was designed in a telecom context, so availability is the prime directive. Which is also why distribution is built-in: if you have a single machine and it crashes you have nothing.
But a supervisor also sets limits, like “10 restarts in a timespan of 1 second.” Once the limits are reached, the supervisor crashes. Supervisors have supervisors.
In this scenario the fault cascades upward through the system, triggering more broad restarts and state-reinitializations until the top-level supervisor crashes and takes the entire system down with it.
An example might be losing a connection to the database. It’s not an expected fault to fail while querying it, so you let it crash. That kills the web request, but then the web server ends up crashing too because too many requests failed, then a task runner fails for similar reasons. The logger is still reporting all this because it’s a separate process tree, and the top-level app supervisor ends up restarting the entire thing. It shuts everything off, tries to restart the database connection, and if that works everything will continue, but if not, the system crashes completely.
Expected faults are not part of “let it crash.” E.g. if a user supplies a bad file path or network resource. The distinction is subjective and based around the expectations of the given app. Failure to read some asset included in the distribution is both unlikely and unrecoverable, so “let it crash” allows the code to be simpler in the happy path without giving up fault handling or burying errors deeper into the app or data.
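For reference, those limits are ordinary supervisor options; a minimal Elixir sketch of the "10 restarts in a timespan of 1 second" policy mentioned above (MyApp.Worker is a made-up child):

    children = [MyApp.Worker]

    # If MyApp.Worker crashes more than 10 times within 1 second, this
    # supervisor gives up and crashes too, escalating to its own supervisor.
    Supervisor.start_link(children,
      strategy: :one_for_one,
      max_restarts: 10,
      max_seconds: 1
    )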
But of course if it crashes because you are reading a file that does not exist, it doesn't solve the issue (but it avoids crashing the whole system).
Also, restarting endlessly is just one strategy among several others.
If the problem persists, a larger part of the supervision tree is restarted. This eventually leads to a crash of the full application, if nothing can proceed without this application existing in the Erlang release.
The key point is that there's a very large class of errors which is due to the concurrent interaction of different parts of the system. These problems often go away on the next try, because the risk of them occurring is low.
Exploitation of vulnerabilities isn’t always 100% reliable. Heap grooming might be limited or otherwise inadequate.
A quick automatic restart keeps them in business without any other human interaction involved.
1. Detect crashes at runtime and by default stop/crash to prevent continuing with invalid program state
2. Detect crashes at runtime and handle them according to the business context (e.g. crash or retry or fallback-to or ...) to prevent bad UX through crashes.
3. Detect potential crashes at compile-time to prevent the dev from forgetting to handle them according to the business context
4. Don't just detect the possibility of crashes but also the specific type and context to prevent the dev from making a logical mistake and causing a potential runtime error during error handling according to the business context
An example for stage 4 would be that the compiler checks that a fall-back option will actually always resolve the errors and not potentially introduce a new error / error type. For example, falling back to another URL does not always resolve the problem; there still needs to be handling for when the request to the alternative URL fails.
The philosophy described in the article is basically just stage 1 and a (partial) default restart instead of a default crash, which is maybe a slight improvement but not really sufficient, at least not by my personal standards.
> Through a process known as Eval-Rinse-Reload-And-Repeat, FuckItJS repeatedly compiles your code, detecting errors and slicing those lines out of the script. To survive such a violent process, FuckItJS reloads itself after each iteration, allowing the onerror handler to catch every single error in your terribly written code.
> [...]
> This will keep evaluating your code until all errors have been sliced off like mold on a piece of perfectly good bread. Whether or not the remaining code is even worth executing, we don't know. We also don't particularly care.
I just threw up in my mouth when I read this. I've never used this language so maybe my experience doesn't apply here, but I'm imagining all the different security implications that I've seen arise from failing to check error codes.
Elixir is not about willingly ignoring error codes or failure scenarios. It is about naturally limiting the blast radius of errors without a need to program defensively (as in writing code for scenarios you don’t know “just in case”).
1: https://dashbit.co/blog/writing-assertive-code-with-elixir
    ok = whatever().

If whatever is successful and idiomatic, it returns ok, or maybe a tuple of {ok, SomeReturn}. In that case, execution would continue. If it returns an error tuple like {error, Reason}... "Let it crash" says you can just let it crash... You didn't have anything better to do, so the built-in crash when ok doesn't match {error, Reason} will do fine. Or you could do a

    case whatever() of
        ok -> ok;
        {error, nxdomain} -> ok
    end.

If it was fine to get an nxdomain error, but any other error isn't acceptable... it will just crash, and that's good or at least ok. Better than having to enumerate all the possible errors, or having a catch-all that then explicitly throws an error. It's especially hard to enumerate all possible errors because the running system can change and may return a new error that wasn't enumerated when the requesting code was written. There's lots of places where crashing isn't actually what you want, and you have to capture all errors, explicitly log them, and then move on... But when you can, checking for success, or success plus a handful of expected and recoverable errors, is very nice.
This is a common misunderstanding because unfortunately the slogan is frequently misinterpreted.
So being too eager to "just crash" may turn a scenario where you fail to serve 1% of requests into a scenario where you serve none because all your processes keep restarting.
the article, if you should choose to read it, is explaining that people have the misconception you appear to be having due to the 'let it fail' catchphrase. it goes into detail about this system, when failing is appropriate, and when trying to work around errors is appropriate.
as erlang uses green threads, restarting a thread for a user API request is effectively instant and free.
Supervisors themselves form a tree, so for a crash to take down the whole app, it needs to propagate all the way to the top.
Another explanation for people familiar with exceptions in other languages: "Don't try to catch the exception inside a request handler".
You still want those processes to crash though, as it allows the runtime to automatically clean up any concurrent work. For example, if during a request you start three processes to do concurrent work, like fetching APIs, and then the request process crashes, the concurrent processes are automatically cleaned up.
In phoenix each request has its own process and crashing that process will result in a 500 being sent to the client.
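A rough Elixir sketch of that cleanup (SomeHTTPClient is made up): Task.async/1 links the spawned tasks to the request process, so if the request process dies the tasks are terminated with it.

    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

    # Inside the request-handling process: start three linked tasks.
    tasks = Enum.map(urls, fn url ->
      Task.async(fn -> SomeHTTPClient.get(url) end)
    end)

    # If the request process crashes before or during the await, the linked
    # tasks are killed automatically - no orphaned concurrent work.
    results = Task.await_many(tasks, 5_000)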
- Fundamental/Fatal error: something without which the process cannot function, e.g. we are missing an essential config option. Exiting with an error is totally adequate. You can't just heal from that as it would involve guessing information you don't have. Admins need to fix it
- Critical error: something that should not ever occur, e.g. having an active user without password and email. You don't exit; you skip it if that is possible and ensure the first occurrence is logged and admins are contacted
- Expected/Regular error: something that is expected to happen during the normal operations of the service, e.g. the other server you make requests to is being restarted and thus unreachable. Here the strategy may vary, but it could be something like retrying with random exponential backoff. Or you could briefly accept the values provided by that server are unknown and periodically retry to fill the unknown values. Or you could escalate that into a critical error after a certain amount of retries.
- Warnings: These are usually about something being not exactly ideal, but they do not impede the flow of the program at all. Usually has to do with bad data quality
If you can proceed without degrading the integrity of the system you should; the next thing is to decide how important it is for humans to hear about it.
GNU/MIT/Lisp -> Detect, offer a fix, continue.
It should be compulsory reading for anybody interested in reliable systems, even if they do not use the BEAM VM.
https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A104...
- Failures are inevitable, so systems must be designed to EXPECT and recover from them, NOT AVOID them completely.
- Let it crash philosophy allows components to FAIL and RECOVER quickly using supervision trees.
- Processes should be ISOLATED and communicate via MESSAGE PASSING, which prevents cascading failures.
- Supervision trees monitor other processes and RESTART them when they fail, creating a self-healing architecture.
Railway oriented programming to the rescue?
One is to build multiple function heads that pattern match on the arguments. If it’s an error tuple, pass it along. Build up your pipeline and handle any errors at the end.
Another is to use the `with else`[0] expression for building up a railroad. This has the benefit of not having to teach your functions how to pass along errors. Error handling in the else block can be a little gnarly.
I find it a little more manual than languages that have a `runEffect` or compose operator. In large part that’s due to the :ok, :error tuples being more of a convention than a primitive like Either/Result.
0: https://elixirschool.com/en/lessons/basics/control_structure...
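A small sketch of both approaches, with made-up parse/validate steps that each return {:ok, value} or {:error, reason}:

    defmodule Pipeline do
      def parse(raw) when is_binary(raw), do: {:ok, String.trim(raw)}
      def parse(_), do: {:error, :not_a_string}

      def validate(""), do: {:error, :empty}
      def validate(value), do: {:ok, value}

      # Approach 1: an extra head passes an error tuple through unchanged,
      # so steps can be piped and any error falls out the end.
      def then_validate({:ok, value}), do: validate(value)
      def then_validate({:error, _} = err), do: err

      def run(raw), do: raw |> parse() |> then_validate()

      # Approach 2: `with/else` builds the same railroad without teaching
      # each step how to forward errors.
      def run_with(raw) do
        with {:ok, trimmed} <- parse(raw),
             {:ok, value} <- validate(trimmed) do
          {:ok, value}
        else
          {:error, reason} -> {:error, reason}
        end
      end
    end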
If they just mean "processes should be restartable" then that sounds way more reasonable. Similar idea to this but less fancy: https://flawless.dev/
It's a pretty terrible slogan if it makes your language sound worse than it actually is.
It is about self-healing, too.
Imagine that you’re trying to access an API, which for some reason fails.
“Let it crash” isn’t an argument against handling the timeout, but rather that you should only retry a few, bounded times rather than (eg) exponentially back off indefinitely.
When you design from that perspective, you just fail your request processing (returning the request to the queue) and make that your manager’s problem. Your managing process can then restart you, reassign the work to healthy workers, etc. If your manager can’t get things working and the queue overflows, it throws it into dead letters and crashes. That might restart the server, it might page oncall, etc.
The core idea is that within your business logic is the wrong place to handle system health — and that many problems can be solved by routing around problems (ie, give task to a healthy worker) or restarting a process. A process should crash when it isn’t scoped to handle the problem it’s facing (eg, server OOM, critical dependency offline, bad permissions). Crashing escalates the problem until somebody can resolve it.
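As a rough sketch of "retry a few, bounded times" in Elixir (SomeAPIClient is made up):

    defmodule Fetcher do
      # Bounded retries with a small fixed delay; once they're exhausted,
      # return the error (or crash) and let the supervisor/manager decide.
      def fetch(url, attempts \\ 3)

      def fetch(_url, 0), do: {:error, :gave_up}

      def fetch(url, attempts) do
        case SomeAPIClient.get(url) do
          {:ok, response} ->
            {:ok, response}

          {:error, _reason} ->
            Process.sleep(200)
            fetch(url, attempts - 1)
        end
      end
    end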
For example, imagine you're working with a 3rd party API and, according to the documentation, it is supposed to return responses in a certain format. What if suddenly that API stops working? Or what if the format changes?
You could write code to handle that "what if" scenario, but if you try to handle every hypothetical, your code becomes bloated, more complicated, and hard to understand.
So in these cases, you accept that the system will crash. But to ensure reliability, you don't want to bring down the whole system. So there are primitives that let you control the blast radius of the crash if something unexpected happens.
Let it crash does not mean you skip validating user input. Those are issues that you expect to happen. You handle those just as you would in any programming language.
It can't work in the general case because replaying a sequence of syscalls is not sufficient to put the machine back in the same state as it was last time. E.g. second time around open behaves differently so you need to follow the error handling.
However sometimes that approach would work. I wonder how wide the area of effective application is. It might be wide enough to be very useful. The all or nothing database transaction model fits it well.
In order to “let it crash”, we must design the system in a way that crashes would not be catastrophic, stability wise. Letting it crash is not a commandment, though: it is a reminder that, in most cases, a smart healing strategy might be overkill.
Of course Joe Armstrong could explain what I meant, but in a much better way: https://erlang.org/pipermail/erlang-questions/2003-March/007... (edit: see the "Why was error handling designed like this?" part for reference)
My personal interpretation is that systems must be able to handle crashing processes gracefully. There is no benefit in letting processes crash just for the sake of it.
Saying "let it crash is a tagline that actually means something else because the BEAM is supposed to be used in this particular way" sounds slightly "cargo-cultish", to the point where we have to challenge the meaning of the actual word to make sense of it.
Joe Armstrong's e-mail, on the other hand, says (and I paraphrase): "the BEAM was designed from the ground up to help developers avoid the creation of ad-hoc protocols for process communication, and the OTP takes that into consideration already. Make sure your system, not your process, is resilient, and literally let processes crash." Boom. There is no gotcha there. Also, there is the added benefit that developers for other platforms now understand that the rationale is justified by the way BEAM/OTP were designed and may not be applicable to their own platforms.
I agree on the importance of defining terms, and I think the important thing here is that "process" in Joe's parlance is not an OS level process, it is one of a fleet of processes running inside the BEAM VM. And the "system" in this case is the supervisory system around it, which itself consists of individual processes.
I'm critiquing a common misunderstanding of the phrase "Let it crash", whereby effectively no local error handling is performed. This leads to worse user experiences and worse outcomes in general. I understand that you're offering critique, but it again sounds like you're critiquing a reductive element (the headline itself).
A web service returning a 500 error code is a lot more obvious than a 200 with an invalid payload. A crashed app with a stack trace is easier to debug and will cause more user feedback than an app that hangs in a retry loop.
When I had to deal with these things in the Java world, it meant not blindly handling or swallowing exceptions that business code had no business caring about. Does your account management code really think it knows how to properly handle an InterruptedException? Unless your answer is rollback and reset the interrupted flag it’s probably wrong. Can’t write a test for a particular failure scenario? That better blow up loudly with enough context that makes it possible to understand the error condition (and then write a test for it).
A 'crash' in most other languages/ecosystems likely means a catastrophic failure of the application, ending in a core dump.
Erlang's error handling is way more nuanced than that blunt phrase indicates.
>How does our philosophy of handling errors fit in with coding practices? What kind of code must the programmer write when they find an error? The philosophy is let some other process fix the error, but what does this mean for their code? The answer is let it crash. By this I mean that in the event of an error, then the program should just crash. But what is an error? For programming purpose we can say that:
>• exceptions occur when the run-time system does not know what to do.
>• errors occur when the programmer doesn’t know what to do.
>If an exception is generated by the run-time system, but the programmer had foreseen this and knows what to do to correct the condition that caused the exception, then this is not an error. For example, opening a file which does not exist might cause an exception, but the programmer might decide that this is not an error. They therefore write code which traps this exception and takes the necessary corrective action.
>Errors occur when the programmer does not know what to do. Programmers are supposed to follow specifications, but often the specification does not say what to do and therefore the programmer does not know what to do.
>[...]
>The defensive code detracts from the pure case and confuses the reader—the diagnostic is often no better than the diagnostic which the compiler supplies automatically.
Note that this "program" is a process. For a process doing work, encountering something it can't handle is an error per the above definitions, and the process should just die, since there's nothing better for it to do; for a supervisor process supervising such processes-doing-work, "my child process exited" is an exception at worst, and usually not even an exception since the standard library supervisor code already handles that.
In a more conventional language where concurrency is relatively expensive, and assuming you're not an idiot who writes 1-10k SLOC functions, you end up with functions that have a "single responsibility" (maybe not actually a single responsibility, but closer to it than having 100 duties in one function) near the bottom of your call tree, but they all exist in one thread of execution. In a hypothetical system created in this model, if your lowest-level function is something like:
retrieve_data(db_connection, query_parameters) -> data
And the database connection fails, would you attempt to restart the database connection in this function? Maybe, but that'd be bad design. You'd most likely raise an exception or change the signature so you could express an error return; in Rust and similar it would become something like:

retrieve_data(db_connection, query_parameters) -> Result<data, error>
Somewhere higher in the call stack you have a handler which will catch the exception or process the error and determine what to do. That is, the function `retrieve_data` crashes: it fails to achieve its objective and does not attempt any corrective action (beyond maybe a few retries in case the error is transient). In Erlang, you have a supervision tree which corresponds to this call tree concept, but for processes. The process handling data retrieval, having been given some db_conn handler and the parameters, will fail for some reason. Instead of handling the error in this process, the process crashes. The failure condition is passed to the supervisor, which may or may not have a handler for this situation.
You might put the simple retry policy in the supervisor (that basic assumption of transient errors, maybe a second or third attempt will succeed). It might have other retry policies, like trying the request again but with a different db_connection (that other one must be bad for some reason, perhaps the db instance it references is down). If it continues to fail, then this supervisor will either handle the error some other way (signaling to another process that the db is down, fix it, or tell the supervisor what to do) or perhaps crash itself. This repeats all the way up the supervision tree; ultimately it could mean bringing down the whole system if the error propagates to a high enough level.
This is conceptually no different than how errors and exceptions are handled in sequential, non-concurrent systems. You have handlers that provide mechanisms for retrying or dealing with the errors, and if you don't the error is propagated up (hopefully you don't continue running in a known-bad state) until it is handled or the program crashes entirely.
In languages that offer more expensive concurrency (traditional OS threads), the cost of concurrency (in memory and time) means you end up with a policy that sits somewhere between Erlang's and a straight-line sequential program. Your threads will be larger than Erlang processes so they'll include more error handling within themselves, but ultimately they can still fail and you'll have a supervisor of some sort that determines what happens next (hopefully).
As more languages move to cheap concurrency (Go's goroutines, Java's virtual threads), system designs have a chance to shift closer to Erlang than that straight-line sequential approach if people are willing to take advantage of it.
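To make the parallel concrete, a compact Elixir sketch (SomeDB and the names are made up): the worker asserts its happy path and crashes on failure, and the retry/restart policy lives in the supervisor rather than inside retrieve_data itself.

    defmodule DataFetcher do
      use GenServer

      def start_link(db_conn), do: GenServer.start_link(__MODULE__, db_conn, name: __MODULE__)
      def retrieve(params), do: GenServer.call(__MODULE__, {:retrieve, params})

      @impl true
      def init(db_conn), do: {:ok, db_conn}

      @impl true
      def handle_call({:retrieve, params}, _from, db_conn) do
        # No local error handling: if the query fails, the match fails,
        # this process crashes, and the supervisor applies its policy.
        {:ok, rows} = SomeDB.query(db_conn, params)
        {:reply, rows, db_conn}
      end
    end

    # The "handler higher up": restart the worker in a known-good state.
    children = [{DataFetcher, db_conn}]   # db_conn obtained elsewhere
    Supervisor.start_link(children, strategy: :one_for_one)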