> How I solved a distributed queue problem after 15 years
Well… how? The post is a nice description of durable queues, but it never explicitly says they’re a solution to a distributed queue problem, nor does it specifically define such a problem.
Is “durable queue” a brand name of a DBOS feature? Because the post doesn’t even say “and here’s how you can use DBOS for durable queues,” nor does it compare it to Kafka or any other “durable queue” solution that’s emerged in the fifteen years since the author used RabbitMQ… (btw, isn’t RMQ a durable queue…?)
> "Durable queues were rare when I was at Reddit, but they’re more and more popular now. Essentially, they work by combining task queues with durable workflows, helping you reliably orchestrate workflows of many parallel tasks."
That makes it sound like task queues + durable workflows = durable queues, but that's not true at all. A durable queue is literally a queue that doesn't drop messages, e.g. during an unexpected shutdown. That's all. Durable workflows are a pretty different thing. A durable queue can be used just like a normal queue, but while you can't build a durable workflow on a normal queue (or at least, it would be a huge pain), a durable queue makes it vastly simpler to build a durable workflow engine.
I think the article talks about durable workflows because this is DBOS, a company looking to sell durable workflow services, but also because durable workflows are considered by many to be a kind of "holy grail" of big-business applications: they seem to let you write code that's kind of "always running", where the state in memory is persisted to a DB invisibly so that you have to think less about CRUD. The killer app of durable workflows seems to me to be writing orchestration code for really long-running processes which have to do lots of distributed stuff, as it allows you to write mostly normal-looking code which does things like "wait for this thing to finish, even if that thing will be finished in a week", which is a pretty cool thing to see.
What are durable workflows? On the technical side, I'd describe durable workflows as something like cooperative multitasking where you serialize your state/inputs/outputs to a durable store at each yield/suspension point. Since you're tracking state at yield points and not at the individual instruction level, the workflow engine tracks work state less granularly than traditional single-process computing. Because of that coarser tracking of units of execution, I think of durable workflows more like async runtimes which serialize their progress.
The hidden downside of durable workflows is that you have to write odd-looking code to fit into that custom async runtime. For example, since the unit of execution is coarse, you have to assume the code between checkpoints could run multiple times if, e.g., a worker gets only halfway through executing the next "chunk" and shuts down unexpectedly before finishing. So you have to assume at-least-once execution instead of our typical "exactly once" execution when thinking about individual lines of code. Additionally, while some languages are built to support custom async runtimes, even the ones that are don't have enough flexibility to allow language-level support for the extremely weird distributed async runtimes you'd need to build a durable workflow engine. Because of that, once you get down to it you're basically going to have to build your code out of callbacks that you register with the custom workflow-engine library of the provider you're using. This is the biggest wart of building on durable workflow platforms, as they pretty much all have you write code that looks like this:
function do_thing(foo): bar {} // turns foo into bar, w/ side effects
var result bar // result is type bar
result = workflow.execute(do_thing, fooInput)
// In "normal" non-durable-workflow code
// you'd instead just call do_thing() like:
result = do_thing(fooInput)
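(To make the at-least-once point above concrete, here's a minimal Python sketch of checkpoint-and-replay. It's not any real engine's API; `run_step` and the `store` dict are hypothetical stand-ins for the engine's durable state.)
# Hypothetical sketch: checkpoint step outputs so replay can skip completed steps.
def run_step(store, step_name, fn, *args):
    if step_name in store:          # a prior run already finished this step: reuse its output
        return store[step_name]
    result = fn(*args)              # a crash right after this line means fn() runs again on recovery
    store[step_name] = result       # checkpoint the output
    return result

def charge_card(order):
    return f"charge-for-{order}"    # stand-in side effect; must tolerate running more than once

def send_receipt(charge_id):
    return f"receipt-for-{charge_id}"

def workflow(store, order):
    charge_id = run_step(store, "charge_card", charge_card, order)
    return run_step(store, "send_receipt", send_receipt, charge_id)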
That's another detail left out of the parent article: there are actually a ton of these durable workflow platforms, of which DBOS is only one. I think the biggest in the online space is probably Temporal (the one I'm currently using at $DAYJOB), but there are others as well. Here's a short list:
- Temporal https://temporal.io/
- DBOS https://www.dbos.dev/
- Inngest https://www.inngest.com/uses/durable-workflows
- Restate https://restate.dev/
Anyway, thanks for coming to my TED talk, I hope you've learned about this fascinating developing corner of software, and I can't wait for someone to build first-party language support for pluggable durable execution runtimes into languages we like. Then we can get rid of callback nonsense and start a whole NEW hype cycle around this technology!
The reason none of the others were mentioned is that they all work very differently from DBOS. All of those others require an external durability coordinator, and require you to rewrite your application to work around how they operate.
DBOS is a library that does its durability work in-process and uses the application database to store the durability state. This means the latency is much lower, and the reliability is much higher, because there aren't extra moving parts in the critical path that can go down.
Here is a page about this difference: https://docs.dbos.dev/architecture
You mention that you don't have to rewrite your application to work around how DBOS operates. That seems somewhat true, but I think DBOS still requires folks to rewrite their code around a custom runtime. Looking at the Python code on your home page, it seems like you're leveraging Python's decorators to make the "glue code" less prominent (registering functions with the async executor, telling the async system to invoke certain registered functions), but the glue code is still there. If I go look at the DBOS library for Golang[1], for example, since Golang doesn't have decorators the way Python does, we still end up with code in the kind of "manual callback" style I mentioned:
// code is massively paraphrased for brevity, err checks removed
func workflow(dbosCtx dbos.DBOSContext, _ string) (string, error) {
    _, _ = dbos.RunAsStep(dbosCtx, func(ctx context.Context) (string, error) { return stepOne(ctx) })
    return dbos.RunAsStep(dbosCtx, func(ctx context.Context) (string, error) { return stepTwo(ctx) })
}
func main() {
    // Initialize a DBOS context
    dctx, err := dbos.NewDBOSContext(dbos.Config{DatabaseURL: "...", AppName: "myapp"})
    // Register a workflow
    dbos.RegisterWorkflow(dctx, workflow)
    // Launch DBOS
    err = dctx.Launch()
    defer dctx.Cancel()
    // Run a durable workflow and get its result
    handle, err := dbos.RunWorkflow(dctx, workflow, "")
    res, err := handle.GetResult()
    fmt.Println("Workflow result:", res)
}
I don't think that's a bad thing though; I think that's a good thing. I feel like positioning DBOS as a _library_ is an excellent choice, and it's a huge ergonomics improvement. The choices so far make it seem like you're trying to make DBOS easy to adopt via appropriate amounts of convenience features, but not so much automagic that we-the-devs can't reason about what's going on. With developer reasoning in mind, I have some more questions for you!
In the architecture page you linked[2], you talk about versioning. Versioning with durable workflows is one of those super-annoying things which affect the entire paradigm, albeit only once you've already adopted the tech and start having to change/evolve/maintain workflows. In that doc, you say that with DBOS, each application will only work on workflows started by application versions which match the current application version. For completing long-running workflows, the page says:
> To safely recover workflows started on an older version of your code, you should start a process running that code version.
Since one of the killer apps of durable workflows is, as I mentioned, long-running jobs, do you have any products/advice/documentation for this pattern of running multiple application versions, and how one might approach implementing it? If we're writing code which takes a week to complete and may exit and recover many times before finally completing, do you have advice on how to keep each version deployed until all the work for that version is completed? Looking at Temporal, when using their Worker Versioning scheme they offer ways for users to look this information up in Temporal, but not much guidance on actually implementing the pattern. Looking at the DBOS docs about versioning, I see information about getting this information via e.g. Conductor, but I also do not see any info about actually implementing multiple-concurrent-worker-version deployments (which Temporal calls "rainbow deployments"). Is version management something y'all are thinking about improving the ergonomics of, in the same way you improved ergonomics by bringing the executor in-process?
Speaking of versioning, how does DBOS handle bugfix versions? Say you deploy version A, but A has a bug in it. You'd like to make the fix, deploy it as version B, and then ideally run the remaining version-A workflows using the version-B code. It seems like "version forking"[3] is the only way to do this, but it also seems like it's a special operation that cannot be done via a code change; it must be done via the Conductor administration UI. Is there no way to do in-code version patching[4] like there is in Temporal?
Finally, what are the limits to usage of DBOS? As in, where does DBOS start to fall down? Are there guidelines on the maximum number of steps in a workflow before things start to get tricky? What about the maximum serialized size of the workflow/step parameters? I've been unable to find any of that information on your website.
Thanks for making such an interesting piece of technology, and thanks for answering questions!
[1] - https://github.com/dbos-inc/dbos-transact-golang
[2] - https://docs.dbos.dev/architecture
[3] - https://docs.dbos.dev/production/self-hosting/workflow-manag...
[4] - https://docs.temporal.io/develop/go/versioning#patching
For versioning, we recommend keeping each version running until all workflows on that version are done. It's similar to a blue-green deployment: each process is tagged with one version, and all workflows in it share that version. You can list pending/enqueued workflows on the old version (UI or list_workflow programmatic API), and once that list drains, you can shut down the old processes. DBOS Cloud automates this, and we'll add more guidance for self-hosting.
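As a rough sketch of what that drain check could look like from the Python SDK (the filter names here are illustrative, not exact; see the list_workflows docs for the real signature):
# Illustrative sketch only: poll until no workflows remain on the old version.
import time
from dbos import DBOS

OLD_VERSION = "v42"  # hypothetical application version tag

def old_version_drained() -> bool:
    # Assumed filter names; check the docs for the real ones.
    pending = DBOS.list_workflows(app_version=OLD_VERSION, status=["PENDING", "ENQUEUED"])
    return len(pending) == 0

while not old_version_drained():
    time.sleep(60)  # keep the old-version processes running until the list drains
print("Old version drained; safe to shut down those processes.")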
For bugfixes, DBOS supports programmatic forking and other workflow management tools [1]. We deliberately don't support code patching because it's fragile and hard to test. For example, patches can pile up on long-running workflows and make debugging painful.
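In code, a programmatic fork looks roughly like this (simplified Python sketch; argument names are approximate, see [1] for the real interface):
# Simplified sketch of forking a buggy workflow onto fixed code; details approximate.
from dbos import DBOS

# Re-run the rest of a workflow that hit the bug, reusing the checkpoints
# from the steps that already completed successfully.
handle = DBOS.fork_workflow("wf-123", 3)  # hypothetical workflow ID, resume from step 3
result = handle.get_result()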
The main limit is the database (whose size you control). DBOS writes workflow inputs, step outputs, and workflow outputs to it. There's no step limit beyond disk space. Postgres/SQLite allow up to 1 GB per field, but keeping inputs/outputs under ~2 MB helps performance. We'll add clearer guidelines to the docs.
Thanks again for all the thoughtful questions!
[1] https://docs.dbos.dev/python/reference/contexts#fork_workflo...
I also now have the dreadful notion of debugging a non-deterministic deadlock or race condition in a workflow that takes a week to run!
Also, check out the sibling comment for more information about durability.
Instead of forcing you into a custom async runtime, DBOS lets you keep writing normal functions (this is an example in Python):
@DBOS.workflow()
def do_thing(foo):
    return bar
# You can still call the workflow function like this:
result = do_thing(fooInput)
Under the hood, DBOS checkpoints inputs/outputs so it can recover after failure, but you don't have to restructure your code around callbacks. In Python and Java we use decorators/annotations so registration feels natural, while in Go/TypeScript there's a lightweight one-time registration step. Either way, you keep the synchronous call style you'd expect.
On top of that, DBOS also supports running workflows asynchronously or through queues, so you can start with a simple function call and later scale out to async/queued execution without changing your code. That's what the article was leading into.
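For example, enqueueing the same workflow looks something like this (simplified sketch; see the docs for the exact Queue/enqueue details):
# Simplified sketch: the same function, called directly or through a durable queue.
from dbos import DBOS, Queue

queue = Queue("example_queue")

@DBOS.workflow()
def do_thing(foo):
    return f"processed-{foo}"

# Direct, synchronous call:
result = do_thing("input")

# Or enqueue it for durable background execution and wait for the result:
handle = queue.enqueue(do_thing, "input")
result = handle.get_result()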
Can you explain what makes DBOS better to use in Golang vs Temporal?
"since Golang doesn't have decorators in the same way Python does, we still have to have code doing the kind of "manual callback" style I mentioned"
That's exactly right, specifically for steps. We considered other ways to wrap the workflow calls (so you don't have to do dbos.RunWorkflow(yourFunction)), but they got in the way of providing compile time type checking.
As Qian said, under the hood the Golang SDK is an embedded orchestration package that just requires Postgres to automate state management.
For example, check the RunWorkflow implementation: https://github.com/dbos-inc/dbos-transact-golang/blob/0afae2...
It does all the durability logic in-line with your code and doesn't rely on an external service.
Thanks for taking the time to share your insights! This was one of the most interesting HN comments I've seen in a while :)
More recently discussed examples are OCaml's effect system and the Flix programming language.
No AI was used in creating this content. Your accusation of me being soulless warrants further investigation though.
I had an exponential socket-growth issue in the late 1990s as the architect and maintainer at a little-known ecommerce company that was powering a good portion of domestic U.S. volume. The problem was like drinking from a fire hose that could not be turned off, while the hose somehow had to be upgraded in motion to handle even more volume. I was young then and too inexperienced to truly comprehend the task I had to solve, and as the volume grew slightly every day, response times grew by milliseconds on transactions that few could see but everyone internally understood the implications of. We managed to acquire the latest multi-SMP hardware from Compaq, but even that did not solve the challenges of the increasing volume. I did solve the issue, and it involved my third complete rewrite of the entire software stack into what is now recognized as microservices and durable message queueing. Those choices continue to stand the test of time.
There’s nothing inherently different about the durability of Postgres that makes it better than Kafka for implementing durable workflows. There are many reasons it’s a better choice for building a system like DBOS to implement durable workflows – ranging from ergonomics to ecosystem compatibility. But in theory you could build the same solution on Kafka, and if the company were co-founded by the Kafka creators rather than Michael Stonebraker, maybe they would have chosen that.
That is sort of danced around a bit in this article where the author is talking about dropped messages, etc. It is tempting to say "use a stream server" but ultimately stream servers make head-of-line accounting the consumer's responsibility. That's usually solved with some kind of (not distributed) lock.
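To make the head-of-line accounting concrete, here's a rough sketch (not tied to any particular client library) of a consumer that can only commit a contiguous prefix of offsets, guarded by exactly the kind of local lock mentioned above:
import threading

class OffsetTracker:
    """Tracks completed offsets and only advances the committable position
    past a contiguous prefix -- the head-of-line accounting the consumer owns."""
    def __init__(self, start: int):
        self._lock = threading.Lock()  # the (not distributed) lock mentioned above
        self._next = start             # next offset we are allowed to commit up to
        self._done = set()             # completed offsets beyond the contiguous prefix

    def mark_done(self, offset: int) -> int:
        with self._lock:
            self._done.add(offset)
            while self._next in self._done:
                self._done.remove(self._next)
                self._next += 1
            return self._next          # safe commit position: everything below is done

tracker = OffsetTracker(start=0)
for off in (2, 0, 1):                  # completions arrive out of order
    print(tracker.mark_done(off))      # prints 0, 1, 3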