Dbos: Durable Workflow Orchestration with Go and PostgreSQL

https://github.com/dbos-inc/dbos-transact-golang

102•Bogdanp•4mo ago

Comments

hmaxdml•4mo ago

Thanks for posting! I am one of the author, happy to answer any question!

drakenot•4mo ago

I read the Dbos vs Temporal thing, but can you speak more about if there is a different in durability guarantees?

KraftyOne•4mo ago

The durability guarantees are similar--each workflow step is checkpointed, so if a workflow fails, it can recover from the last completed step.

The big difference, like that blog post (https://www.dbos.dev/blog/durable-execution-coding-compariso...) describes, is the operational model. DBOS is a library you can install into your app, whereas Temporal et al. require you to rearchitect your app to run on their workers and external orchestrator.

dfee•4mo ago

This makes sense, but I wonder if there’s a place for DBOS, then, for each language?

For example, a Rust library. Am I missing how a go library is useful for non-go applications?

KraftyOne•4mo ago

There are DBOS libraries in multiple languages--Python, TS, and Go so far with Java coming soon: https://github.com/dbos-inc

No Rust yet, but we'll see!

saintarian•4mo ago

Great project! Love the library+db approach. Some questions:

1. How much work is it to add bindings for new languages? 2. I know you provide conductor as a service. What are my options for workflow recovery if I don't have outbound network access? 3. Considering this came out of https://dbos-project.github.io/, do you guys have plans beyond durable workflows?

KraftyOne•4mo ago

1. We also have support for Python and TypeScript with Java coming soon: https://github.com/dbos-inc

2. There are built-in APIs for managing workflow recovery, documented here: https://docs.dbos.dev/production/self-hosting/workflow-recov...

3. We'll see! :)

travisgriggs•4mo ago

Elixir? Or does Oban hew close enough, that it’s not worth it?

jiggunjer•4mo ago

Does it natively support job priorities? E.g. if there's 10 workflows submitted and I start up a worker, how does it pick the first job.

KraftyOne•4mo ago

Yeah, queue priority is natively supported: https://docs.dbos.dev/golang/tutorials/queue-tutorial#priori...

rickette•4mo ago

There's a clear text password in one of your GitHub Action workflows: https://github.com/dbos-inc/dbos-transact-golang/blob/main/....

qianli_cs•4mo ago

That password is only used by the GHA to start a local Postgres Docker container (https://github.com/dbos-inc/dbos-transact-golang/blob/main/c...), which is not accessible from outside.

plmpsu•4mo ago

How does DBOS scale in a cluster? with Temporal or Dapr Workflows, applications register running their supported workflows types or activities and the workflow orchestration framework balances work across applications. How does this work in the library approach?

Also, how is DBOS handling workflow versioning?

Looking forward for your Java implementation. Thanks

qianli_cs•4mo ago

Good questions!

DBOS naturally scales to distributed environments, with many processes/servers per application and many applications running together. The key idea is to use the database concurrency control to coordinate multiple processes. [1]

When a DBOS workflow starts, it’s tagged with the version of the application process that launched it. This way, you can safely change workflow code without breaking existing ones. They'll continue running on the older version. As a result, rolling updates become easy and safe. [2]

[1] https://docs.dbos.dev/architecture#using-dbos-in-a-distribut...

[2] https://docs.dbos.dev/architecture#application-and-workflow-...

plmpsu•4mo ago

Thanks for the reply.

So applications continuously poll the database for work? Have you done any benchmarking to evaluate the throughput of DBOS when running many workflows, activities, etc.?

qianli_cs•4mo ago

In DBOS, workflows can be invoked directly as normal function calls or enqueued. Direct calls don't require any polling. For queued workflows, each process runs a lightweight polling thread that checks for new work using `SELECT ... FOR UPDATE SKIP LOCKED` with exponential backoffs to prevent contentions, so many concurrent workers can poll efficiently. We recently wrote a blog post on durable workflows, queues, and optimizations: https://www.dbos.dev/blog/why-postgres-durable-execution

Throughput mainly comes down to database writes: executing a workflow = 2 writes (input + output), each step = 1 write. A single Postgres instance can typically handle thousands of writes per second, and a larger one can handle tens of thousands (or even more, depending on your workload size). If you need more capacity, you can shard your app across multiple Postgres servers.

odie5533•4mo ago

Even though I don't use DBOS, that blog post is gold.

hazn•4mo ago

I remember reading that restate.dev is a 'push' based workflow and therefore works well with serverless workflows: https://news.ycombinator.com/item?id=40660568

what is your input on these two topics? aka pull vs push and working well with serverless workflows

intermerda•4mo ago

I remembered reading about the DBOS paper a while back - https://arxiv.org/abs/2007.11112. Is this an evolution of that research work? If so, how did an OS for databases morph into a workflow orchestration service?

hmaxdml•4mo ago

It is an evolution. The DBOS workflow orchestrator places a DB at the center of your application to handle most of the complicated state management problems.

osigurdson•4mo ago

Did you consider using NATS? While I haven't tried this deployment model, you can embed it in a go program as a library. If you wanted something really minimal this might be an option.

I use NATS to acheive this type of durable processing. It works well. Of course, idempotent code is needed but I don't think this can be avoided.

hmaxdml•4mo ago

We decided to use Postgres because of the relational semantics, the ease of integration with user applications, and it's remarkable popularity

tester54321•4mo ago

Is it possible for you guys to write a blog post analyzing the usage of the DB (reads, writes, what is stored for each workflow any events etc) to help users planning for scale to really understand what they are signing up.

The library seems fantastic but my team did not use this because at scale they believe that the number of DB reads and writes becomes very significant for a large number of workflows with many steps and that with PG vs Cassandra/ScyllaDB it would not be feasible for our throughput. I tried to convince them otherwise but it is difficult to quantify from the current documentation.

hmaxdml•4mo ago

Good call. We'll see how to integrate it in our docs better.

The cost of DBOS durable execution is 1 write per step (checkpoint the outcome) and 2 additional writes per workflows (upsert the workflow status, checkpoint the outcome). The write size is the size of your workflows/steps output.

Postgres can support several thousands writes per seconds (influenced by the write size, ofc): DBOS can thus support several thousands of workflows/steps per second.

Postgres scales remarkably well. In fact, most org will never out scale a single, vertically scaled Postgres instance. There's a very good write up by Figma telling how they scaled Postgres horizontally: https://www.figma.com/blog/how-figmas-databases-team-lived-t...

plmpsu•4mo ago

What did your team decide to go with eventually?

danggit•4mo ago

Looks interesting

chc4•4mo ago

> Exactly-Once Event Processing

This sounds...impossible? If you have some step in your workflow, either you 1) record it as completed when you start, but then you can crash halfway through and when you restore the workflow it now isn't processed 2) record it as completed after you're done, but then you can crash in-between completing and recording and when you restore you run the step twice.

#2 sounds like the obvious right thing to do, and what I assume is happening, but is not exactly once and you'd need to still be careful that all of your steps are idempotent.

KraftyOne•4mo ago

The specific claim is that workflows are started exactly-once in response to an event. This is possible because starting a workflow is a database transaction, so we can guarantee that exactly one workflow is started per (for example) Kafka message.

For step processing, what you say is true--steps are restarted if they crash mid-execution, so they should be idempotent.

reillyse•4mo ago

"Exactly-Once Event Processing" is the headline claim - I actually missed the workflow starting bit. So what happens if the workflow fails? Does it get restarted (and so we have twice-started) or does the entire workflow just fail ? Which is probably better described as "at-most once event processing"

qianli_cs•4mo ago

I think a clearer way to think about this is "at least once" message delivery plus idempotent workflow execution is effectively exactly-once event processing.

The DBOS workflow execution itself is idempotent (assume each step is idempotent). When DBOS starts a workflow, the "start" (workflow inputs) is durably logged first. If the app crashes, on restart, DBOS reloads from Postgres and resumes from the last completed step. Steps are checkpointed so they don't re-run once recorded.

reillyse•4mo ago

Why would u need exactly once semantics if the workflow is idempotent?

You specifically need exactly once when the action you are doing is not idempotent.

bjornsing•4mo ago

"Exactly-Once Event Processing" is possible if (all!) the processing results go into a transactional database along with the stream position marker in a single transaction. That’s probably the mechanism they are relying on.

osigurdson•4mo ago

Exactly once is usually just at least once with a unique message ID and dedupe window.

jiggunjer•4mo ago

Sounds exactly like how Temporal markets itself. I find that the burden of creating idempotent sub-steps in the workflow falls on the developer, regardless of checkpoints and state management at the workflow level.

KraftyOne•4mo ago

Yes, in any durability framework there's still the possibility that a process crashes mid-step, in which case you have no choice but to restart the step.

Where DBOS really shines (vs. Temporal and other workflow systems) is a radically simpler operational model--it's just a library you can install in your app instead of a big heavyweight cluster you have to rearchitect your app to work with. This blog post goes into more detail: https://www.dbos.dev/blog/durable-execution-coding-compariso...

jiggunjer•4mo ago

Oh I see. Seems Nextflow is a strong contender in the serverless orchestrator market (serverless sounds better than embedded).

From what I can tell though, NF just runs a single workflow at a time, no queue or database. It relies on filesystem caching for "durability". That's changing recently with some optional add-ons.

bjornsing•4mo ago

> Yes, in any durability framework there's still the possibility that a process crashes mid-step, in which case you have no choice but to restart the step.

Golem [1] is an interesting counterexample to this. They run your code in a WASM runtime and essentially checkpoint execution state at every interaction with the outside world.

But it seems they are having trouble selling into the workflow orchestration market. Perhaps due to the preconception above? Or are there other drawbacks with this model that I’m not aware of?

1. https://www.golem.cloud/post/durable-execution-is-not-just-f...

qianli_cs•4mo ago

I think one potential concern with "checkpoint execution state at every interaction with the outside world" is the size of the checkpoints. Allowing users to control the granularity by explicitly specifying the scope of each step seems like a more flexible model. For example, you can group multiple external interactions into a single step and only checkpoint the final result, avoiding the overhead of saving intermediate data. If you want finer granularity, you can instead declare each external interaction as its own step.

Plus, if the crash happens in the outside world (where you have no control), then checkpointing at finer granularity won't help.

bjornsing•4mo ago

Sure you get more control with explicit state management. But it’s also more work, and more difficult work. You can do a lot of writes to NVMe for one developer salary.

jedberg•4mo ago

It's not really more work to be explicit about the steps and workflows. You already have to break your code into steps to make your program run. Adding a single decorator isn't much extra work at all.

vineyardmike•4mo ago

That still fundamentally suffers the same idempotency problem as any other system. When interacting with the outside world, you, the developer, need to be idempotent and enforce it.

For example, if you call an API (the outside world) to charge the user’s credit card, and the WASM host fails and the process is restarted, you’ll need to be careful to not charge again. This can happen after the request is issued, but before the response is received/processed.

This is no different than any other workflow library or service.

The WASM idea is interesting, and maybe lets you be more granular in how you checkpoint (eg for complex business logic that is self-contained but expensive to repeat). The biggest win is probably for general preemption or resource management, but those are generally wins for the provider not the user. Also, this requires compiling your application into WASM, which restricts which languages/libraries/etc you can use.

bjornsing•4mo ago

The challenges around idempotency remain to some extent, yes. But you have that problem even in non-workflow code, so the usual patterns will just work with no extra mental effort from the developer.

jedberg•4mo ago

The biggest downsides to their methodology are that the snapshots can get really big really quickly, and that they are hard to introspect since they are binary blobs of memory dumps.

bjornsing•4mo ago

Yeah the whole methodology depends on forgetting about state and treating it as a long-running program. If you need to look at the state then you connect a debugger, etc.

odie5533•4mo ago

For a project with minimal users, we get a lot of DBOS posts.

Jtsummers•4mo ago

Not really.

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu... - Not even 3 full pages worth over the past 5 years, though the first page is entirely from this year. It's maybe 2-3 a month on average this year, and a lot are dupes.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... - Nim, for comparison, which doesn't really make a dent in the programming world but shows up a lot. The first 15 pages covers the same time period.

KraftyOne•4mo ago

We may be a small startup, but we're growing fast with no shortage of production users who love our tech: https://www.dbos.dev/customer-stories

krashidov•4mo ago

Why such negativity? What have you shipped?

anbotero•4mo ago

I think this post is riddled with bots. One is marketing another competing offer, others complaining about the UI component not being open source too...

cpursley•4mo ago

pgglow is also worth looking at, all the work happens in Postgres:

https://www.pgflow.dev

They use supabase (demo functions) as client but could be anything.

cpursley•4mo ago

pgglow is also worth looking at, all the work happens in Postgres:

https://www.pgflow.dev

They use supabase (demo functions) as client but could be your language of choice.

cpursley•4mo ago

pgglow is also worth looking at, all the work happens in Postgres:

https://www.pgflow.dev

They use supabase (demo functions) as client but could be your language of choice.

lordofgibbons•4mo ago

This is perfect. I've been growing weary of Go projects that can be used as a library, but for some strange reason must to be used as an external service - complicating the infrastructure. Looking forward to using this in a future project.

atombender•4mo ago

I've played with DBOS and like it, but I have to say the lack of a control plane and UI in the open source version means it may have a hard time competing with Temporal, which provides both.

Being able to see the state of workflows and their histories is a key part of having an application in production. Without a control plane, my understanding is that DBOS can't offer the same kind of failure recovery as Temporal, though it's unclear to me how the "Transact" engine does this.

A big benefit that Temporal's architecture provides is separation of concerns. Temporal can coordinate workflows across many apps, whereas with DBOS each app (as far as I understand it, at least) is a silo managing its own queues.

hmaxdml•4mo ago

DBOS stores all the workflow metadata in postgres, which is readily queryable for observability. We've recently seen a user setup an entire Grafana dashboard to observe their numerous workflows.

A postgres server can host many databases, and multiple applications can use the same server. The same dashboard can be used to monitor them all.

With respect to recovery: A new Transact process will run a round of recovery at startup. Transact also exposes an admin server with a recovery endpoint.

For more elaborate scenarios, we have control plane options commercially available.

atombender•4mo ago

Temporal has a full-fledged UI where I can drill down into individual workflow runs and see the graph of activities, detail logs, retry counts, inputs and outputs, and so on. Temporal also has an API to introspect this without reaching into a database.

You can share a database server with DBOS, but it's common to give applications dedicated database resources (one Postgres cluster per app in different regions), meaning it won't work with DBOS unless you write your own federated control layer that can speak to multiple instances. Which is also not offered out of the box. Sharing one DBOS-specific server across all apps would introduce a single point of failure.

Again, I like DBOS, but right now the value proposition isn't that great given that Temporal has already nailed this.

KraftyOne•4mo ago

DBOS also has a full-fledged workflow visualization and management UI: https://docs.dbos.dev/golang/tutorials/workflow-management

atombender•4mo ago

Not in the open source version? It requires the commercial Conductor thing.

d0100•4mo ago

Is there a nice interface to visualize all workers/workflows running like temporal has?

Even better if the interface is also embeddable into a go http handler

KraftyOne•4mo ago

Yes, there's a full workflow visualization/management interface (not embeddable though): https://docs.dbos.dev/golang/tutorials/workflow-management

Crypto Deposit Frauds

Substack makes money from hosting Nazi newsletters

Framing an LLM as a safety researcher changes its language, not its judgement

Are there anyone interested about a creator economy startup

Show HN: Skill Lab – CLI tool for testing and quality scoring agent skills

2003: What is Google's Ultimate Goal? [video]

Roger Ebert Reviews "The Shawshank Redemption"

Busy Months in KDE Linux

Zram as Swap

Green’s Dictionary of Slang - Five hundred years of the vulgar tongue

Nvidia CEO Says AI Capital Spending Is Appropriate, Sustainable

Show HN: StyloShare – privacy-first anonymous file sharing with zero sign-up

Part 1 the Persistent Vault Issue: Your Encryption Strategy Has a Shelf Life

Show HN: Teleop_xr – Modular WebXR solution for bimanual robot teleoperation

The Highest Exam: How the Gaokao Shapes China

Open-source framework for tracking prediction accuracy

India's Sarvan AI LLM launches Indic-language focused models

Show HN: CryptoClaw – open-source AI agent with built-in wallet and DeFi skills

ShowHN: Make OpenClaw respond in Scarlett Johansson’s AI Voice from the Film Her

CReact Version 0.3.0 Released

Show HN: CReact – AI Powered AWS Website Generator

The rocky 1960s origins of online dating (2025)

Show HN: Agent-fetch – Sandboxed HTTP client with SSRF protection for AI agents

Why there is no official statement from Substack about the data leak

Effects of Zepbound on Stool Quality

Show HN: Seedance 2.0 – The Most Powerful AI Video Generator

Ask HN: Do we need "metadata in source code" syntax that LLMs will never delete?

Pentagon cutting ties w/ "woke" Harvard, ending military training & fellowships

Can Quantum-Mechanical Description of Physical Reality Be Considered Complete? [pdf]

Kessler Syndrome Has Started [video]

Crypto Deposit Frauds

Substack makes money from hosting Nazi newsletters

Framing an LLM as a safety researcher changes its language, not its judgement

Are there anyone interested about a creator economy startup

Show HN: Skill Lab – CLI tool for testing and quality scoring agent skills

2003: What is Google's Ultimate Goal? [video]

Roger Ebert Reviews "The Shawshank Redemption"

Busy Months in KDE Linux

Zram as Swap

Green’s Dictionary of Slang - Five hundred years of the vulgar tongue

Nvidia CEO Says AI Capital Spending Is Appropriate, Sustainable

Show HN: StyloShare – privacy-first anonymous file sharing with zero sign-up

Part 1 the Persistent Vault Issue: Your Encryption Strategy Has a Shelf Life

Show HN: Teleop_xr – Modular WebXR solution for bimanual robot teleoperation

The Highest Exam: How the Gaokao Shapes China

Open-source framework for tracking prediction accuracy

India's Sarvan AI LLM launches Indic-language focused models

Show HN: CryptoClaw – open-source AI agent with built-in wallet and DeFi skills

ShowHN: Make OpenClaw respond in Scarlett Johansson’s AI Voice from the Film Her

CReact Version 0.3.0 Released

Show HN: CReact – AI Powered AWS Website Generator

The rocky 1960s origins of online dating (2025)

Show HN: Agent-fetch – Sandboxed HTTP client with SSRF protection for AI agents

Why there is no official statement from Substack about the data leak

Effects of Zepbound on Stool Quality

Show HN: Seedance 2.0 – The Most Powerful AI Video Generator

Ask HN: Do we need "metadata in source code" syntax that LLMs will never delete?

Pentagon cutting ties w/ "woke" Harvard, ending military training & fellowships

Can Quantum-Mechanical Description of Physical Reality Be Considered Complete? [pdf]

Kessler Syndrome Has Started [video]

Dbos: Durable Workflow Orchestration with Go and PostgreSQL

Comments