How to Test the Reliability of Durable Execution

Testing software at scale is difficult because the surface area to test is so large. There is a near-infinite space of potential test cases arising from specific failures happening across different processes at specific timings.

In this post, we want to share our learnings and best practices from testing large-scale deployments of DBOS, a lightweight durable workflow library we’re building to help developers write reliable, fault-tolerant software. Given the complexity of the problem space, a good testing strategy can’t solely rely on a fixed set of unit and integration tests (although those remain useful). Instead, tests must explore the space of possible failures, either mathematically or through carefully induced randomness. Some of the most prominent techniques the industry has developed include:

Formal verification: This approach involves writing a formal specification in a language like TLA+ and proving that it satisfies desired properties. It’s mathematically rigorous and has been used successfully at scale by companies like Amazon (see this paper or this blog post from Marc Brooker).

Unfortunately, formal verification remains prohibitively difficult to adopt. Maintaining both formal specs and code implementations adds significant development overhead, and there's no good way to formally model interactions with complex real-world dependencies like Postgres (which is core to DBOS). Moreover, formal verification only goes so far: even if you prove your spec is correct, there's no guarantee your code correctly implements it.

Deterministic Simulation Testing: This relatively new strategy involves building a testing harness that can run an entire system deterministically on a single thread (for more detail, see this blog post from Phil Eaton). That way, it is possible to fully control event ordering and perform reproducible fault injections. Deterministic simulation is challenging to implement, but recent work has made it approachable (for example, using the Antithesis deterministic hypervisor).

One major challenge in adopting deterministic simulation testing is dealing with external dependencies. You can’t directly test how your system integrates with other systems; instead, you have to mock your dependencies. However, mocking complex dependencies like Postgres will almost certainly miss some behaviors and obscure critical bugs, making this strategy problematic for any system that depends on them. That said, there’s been some recent work (particularly from Antithesis) on better simulation of external dependencies, so this is a space worth watching.

Chaos Testing: This approach tests a system by running complex real workloads while randomly injecting faults such as server crashes, network delays, disconnections, or latency spikes. Some prominent examples of chaos testing at scale include Netflix’s Simian Army (co-developed by DBOS CEO Jeremy Edberg) and the famous Jepsen tests for databases.

At DBOS, we heavily rely on chaos testing because it stresses the system under realistic conditions and directly tests the critical interactions between DBOS and Postgres.

How Chaos Tests Work

Our chaos test suite launches hundreds of distributed processes and subjects them to complex workloads that exercise various DBOS features, including durable workflows, queues, notifications, and more. While these workloads run, the test suite randomly injects failures to simulate real-world issues, including:

  • Random process crashes and restarts.
  • Unexpected database disconnections and crashes.
  • Rolling out new application versions mid-flight, with updated code running alongside older versions.
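
To give a sense of what fault injection looks like in practice, here is a rough sketch, not our actual harness: worker.py is a hypothetical stand-in for an application process running DBOS workflows, and the loop simply crashes and restarts random processes at random intervals.

```python
import random
import subprocess
import time

# Illustrative sketch only, not the actual DBOS chaos harness.
# "worker.py" stands in for an application process running DBOS workflows.
def start_worker() -> subprocess.Popen:
    return subprocess.Popen(["python", "worker.py"])

def crash_loop(num_workers: int = 10, duration_s: float = 600.0) -> None:
    workers = [start_worker() for _ in range(num_workers)]
    deadline = time.time() + duration_s
    while time.time() < deadline:
        time.sleep(random.uniform(1.0, 10.0))  # wait a random interval
        victim = random.randrange(len(workers))
        workers[victim].kill()                 # crash a random process
        workers[victim].wait()
        workers[victim] = start_worker()       # then bring it back up
    for w in workers:
        w.terminate()
```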

The chaos tests verify that DBOS maintains key properties despite unpredictable failures. Some of the core invariants we validate include:

  • Workflow durability. Every submitted workflow eventually completes, despite process failures, restarts, and database disconnections.
  • Queue reliability. All enqueued workflows eventually dequeue and complete, despite process failures, restarts, and database disconnections.
  • Queue flow control. Processes always respect queue concurrency and flow control limits, never running more workflows than allowed, even if workflows are repeatedly interrupted and restarted.
  • Message delivery. If a notification message is successfully sent to a workflow, the workflow always receives the message, even if repeatedly interrupted and restarted.
  • Version correctness. Workflows are never dequeued or recovered by a process running a different application version, even when processes running multiple application versions are active concurrently.
  • Timeout enforcement. Workflow timeouts are always respected, even if a workflow is repeatedly interrupted and restarted across different processes.
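
Safety invariants like queue flow control can't be verified with a single end-of-run assertion. One possible approach, sketched below with count_running as a hypothetical stand-in for however the harness counts in-flight workflows on a queue, is to sample continuously while faults are being injected:

```python
import time
from typing import Callable

def monitor_queue_concurrency(
    queue_name: str,
    limit: int,
    count_running: Callable[[str], int],  # hypothetical: counts in-flight workflows
    duration_s: float = 600.0,
) -> None:
    """Periodically sample how many workflows are running on a queue and
    assert that the configured concurrency limit is never exceeded."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        running = count_running(queue_name)
        assert running <= limit, (
            f"{queue_name}: {running} workflows running, limit is {limit}"
        )
        time.sleep(0.5)
```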

To make chaos tests robust, it's critical to design them to validate only these system-wide invariants, not test-specific assertions. For example, a conventional integration test might send an HTTP request to a server to start a workflow, validate that it returns a 200 response, then later send another HTTP request to check if the workflow completed and validate that it also returns a 200 response. However, a chaos test making such checks would have a high false positive rate, as servers might crash mid-request.

Instead of checking whether individual requests succeed or fail, a chaos test should validate the underlying invariant: that a successfully started workflow eventually completes. For example, the chaos test could request that a server start a workflow with a specific ID, retrying many times until the request succeeds. Then the test should poll the server to check the workflow's status until either the server confirms the workflow has succeeded or a sufficiently long timeout has passed. Such tests might use a utility like retry_until_success, which retries the assertion of an invariant hundreds of times until it either passes or times out. Checks like this ensure tests only fail when an invariant is broken, not due to transient issues.
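
A minimal version of such a helper might look like the sketch below; the utility in our test suite may differ in detail, and get_workflow_status in the usage comment is a hypothetical client call.

```python
import time

def retry_until_success(assertion, attempts: int = 300, delay_s: float = 1.0):
    """Run `assertion` repeatedly until it passes or attempts run out.

    The assertion should raise (e.g. via `assert` or a failed request) while
    the invariant does not yet hold; the final exception propagates so the
    test fails loudly once the retries are exhausted.
    """
    for attempt in range(attempts):
        try:
            return assertion()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay_s)

# Usage: a successfully started workflow must eventually complete, where
# get_workflow_status stands in for an HTTP call to the server under test:
#
#   def workflow_completed():
#       assert get_workflow_status("wf-123") == "SUCCESS"
#
#   retry_until_success(workflow_completed)
```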

Anatomy of a Bug

Rigorous chaos testing has helped uncover several subtle and hard-to-reproduce bugs. Here's a concrete example of a bug we caught and fixed thanks to chaos testing.

The symptom: Occasionally, a workflow would get stuck while waiting for a notification using DBOS.recv(). The notification was successfully sent and recorded in the database, but the workflow never woke up and consumed it.

The investigation: Digging deeper, we noticed this freeze only happened immediately after a database disconnection. DBOS would recover the pending workflow automatically, but sometimes the workflow would still hang. The issue was rare (most of the time the workflow resumed as expected), but it pointed us toward a possible race condition involving database disconnects in DBOS.recv().

The root cause: The issue was in how DBOS.recv() handled concurrent executions of the same workflow. To avoid duplicate execution, it used a map of condition variables to coordinate across concurrent executions. When two instances of the same workflow attempted to receive the same notification, one would acquire the condition variable and wait for the notification, while the other would detect the conflict and wait for the first to finish.

The problematic code followed the pattern in the simplified sketch below (illustrative Python, not the exact DBOS source; the map, cursor, and helper names are hypothetical):
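
```python
import threading

# Simplified sketch of the pattern described in this post, not the actual
# DBOS source. `conditions` is a process-wide map of condition variables,
# and `db` is assumed to expose a DB-API style cursor.
def recv(db, conditions: dict, workflow_id: str, topic: str, timeout: float):
    condition = threading.Condition()
    # Register this execution as the receiver for (workflow_id, topic); if a
    # concurrent execution of the same workflow got there first, wait for it.
    existing = conditions.setdefault((workflow_id, topic), condition)
    if existing is not condition:
        with existing:
            existing.wait(timeout)
        return None

    with condition:
        with db.cursor() as c:
            # A database disconnection here raises an exception, so the
            # function exits before the cleanup at the bottom ever runs.
            c.execute(
                "SELECT message FROM notifications"
                " WHERE workflow_id = %s AND topic = %s",
                (workflow_id, topic),
            )
            row = c.fetchone()
        if row is None:
            # No message yet: block until the notification listener signals us.
            condition.wait(timeout)

    # Cleanup only runs if nothing above raised. A mid-query disconnection
    # leaves a stale entry in `conditions` behind.
    del conditions[(workflow_id, topic)]
    return row[0] if row else None
```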

The bug only occurred if the database disconnected precisely while this code was running: the SQL query (c.execute()) would throw an exception, causing the function to exit early without releasing or removing the condition variable from the map. When the workflow later recovered, it would find the lingering condition variable, assume another execution was still active, and block indefinitely, waiting for the original (now-failed) execution to finish.

The fix: Once we understood the problem, the solution was straightforward: wrap the code in a try-finally to ensure the condition variable is always cleaned up, even if an error occurs. In terms of the sketch above:
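
```python
# Same illustrative sketch, with the query-and-wait section wrapped in
# try/finally so the condition variable is always removed from the map.
def recv(db, conditions: dict, workflow_id: str, topic: str, timeout: float):
    condition = threading.Condition()
    existing = conditions.setdefault((workflow_id, topic), condition)
    if existing is not condition:
        with existing:
            existing.wait(timeout)
        return None

    try:
        with condition:
            with db.cursor() as c:
                c.execute(
                    "SELECT message FROM notifications"
                    " WHERE workflow_id = %s AND topic = %s",
                    (workflow_id, topic),
                )
                row = c.fetchone()
            if row is None:
                condition.wait(timeout)
    finally:
        # Runs on success and on exceptions alike, so a recovered workflow
        # never finds a stale condition variable and blocks forever.
        conditions.pop((workflow_id, topic), None)
    return row[0] if row else None
```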

This is exactly the kind of bug that chaos testing is designed to catch—subtle, hard to reproduce, and nearly impossible to find with unit or integration tests alone, as it only occurs when a database disconnection happens at a precise moment. The original fix can be found in this PR. Note that the chaos tests focusing specifically on database disconnections are just a small part of our overall chaos test suite.

Learn More about Durable Execution

If you're building reliable systems or just enjoy uncovering weird edge cases, we’d love to hear from you. At DBOS, our goal is to make durable workflows as lightweight and easy to work with as possible. Check it out: