A lightweight, production-ready Durable Execution Engine for Java that enables fault-tolerant workflow orchestration. Workflows can be interrupted at any point—due to crashes, restarts, or failures—and seamlessly resume from the exact point of interruption without re-executing completed steps.
- Overview
- Key Features
- Architecture
- Getting Started
- API Reference
- How It Works
- Configuration
- Testing
- Contributing
- License
Traditional applications lose all in-memory state when a process crashes or restarts. For long-running workflows with side effects (API calls, database writes, notifications), this means either:
- Re-executing already-completed operations (potentially causing duplicates)
- Building complex manual checkpointing logic
Durable Execution solves this by automatically persisting workflow state at each step. This engine implements the durable execution pattern pioneered by systems like Temporal, DBOS, and Azure Durable Functions.
- Saga Patterns: Multi-step transactions across microservices
- Long-Running Workflows: Order processing, user onboarding, ETL pipelines
- Retry Logic: Automatic recovery from transient failures
- Audit Requirements: Complete execution history with step-level granularity
| Feature | Description |
|---|---|
| Step Memoization | Completed steps are persisted and replayed on restart—never re-executed |
| Parallel Execution | Native support for concurrent steps via CompletableFuture |
| Automatic Sequencing | Deterministic step identification for loops and conditionals |
| Zombie Step Detection | Handles crashes between execution and commit |
| Type-Safe API | Full generic support with compile-time type checking |
| Thread-Safe Persistence | Concurrent writes handled with proper synchronization |
| Zero External Dependencies | Embedded SQLite—no external services required |
| Category | Implementation | Benefit |
|---|---|---|
| Thread Safety | Fair ReentrantLock for all write operations | Prevents race conditions in concurrent workflows |
| Performance | HikariCP connection pooling + SQLite WAL mode | Optimized for high-throughput concurrent access |
| Resilience | Exponential backoff retry for SQLITE_BUSY errors | Handles transient database lock contention |
| Fault Tolerance | Two-phase commit (PENDING → COMPLETED) | Detects and recovers from zombie steps |
| Observability | SLF4J logging with detailed step execution traces | Production-ready debugging and monitoring |
| Type Safety | Generic-based API with compile-time checks | Eliminates runtime type errors |
┌─────────────────────────────────────────────────────────────┐
│ Workflow Code │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Step 1 │──│ Step 2 │──│ Step 3 │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ DurableContext │
│ • Step execution with memoization │
│ • Sequence tracking (logical clock) │
│ • Async step orchestration │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Persistence Layer │
│ • SQLite with WAL mode │
│ • HikariCP connection pooling │
│ • Retry logic with exponential backoff │
└─────────────────────────────────────────────────────────────┘
src/
├── main/java/com/durableengine/
│ ├── engine/
│ │ ├── DurableContext.java # Core execution context
│ │ ├── WorkflowRunner.java # Workflow lifecycle management
│ │ ├── StepRecord.java # Step persistence model
│ │ ├── StepStatus.java # Execution state enum
│ │ ├── StepException.java # Step failure handling
│ │ ├── persistence/
│ │ │ ├── StepRepository.java # Repository interface
│ │ │ └── SQLiteStepRepository.java
│ │ └── serialization/
│ │ └── JsonSerializer.java # JSON serialization
│ ├── examples/
│ │ └── onboarding/
│ │ └── OnboardingWorkflow.java
│ └── App.java # CLI application
└── test/java/com/durableengine/engine/
├── DurableContextTest.java # Unit tests
├── WorkflowRunnerTest.java # Integration tests
└── ConcurrencyTest.java # Concurrency stress tests
- Java 17 or higher
- Maven 3.6+
# Clone the repository
git clone https://github.com/yourusername/durable-execution-engine.git
cd durable-execution-engine
# Build the project
mvn clean package
# Run tests
mvn test

try (WorkflowRunner runner = new WorkflowRunner("./workflow.db")) {
runner.run("order-123", ctx -> {
// Step 1: Create order
String orderId = ctx.step("createOrder", () -> orderService.create());
// Step 2: Process payment
ctx.step("processPayment", () -> paymentService.charge(orderId));
// Step 3: Send confirmation
ctx.step("sendConfirmation", () -> emailService.notify(orderId));
});
}

If the process crashes after Step 2, restarting will:
- Skip Step 1 (replay cached result)
- Skip Step 2 (replay cached result)
- Execute Step 3 (continue from where it stopped)
The core abstraction for durable execution:
// Execute a step with explicit ID
<T> T step(String id, Callable<T> fn)
// Execute a step with automatic ID generation
<T> T step(Callable<T> fn)

Parameters:
- `id` — Unique identifier for this step within the workflow
- `fn` — The operation to execute (or replay from cache)
Behavior:
- If step was previously completed → returns cached result
- If step is new → executes, persists result, returns value
- If step is pending (zombie) → re-executes safely
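This replay behavior can be sketched as follows — a minimal, hypothetical simplification of what `DurableContext.step` does, using an in-memory map in place of the SQLite repository (the class name `MemoizingContext` is illustrative, not part of the engine's API):

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of step memoization: a completed step's result is cached by its ID
// and replayed on subsequent calls instead of re-running the operation.
class MemoizingContext {
    private final Map<String, Object> completed = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T step(String id, Callable<T> fn) throws Exception {
        // Previously completed step: replay the cached result, never re-execute.
        if (completed.containsKey(id)) {
            return (T) completed.get(id);
        }
        // New step: execute, persist (here: cache in memory), then return.
        T result = fn.call();
        completed.put(id, result);
        return result;
    }
}
```

Calling `step` twice with the same ID runs the operation only once; the second call returns the first result.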
Execute steps concurrently while maintaining durability:
runner.run("parallel-workflow", ctx -> {
// Launch parallel steps
CompletableFuture<String> taskA = ctx.stepAsync("fetchUserData", () ->
userService.fetch(userId));
CompletableFuture<String> taskB = ctx.stepAsync("fetchOrderHistory", () ->
orderService.getHistory(userId));
// Wait for both
String userData = taskA.join();
String orderHistory = taskB.join();
// Continue with sequential step
ctx.step("generateReport", () ->
reportService.generate(userData, orderHistory));
});

For simpler workflows, let the engine generate step IDs automatically:
runner.run("auto-id-workflow", ctx -> {
// IDs generated from stack trace + sequence counter
String a = ctx.step(() -> computeA());
String b = ctx.step(() -> computeB());
// Works correctly in loops
for (int i = 0; i < items.size(); i++) {
final int index = i;
ctx.step(() -> processItem(items.get(index)));
}
});

WorkflowRunner runner = new WorkflowRunner("./workflow.db");
// Start or resume a workflow
WorkflowResult result = runner.run("workflow-id", ctx -> { ... });
// Check result
if (result.success()) {
System.out.println("Workflow completed");
} else {
System.out.println("Failed at: " + result.failedStepId());
}
// Reset workflow (delete all persisted steps)
runner.reset("workflow-id");
// Clean up
runner.close();

Handling loops and conditionals requires deterministic step identification:
for (int i = 0; i < 3; i++) {
    final int index = i; // capture loop variable for the lambda
    ctx.step("process", () -> doWork(index));
}

The engine maintains per-step-ID counters:
| Execution | Step Key |
|---|---|
| 1st iteration | process |
| 2nd iteration | process:1 |
| 3rd iteration | process:2 |
On restart, the same counter logic ensures steps replay in the correct order.
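The counter logic can be sketched like this — an assumption-level simplification, not the engine's actual code (the class name `StepKeyGenerator` is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of deterministic step-key derivation: each logical step ID gets a
// per-workflow occurrence counter, so repeated IDs (e.g. inside loops) map to
// unique, reproducible keys: "process", "process:1", "process:2", ...
class StepKeyGenerator {
    private final Map<String, Integer> counters = new HashMap<>();

    String nextKey(String id) {
        int occurrence = counters.merge(id, 1, Integer::sum) - 1; // 0-based
        return occurrence == 0 ? id : id + ":" + occurrence;
    }
}
```

Because the counters are rebuilt by replaying the workflow code itself, a restart derives exactly the same keys in the same order, which is what lets each iteration match its persisted record.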
Concurrent step execution requires careful synchronization:
- Connection Pooling — HikariCP manages database connections
- WAL Mode — SQLite Write-Ahead Logging enables concurrent reads during writes
- Write Serialization — Fair `ReentrantLock` prevents write conflicts
- Retry Logic — Exponential backoff handles transient `SQLITE_BUSY` errors
private <T> T executeWithRetry(SqlOperation<T> operation)
        throws SQLException, InterruptedException {
    int attempt = 0;
    long delay = 50; // milliseconds
    while (attempt < MAX_RETRIES) {
        try {
            return operation.execute();
        } catch (SQLException e) {
            if (isBusyError(e)) {
                Thread.sleep(delay);
                delay *= 2; // Exponential backoff
                attempt++;
            } else {
                throw e;
            }
        }
    }
    throw new RuntimeException("Max retries exceeded");
}

A "zombie step" occurs when execution completes but persistence fails (crash before commit).
Solution: Two-Phase Persistence
- Before execution: Write a `PENDING` record
- After execution: Update to `COMPLETED` with the result
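The two phases can be sketched as follows — a minimal illustration, not the engine's implementation, with an in-memory map standing in for the SQLite repository (the class name `TwoPhaseContext` is hypothetical):

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of two-phase step persistence: a PENDING record is written before the
// operation runs, and upgraded to COMPLETED only after it succeeds. A crash
// between the two phases leaves a PENDING "zombie" record behind, which is
// detected and safely re-executed on restart.
class TwoPhaseContext {
    enum Status { PENDING, COMPLETED }
    record Record(Status status, Object output) {}

    final Map<String, Record> store = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T step(String key, Callable<T> fn) throws Exception {
        Record existing = store.get(key);
        if (existing != null && existing.status() == Status.COMPLETED) {
            return (T) existing.output(); // replay cached result
        }
        // Phase 1: mark PENDING before executing.
        // (A leftover PENDING record from a crash falls through to here too.)
        store.put(key, new Record(Status.PENDING, null));
        T result = fn.call();
        // Phase 2: commit the result as COMPLETED.
        store.put(key, new Record(Status.COMPLETED, result));
        return result;
    }
}
```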
On restart, PENDING steps are detected and re-executed:
return switch (existingStep.status()) {
    case COMPLETED -> deserialize(existingStep.output());
    case FAILED -> throw new StepException(existingStep.error());
    case PENDING -> reExecuteStep(stepKey, fn); // Zombie detected
};

CREATE TABLE steps (
id INTEGER PRIMARY KEY AUTOINCREMENT,
workflow_id TEXT NOT NULL,
step_key TEXT NOT NULL,
status TEXT NOT NULL, -- PENDING, COMPLETED, FAILED
output TEXT, -- JSON-serialized result
error_message TEXT,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL,
UNIQUE(workflow_id, step_key)
);
CREATE INDEX idx_steps_workflow_id ON steps(workflow_id);

The default configuration uses HikariCP with SQLite-optimized settings:
// WAL mode for concurrent access
config.addDataSourceProperty("journal_mode", "WAL");
// 30-second busy timeout
config.addDataSourceProperty("busy_timeout", "30000");
// 2MB cache
config.addDataSourceProperty("cache_size", "-2000");

mvn test

| Test Suite | Coverage |
|---|---|
| `DurableContextTest` | Step memoization, replay, failure handling, loops, auto-ID |
| `WorkflowRunnerTest` | Lifecycle management, resume from checkpoint, reset |
| `ConcurrencyTest` | Parallel writes, thread safety, stress testing |
# Build the JAR
mvn clean package
# Start a new workflow
java -jar target/durable-execution-engine-1.0.0.jar run onboard-001 "John Doe"
# Simulate crash after step 1
java -jar target/durable-execution-engine-1.0.0.jar run onboard-002 "Jane" --crash-after 1
# Resume (observe that step 1 is skipped)
java -jar target/durable-execution-engine-1.0.0.jar run onboard-002 "Jane"
# View workflow status
java -jar target/durable-execution-engine-1.0.0.jar status onboard-002
# Reset workflow
java -jar target/durable-execution-engine-1.0.0.jar reset onboard-002

| Technology | Purpose |
|---|---|
| Java 17 | Records, pattern matching, text blocks |
| SQLite | Embedded persistence (zero configuration) |
| HikariCP | High-performance connection pooling |
| Jackson | JSON serialization/deserialization |
| JUnit 5 | Testing framework |
| SLF4J | Logging abstraction |
| Maven | Build and dependency management |
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License. See LICENSE for details.
This implementation is inspired by the durable execution patterns pioneered by:
- Temporal
- DBOS
- Azure Durable Functions