Durable Execution Engine

A lightweight, production-ready Durable Execution Engine for Java that enables fault-tolerant workflow orchestration. Workflows can be interrupted at any point—due to crashes, restarts, or failures—and seamlessly resume from the exact point of interruption without re-executing completed steps.

Overview

Traditional applications lose all in-memory state when a process crashes or restarts. For long-running workflows with side effects (API calls, database writes, notifications), this means either:

Re-executing already-completed operations (potentially causing duplicates)
Building complex manual checkpointing logic

Durable Execution solves this by automatically persisting workflow state at each step. This engine implements the durable execution pattern pioneered by systems like Temporal, DBOS, and Azure Durable Functions.

When to Use This

Saga Patterns: Multi-step transactions across microservices
Long-Running Workflows: Order processing, user onboarding, ETL pipelines
Retry Logic: Automatic recovery from transient failures
Audit Requirements: Complete execution history with step-level granularity

Key Features

Feature	Description
Step Memoization	Completed steps are persisted and replayed on restart—never re-executed
Parallel Execution	Native support for concurrent steps via `CompletableFuture`
Automatic Sequencing	Deterministic step identification for loops and conditionals
Zombie Step Detection	Handles crashes between execution and commit
Type-Safe API	Full generic support with compile-time type checking
Thread-Safe Persistence	Concurrent writes handled with proper synchronization
Zero External Dependencies	Embedded SQLite—no external services required

Non-Functional Features

Category	Implementation	Benefit
Thread Safety	Fair `ReentrantLock` for all write operations	Prevents race conditions in concurrent workflows
Performance	HikariCP connection pooling + SQLite WAL mode	Optimized for high-throughput concurrent access
Resilience	Exponential backoff retry for `SQLITE_BUSY` errors	Handles transient database lock contention
Fault Tolerance	Two-phase persistence (PENDING → COMPLETED)	Detects and recovers from zombie steps
Observability	SLF4J logging with detailed step execution traces	Production-ready debugging and monitoring
Type Safety	Generic-based API with compile-time checks	Eliminates runtime type errors

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Workflow Code                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Step 1    │──│   Step 2    │──│   Step 3    │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                    DurableContext                           │
│  • Step execution with memoization                          │
│  • Sequence tracking (logical clock)                        │
│  • Async step orchestration                                 │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                   Persistence Layer                         │
│  • SQLite with WAL mode                                     │
│  • HikariCP connection pooling                              │
│  • Retry logic with exponential backoff                     │
└─────────────────────────────────────────────────────────────┘

Project Structure

src/
├── main/java/com/durableengine/
│   ├── engine/
│   │   ├── DurableContext.java       # Core execution context
│   │   ├── WorkflowRunner.java       # Workflow lifecycle management
│   │   ├── StepRecord.java           # Step persistence model
│   │   ├── StepStatus.java           # Execution state enum
│   │   ├── StepException.java        # Step failure handling
│   │   ├── persistence/
│   │   │   ├── StepRepository.java   # Repository interface
│   │   │   └── SQLiteStepRepository.java
│   │   └── serialization/
│   │       └── JsonSerializer.java   # JSON serialization
│   ├── examples/
│   │   └── onboarding/
│   │       └── OnboardingWorkflow.java
│   └── App.java                      # CLI application
└── test/java/com/durableengine/engine/
    ├── DurableContextTest.java       # Unit tests
    ├── WorkflowRunnerTest.java       # Integration tests
    └── ConcurrencyTest.java          # Concurrency stress tests

Getting Started

Prerequisites

Java 17 or higher
Maven 3.6+

Installation

# Clone the repository
git clone https://github.com/yourusername/durable-execution-engine.git
cd durable-execution-engine

# Build the project
mvn clean package

# Run tests
mvn test

Quick Example

try (WorkflowRunner runner = new WorkflowRunner("./workflow.db")) {
    runner.run("order-123", ctx -> {
        // Step 1: Create order
        String orderId = ctx.step("createOrder", () -> orderService.create());
        
        // Step 2: Process payment
        ctx.step("processPayment", () -> paymentService.charge(orderId));
        
        // Step 3: Send confirmation
        ctx.step("sendConfirmation", () -> emailService.notify(orderId));
    });
}

If the process crashes after Step 2, restarting will:

Skip Step 1 (replay cached result)
Skip Step 2 (replay cached result)
Execute Step 3 (continue from where it stopped)

API Reference

Step Primitive

The core abstraction for durable execution:

// Execute a step with explicit ID
<T> T step(String id, Callable<T> fn)

// Execute a step with automatic ID generation
<T> T step(Callable<T> fn)

Parameters:

id — Unique identifier for this step within the workflow
fn — The operation to execute (or replay from cache)

Behavior:

If step was previously completed → returns cached result
If step is new → executes, persists result, returns value
If step is pending (zombie) → re-executes; the earlier attempt may already have run its side effects, so step operations should be idempotent
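The first two behaviors above can be illustrated with a minimal in-memory sketch. It is not the engine's implementation: `MemoizingContext`, `MemoizationDemo`, and the `Map`-backed store are hypothetical stand-ins for `DurableContext` and the SQLite persistence layer, and wrapping failures in `IllegalStateException` is an assumption made to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** In-memory stand-in for a durable context; a real engine would persist to a database. */
class MemoizingContext {
    // Hypothetical store: step ID -> result of a completed step.
    private final Map<String, Object> completed;

    MemoizingContext(Map<String, Object> completed) {
        this.completed = completed;
    }

    @SuppressWarnings("unchecked")
    <T> T step(String id, Callable<T> fn) {
        if (completed.containsKey(id)) {
            return (T) completed.get(id); // completed earlier: replay, do not call fn
        }
        try {
            T result = fn.call();      // new step: run the operation
            completed.put(id, result); // record the result before returning it
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Step failed: " + id, e);
        }
    }
}

class MemoizationDemo {
    /** Runs one workflow twice against the same store; returns how many step bodies ran. */
    static int runTwiceAndCountExecutions() {
        Map<String, Object> store = new HashMap<>();
        int[] executions = {0};
        for (int run = 0; run < 2; run++) {
            // Simulated restart: a new context over the same store.
            MemoizingContext ctx = new MemoizingContext(store);
            String orderId = ctx.step("createOrder", () -> { executions[0]++; return "order-123"; });
            ctx.step("processPayment", () -> { executions[0]++; return "paid:" + orderId; });
        }
        return executions[0];
    }
}
```

In this sketch the second run finds both step IDs in the store, so neither step body runs again and the counter stays at 2.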

Parallel Execution

Execute steps concurrently while maintaining durability:

runner.run("parallel-workflow", ctx -> {
    // Launch parallel steps
    CompletableFuture<String> taskA = ctx.stepAsync("fetchUserData", () -> 
        userService.fetch(userId));
    
    CompletableFuture<String> taskB = ctx.stepAsync("fetchOrderHistory", () -> 
        orderService.getHistory(userId));
    
    // Wait for both
    String userData = taskA.join();
    String orderHistory = taskB.join();
    
    // Continue with sequential step
    ctx.step("generateReport", () -> 
        reportService.generate(userData, orderHistory));
});
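One way an asynchronous step could be layered on a memoized synchronous step is to submit the call to an executor and return the future. The sketch below assumes this design; `AsyncContext` and `AsyncDemo` are hypothetical names, the `ConcurrentHashMap` stands in for the persistence layer, and the engine's actual `stepAsync` may work differently.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class AsyncContext {
    // Hypothetical thread-safe store: step ID -> completed result.
    // ConcurrentHashMap rejects null values, so this sketch only supports non-null results.
    private final Map<String, Object> completed = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T step(String id, Callable<T> fn) {
        Object cached = completed.get(id);
        if (cached != null) {
            return (T) cached; // replay
        }
        try {
            T result = fn.call();
            completed.put(id, result);
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Step failed: " + id, e);
        }
    }

    <T> CompletableFuture<T> stepAsync(String id, Callable<T> fn) {
        // Runs the memoized step on the common ForkJoinPool.
        return CompletableFuture.supplyAsync(() -> step(id, fn));
    }
}

class AsyncDemo {
    static String run(AsyncContext ctx, AtomicInteger executions) {
        CompletableFuture<String> a = ctx.stepAsync("fetchUserData",
                () -> { executions.incrementAndGet(); return "user"; });
        CompletableFuture<String> b = ctx.stepAsync("fetchOrderHistory",
                () -> { executions.incrementAndGet(); return "orders"; });
        String user = a.join();   // wait for both parallel steps
        String orders = b.join();
        return ctx.step("generateReport", () -> user + "+" + orders);
    }
}
```

This sketch does not guard against two threads running the same step ID at once; distinct IDs per parallel step, as in the example above, avoid that case.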

Automatic Step ID Generation

For simpler workflows, let the engine generate step IDs automatically:

runner.run("auto-id-workflow", ctx -> {
    // IDs generated from stack trace + sequence counter
    String a = ctx.step(() -> computeA());
    String b = ctx.step(() -> computeB());
    
    // Works correctly in loops
    for (int i = 0; i < items.size(); i++) {
        final int index = i;
        ctx.step(() -> processItem(items.get(index)));
    }
});
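The README does not spell out how the stack trace and the sequence counter are combined. A rough sketch of one possible scheme, assuming the ID is the caller's class, method, and line number plus a per-location counter, is shown below. `AutoStepIds`, `AutoStepIdsDemo`, and the `#` separator are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AutoStepIds {
    // Call-site location -> number of IDs already issued for it.
    private final Map<String, Integer> counters = new HashMap<>();

    /** Returns an ID built from the caller's location plus a per-location counter. */
    String nextId() {
        // The first frame is nextId itself; skip it to find the code that called us.
        StackWalker.StackFrame caller = StackWalker.getInstance()
                .walk(frames -> frames.skip(1).findFirst())
                .orElseThrow(() -> new IllegalStateException("no caller frame"));
        String base = caller.getClassName() + "." + caller.getMethodName()
                + ":" + caller.getLineNumber();
        int seq = counters.merge(base, 1, Integer::sum) - 1; // 0 for the first call from this site
        return base + "#" + seq;
    }
}

class AutoStepIdsDemo {
    /** Requests n IDs from a single call site inside a loop. */
    static List<String> idsFromLoop(int n) {
        AutoStepIds ids = new AutoStepIds();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(ids.nextId()); // same call site on every iteration
        }
        return out;
    }
}
```

A location-based scheme like this yields the same IDs on every run of unchanged code, but editing the workflow source can shift line numbers and invalidate previously persisted IDs. Explicit IDs avoid that risk.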

Workflow Management

WorkflowRunner runner = new WorkflowRunner("./workflow.db");

// Start or resume a workflow
WorkflowResult result = runner.run("workflow-id", ctx -> { ... });

// Check result
if (result.success()) {
    System.out.println("Workflow completed");
} else {
    System.out.println("Failed at: " + result.failedStepId());
}

// Reset workflow (delete all persisted steps)
runner.reset("workflow-id");

// Clean up
runner.close();

How It Works

Sequence Tracking

Handling loops and conditionals requires deterministic step identification:

for (int i = 0; i < 3; i++) {
    final int index = i;  // lambdas can only capture effectively final variables
    ctx.step("process", () -> doWork(index));
}

The engine maintains per-step-ID counters:

Execution	Step Key
1st iteration	`process`
2nd iteration	`process:1`
3rd iteration	`process:2`

On restart, the same counter logic ensures steps replay in the correct order.
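The counter logic in the table above can be sketched in a few lines. `SequenceTracker` is a hypothetical name used for illustration; the key format follows the table, and the engine's real bookkeeping may differ.

```java
import java.util.HashMap;
import java.util.Map;

class SequenceTracker {
    // Step ID -> how many times it has been requested in this run.
    private final Map<String, Integer> seen = new HashMap<>();

    /** First use of an ID yields the ID itself; later uses append ":1", ":2", and so on. */
    String nextKey(String stepId) {
        int count = seen.merge(stepId, 1, Integer::sum); // 1 on first use
        return count == 1 ? stepId : stepId + ":" + (count - 1);
    }
}
```

Because the tracker starts empty on every run, a workflow that calls its steps in the same order after a restart derives the same keys and therefore finds the records persisted by the earlier run.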

Thread Safety

Concurrent step execution requires careful synchronization:

Connection Pooling — HikariCP manages database connections
WAL Mode — SQLite Write-Ahead Logging enables concurrent reads during writes
Write Serialization — Fair ReentrantLock prevents write conflicts
Retry Logic — Exponential backoff handles transient SQLITE_BUSY errors

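private <T> T executeWithRetry(SqlOperation<T> operation) throws SQLException {
    int attempt = 0;
    long delay = 50; // initial backoff in milliseconds

    while (attempt < MAX_RETRIES) {
        try {
            return operation.execute();
        } catch (SQLException e) {
            if (!isBusyError(e)) {
                throw e;  // Only SQLITE_BUSY is treated as transient
            }
        }
        try {
            Thread.sleep(delay);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();  // Restore the interrupt flag
            throw new RuntimeException("Interrupted during retry backoff", ie);
        }
        delay *= 2;  // Exponential backoff
        attempt++;
    }
    throw new RuntimeException("Max retries exceeded");
}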
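The write-serialization item in the list above can be sketched as follows. This is an illustration under assumptions, not the repository's code: `SerializedWriter` and `SerializedWriterDemo` are hypothetical, and the `ArrayList` stands in for the database writes that the lock would protect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class SerializedWriter {
    // true = fair mode: waiting threads acquire the lock roughly in arrival order.
    private final ReentrantLock writeLock = new ReentrantLock(true);
    private final List<String> log = new ArrayList<>(); // not thread-safe on its own

    void write(String record) {
        writeLock.lock();
        try {
            log.add(record); // stands in for an INSERT or UPDATE
        } finally {
            writeLock.unlock(); // always release, even if the write throws
        }
    }

    int size() {
        writeLock.lock();
        try {
            return log.size();
        } finally {
            writeLock.unlock();
        }
    }
}

class SerializedWriterDemo {
    /** Starts the given number of threads, each performing writesPerThread writes. */
    static int writeConcurrently(int threads, int writesPerThread) {
        SerializedWriter writer = new SerializedWriter();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < writesPerThread; i++) {
                    writer.write("t" + id + "-" + i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return writer.size();
    }
}
```

Fair mode trades some throughput for predictable ordering: no writer can be starved by threads that arrive later.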

Zombie Step Detection

A "zombie step" occurs when execution completes but persistence fails (crash before commit).

Solution: Two-Phase Persistence

Before execution: Write PENDING record
After execution: Update to COMPLETED with result

On restart, PENDING steps are detected and re-executed:

return switch (existingStep.status()) {
    case COMPLETED -> deserialize(existingStep.output());
    case FAILED    -> throw new StepException(existingStep.error());
    case PENDING   -> reExecuteStep(stepKey, fn);  // Zombie detected
};
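The two-phase flow can be illustrated with a simplified in-memory sketch. It assumes a `Map` in place of the `steps` table and omits the FAILED status; `TwoPhaseContext` and `Entry` are hypothetical names, not classes from the repository.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

class TwoPhaseContext {
    enum Status { PENDING, COMPLETED }

    /** One persisted row: a status plus the step output once it is known. */
    static final class Entry {
        Status status = Status.PENDING;
        Object output;
    }

    private final Map<String, Entry> store; // hypothetical stand-in for the steps table

    TwoPhaseContext(Map<String, Entry> store) {
        this.store = store;
    }

    @SuppressWarnings("unchecked")
    <T> T step(String key, Callable<T> fn) {
        Entry existing = store.get(key);
        if (existing != null && existing.status == Status.COMPLETED) {
            return (T) existing.output; // replay the stored result
        }
        // Phase 1: record intent before running. A leftover PENDING entry
        // (a zombie from a crashed run) is overwritten and the step runs again.
        Entry entry = new Entry();
        store.put(key, entry);
        T result;
        try {
            result = fn.call();
        } catch (Exception e) {
            throw new IllegalStateException("Step failed: " + key, e);
        }
        // Phase 2: attach the result and mark the step as done.
        entry.output = result;
        entry.status = Status.COMPLETED;
        return result;
    }
}
```

A crash between the two phases leaves a PENDING entry behind. The next run cannot tell whether the operation finished, so it runs the step again, which is why operations with external side effects should tolerate being repeated.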

Configuration

Database Schema

CREATE TABLE steps (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    workflow_id TEXT NOT NULL,
    step_key TEXT NOT NULL,
    status TEXT NOT NULL,          -- PENDING, COMPLETED, FAILED
    output TEXT,                   -- JSON-serialized result
    error_message TEXT,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    UNIQUE(workflow_id, step_key)
);

CREATE INDEX idx_steps_workflow_id ON steps(workflow_id);

Connection Pool Settings

The default configuration uses HikariCP with SQLite-optimized settings:

// WAL mode for concurrent access
config.addDataSourceProperty("journal_mode", "WAL");

// 30-second busy timeout
config.addDataSourceProperty("busy_timeout", "30000");

// ~2 MB page cache (a negative cache_size is interpreted as a size in KiB)
config.addDataSourceProperty("cache_size", "-2000");

Testing

Run All Tests

mvn test

Test Coverage

Test Suite	Coverage
`DurableContextTest`	Step memoization, replay, failure handling, loops, auto-ID
`WorkflowRunnerTest`	Lifecycle management, resume from checkpoint, reset
`ConcurrencyTest`	Parallel writes, thread safety, stress testing

Manual Testing with CLI

# Build the JAR
mvn clean package

# Start a new workflow
java -jar target/durable-execution-engine-1.0.0.jar run onboard-001 "John Doe"

# Simulate crash after step 1
java -jar target/durable-execution-engine-1.0.0.jar run onboard-002 "Jane" --crash-after 1

# Resume (observe that step 1 is skipped)
java -jar target/durable-execution-engine-1.0.0.jar run onboard-002 "Jane"

# View workflow status
java -jar target/durable-execution-engine-1.0.0.jar status onboard-002

# Reset workflow
java -jar target/durable-execution-engine-1.0.0.jar reset onboard-002

Technologies

Technology	Purpose
Java 17	Records, pattern matching, text blocks
SQLite	Embedded persistence (zero configuration)
HikariCP	High-performance connection pooling
Jackson	JSON serialization/deserialization
JUnit 5	Testing framework
SLF4J	Logging abstraction
Maven	Build and dependency management

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License. See LICENSE for details.

Acknowledgments

This implementation is inspired by the durable execution patterns pioneered by systems such as Temporal, DBOS, and Azure Durable Functions.
