Durable Execution Engine for Java enabling fault-tolerant workflows with step memoization and safe crash recovery

AjayKumbham/native-durable-execution-engine

Durable Execution Engine

A lightweight, production-ready Durable Execution Engine for Java that enables fault-tolerant workflow orchestration. A workflow interrupted by a crash, restart, or failure resumes after its last completed step, and completed steps are not re-executed.


Overview

Traditional applications lose all in-memory state when a process crashes or restarts. For long-running workflows with side effects (API calls, database writes, notifications), this means either:

  • Re-executing already-completed operations (potentially causing duplicates)
  • Building complex manual checkpointing logic

Durable Execution solves this by automatically persisting workflow state at each step. This engine implements the durable execution pattern pioneered by systems like Temporal, DBOS, and Azure Durable Functions.

When to Use This

  • Saga Patterns: Multi-step transactions across microservices
  • Long-Running Workflows: Order processing, user onboarding, ETL pipelines
  • Retry Logic: Automatic recovery from transient failures
  • Audit Requirements: Complete execution history with step-level granularity

Key Features

| Feature | Description |
| --- | --- |
| Step Memoization | Completed steps are persisted and replayed on restart, never re-executed |
| Parallel Execution | Native support for concurrent steps via CompletableFuture |
| Automatic Sequencing | Deterministic step identification for loops and conditionals |
| Zombie Step Detection | Handles crashes between execution and commit |
| Type-Safe API | Full generic support with compile-time type checking |
| Thread-Safe Persistence | Concurrent writes handled with proper synchronization |
| Zero External Dependencies | Embedded SQLite; no external services required |

Non-Functional Features

| Category | Implementation | Benefit |
| --- | --- | --- |
| Thread Safety | Fair ReentrantLock for all write operations | Prevents race conditions in concurrent workflows |
| Performance | HikariCP connection pooling + SQLite WAL mode | Optimized for high-throughput concurrent access |
| Resilience | Exponential backoff retry for SQLITE_BUSY errors | Handles transient database lock contention |
| Fault Tolerance | Two-phase persistence (PENDING → COMPLETED) | Detects and recovers from zombie steps |
| Observability | SLF4J logging with detailed step execution traces | Production-ready debugging and monitoring |
| Type Safety | Generic-based API with compile-time checks | Eliminates runtime type errors |

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Workflow Code                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Step 1    │──│   Step 2    │──│   Step 3    │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                    DurableContext                           │
│  • Step execution with memoization                          │
│  • Sequence tracking (logical clock)                        │
│  • Async step orchestration                                 │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                   Persistence Layer                         │
│  • SQLite with WAL mode                                     │
│  • HikariCP connection pooling                              │
│  • Retry logic with exponential backoff                     │
└─────────────────────────────────────────────────────────────┘

Project Structure

src/
├── main/java/com/durableengine/
│   ├── engine/
│   │   ├── DurableContext.java       # Core execution context
│   │   ├── WorkflowRunner.java       # Workflow lifecycle management
│   │   ├── StepRecord.java           # Step persistence model
│   │   ├── StepStatus.java           # Execution state enum
│   │   ├── StepException.java        # Step failure handling
│   │   ├── persistence/
│   │   │   ├── StepRepository.java   # Repository interface
│   │   │   └── SQLiteStepRepository.java
│   │   └── serialization/
│   │       └── JsonSerializer.java   # JSON serialization
│   ├── examples/
│   │   └── onboarding/
│   │       └── OnboardingWorkflow.java
│   └── App.java                      # CLI application
└── test/java/com/durableengine/engine/
    ├── DurableContextTest.java       # Unit tests
    ├── WorkflowRunnerTest.java       # Integration tests
    └── ConcurrencyTest.java          # Concurrency stress tests

Getting Started

Prerequisites

  • Java 17 or higher
  • Maven 3.6+

Installation

# Clone the repository
git clone https://github.com/AjayKumbham/native-durable-execution-engine.git
cd native-durable-execution-engine

# Build the project
mvn clean package

# Run tests
mvn test

Quick Example

try (WorkflowRunner runner = new WorkflowRunner("./workflow.db")) {
    runner.run("order-123", ctx -> {
        // Step 1: Create order
        String orderId = ctx.step("createOrder", () -> orderService.create());
        
        // Step 2: Process payment
        ctx.step("processPayment", () -> paymentService.charge(orderId));
        
        // Step 3: Send confirmation
        ctx.step("sendConfirmation", () -> emailService.notify(orderId));
    });
}

If the process crashes after Step 2, restarting will:

  1. Skip Step 1 (replay cached result)
  2. Skip Step 2 (replay cached result)
  3. Execute Step 3 (continue from where it stopped)
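
The replay behavior listed above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory map in place of the engine's SQLite store, and the class name `MemoDemo` and its fields are invented for illustration; they are not part of the engine's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical in-memory stand-in for the engine's persisted step store. */
class MemoDemo {
    // Outlives a simulated crash in this demo; the real engine persists to SQLite.
    final Map<String, Object> completed = new HashMap<>();
    int executions = 0; // number of step bodies that actually ran

    @SuppressWarnings("unchecked")
    <T> T step(String id, Callable<T> fn) {
        if (completed.containsKey(id)) {
            return (T) completed.get(id); // replay the cached result, do not re-run
        }
        T result;
        try {
            result = fn.call();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        executions++;
        completed.put(id, result);
        return result;
    }

    /** Runs the three-step workflow, optionally "crashing" after step 2. */
    String runWorkflow(boolean crashAfterStep2) {
        String orderId = step("createOrder", () -> "order-123");
        step("processPayment", () -> "paid:" + orderId);
        if (crashAfterStep2) {
            throw new IllegalStateException("simulated crash");
        }
        return step("sendConfirmation", () -> "sent:" + orderId);
    }
}
```

Calling `runWorkflow(true)` and then `runWorkflow(false)` on the same instance runs three step bodies in total: the second call replays steps 1 and 2 from the map and executes only step 3.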

API Reference

Step Primitive

The core abstraction for durable execution:

// Execute a step with explicit ID
<T> T step(String id, Callable<T> fn)

// Execute a step with automatic ID generation
<T> T step(Callable<T> fn)

Parameters:

  • id — Unique identifier for this step within the workflow
  • fn — The operation to execute (or replay from cache)

Behavior:

  • If step was previously completed → returns cached result
  • If step is new → executes, persists result, returns value
  • If step is pending (zombie) → re-executes it; the earlier attempt may have run partly or fully, so the operation should be idempotent

Parallel Execution

Execute steps concurrently while maintaining durability:

runner.run("parallel-workflow", ctx -> {
    // Launch parallel steps
    CompletableFuture<String> taskA = ctx.stepAsync("fetchUserData", () -> 
        userService.fetch(userId));
    
    CompletableFuture<String> taskB = ctx.stepAsync("fetchOrderHistory", () -> 
        orderService.getHistory(userId));
    
    // Wait for both
    String userData = taskA.join();
    String orderHistory = taskB.join();
    
    // Continue with sequential step
    ctx.step("generateReport", () -> 
        reportService.generate(userData, orderHistory));
});
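
One way an asynchronous step could be layered on top of a memoized synchronous step is sketched below. This is an assumption about the general shape, not the engine's actual implementation: the class `AsyncStepSketch`, the in-memory `ConcurrentHashMap`, and the use of the common fork-join pool are all hypothetical choices made for the example.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: stepAsync as a thin wrapper around a memoized step. */
class AsyncStepSketch {
    // Thread-safe stand-in for the persisted store (results must be non-null here).
    private final Map<String, Object> completed = new ConcurrentHashMap<>();
    final AtomicInteger executions = new AtomicInteger();

    @SuppressWarnings("unchecked")
    <T> T step(String id, Callable<T> fn) {
        Object cached = completed.get(id);
        if (cached != null) {
            return (T) cached; // already completed: replay
        }
        try {
            T result = fn.call();
            executions.incrementAndGet();
            completed.put(id, result);
            return result;
        } catch (Exception e) {
            throw new CompletionException(e);
        }
    }

    /** Runs the memoized step on the common pool and exposes it as a future. */
    <T> CompletableFuture<T> stepAsync(String id, Callable<T> fn) {
        return CompletableFuture.supplyAsync(() -> step(id, fn));
    }
}
```

Because each future resolves through the same memoized `step`, a step ID that already completed returns its stored value without running the callable again, whether it is requested synchronously or asynchronously.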

Automatic Step ID Generation

For simpler workflows, let the engine generate step IDs automatically:

runner.run("auto-id-workflow", ctx -> {
    // IDs generated from stack trace + sequence counter
    String a = ctx.step(() -> computeA());
    String b = ctx.step(() -> computeB());
    
    // Works correctly in loops
    for (int i = 0; i < items.size(); i++) {
        final int index = i;
        ctx.step(() -> processItem(items.get(index)));
    }
});
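
The comment in the example mentions IDs built from the stack trace plus a sequence counter. A rough sketch of that idea follows, using the JDK's `StackWalker`. The class `AutoStepId`, the ID format, and the helper `idsFromLoop` are assumptions for illustration; the engine's real ID scheme may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: derive a step ID from the call site plus a counter. */
class AutoStepId {
    private final Map<String, Integer> counters = new HashMap<>();

    /** Returns "Class.method:line#n", where n counts earlier calls from that site. */
    String next() {
        // Frame 0 is next() itself, so skip it to reach the caller.
        StackWalker.StackFrame caller = StackWalker.getInstance()
                .walk(frames -> frames.skip(1).findFirst())
                .orElseThrow();
        String site = caller.getClassName() + "." + caller.getMethodName()
                + ":" + caller.getLineNumber();
        int n = counters.merge(site, 1, Integer::sum) - 1;
        return site + "#" + n;
    }

    /** Demo: three calls from a single call site inside a loop. */
    static List<String> idsFromLoop() {
        AutoStepId gen = new AutoStepId();
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            ids.add(gen.next());
        }
        return ids;
    }
}
```

Each loop iteration gets a distinct ID because the counter advances, and a fresh run reproduces the same IDs because the call site is unchanged. A scheme like this ties IDs to source line numbers, so editing the workflow code between runs would change them; explicit IDs avoid that.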

Workflow Management

WorkflowRunner runner = new WorkflowRunner("./workflow.db");

// Start or resume a workflow
WorkflowResult result = runner.run("workflow-id", ctx -> { ... });

// Check result
if (result.success()) {
    System.out.println("Workflow completed");
} else {
    System.out.println("Failed at: " + result.failedStepId());
}

// Reset workflow (delete all persisted steps)
runner.reset("workflow-id");

// Clean up
runner.close();

How It Works

Sequence Tracking

Handling loops and conditionals requires deterministic step identification:

for (int i = 0; i < 3; i++) {
    final int index = i;  // lambdas can only capture effectively final locals
    ctx.step("process", () -> doWork(index));
}

The engine maintains per-step-ID counters:

| Execution | Step Key |
| --- | --- |
| 1st iteration | process |
| 2nd iteration | process:1 |
| 3rd iteration | process:2 |

On restart, the same counter logic ensures steps replay in the correct order.
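
The per-step-ID counters described above can be sketched in a few lines. The class name `StepKeySequencer` and method `nextKey` are hypothetical; only the key format (`process`, `process:1`, `process:2`) comes from the table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-step-ID sequence counters. */
class StepKeySequencer {
    // How many times each step ID has been seen in the current run.
    private final Map<String, Integer> counters = new HashMap<>();

    /** First use of an ID returns it unchanged; later uses append ":n". */
    String nextKey(String stepId) {
        int seen = counters.merge(stepId, 1, Integer::sum) - 1;
        return seen == 0 ? stepId : stepId + ":" + seen;
    }
}
```

A new sequencer starts every counter at zero, so a restarted workflow that issues the same calls in the same order derives the same keys and finds the results persisted under them.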

Thread Safety

Concurrent step execution requires careful synchronization:

  1. Connection Pooling — HikariCP manages database connections
  2. WAL Mode — SQLite Write-Ahead Logging enables concurrent reads during writes
  3. Write Serialization — Fair ReentrantLock prevents write conflicts
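  4. Retry Logic — Exponential backoff handles transient SQLITE_BUSY errors

private <T> T executeWithRetry(SqlOperation<T> operation) throws SQLException {
    long delay = 50; // initial backoff in milliseconds

    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        try {
            return operation.execute();
        } catch (SQLException e) {
            if (!isBusyError(e)) {
                throw e;  // not lock contention: propagate immediately
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();  // preserve interrupt status
                throw new RuntimeException("Interrupted during retry backoff", ie);
            }
            delay *= 2;  // exponential backoff
        }
    }
    throw new RuntimeException("Max retries exceeded");
}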
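
The write-serialization item in the list above can be sketched as follows. This is a minimal illustration with a fair `ReentrantLock`, not the repository's code: the class `SerializedWriter`, its `write` method, and the `stress` helper are hypothetical names.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical sketch: all writes pass through one fair lock. */
class SerializedWriter {
    // "true" requests fairness: waiting threads acquire the lock in arrival order.
    private final ReentrantLock writeLock = new ReentrantLock(true);
    private int writes = 0; // plain int, safe only because it is guarded by the lock

    <T> T write(Supplier<T> operation) {
        writeLock.lock();
        try {
            writes++;
            return operation.get();
        } finally {
            writeLock.unlock(); // always release, even if the operation throws
        }
    }

    /** Demo: several threads write concurrently; returns the total write count. */
    static int stress(int threads, int writesPerThread) {
        SerializedWriter writer = new SerializedWriter();
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                for (int i = 0; i < writesPerThread; i++) {
                    writer.write(() -> "ok");
                }
            });
            pool[t].start();
        }
        try {
            for (Thread thread : pool) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return writer.writes;
    }
}
```

A fair lock trades some throughput for predictable ordering, which keeps a single writer from starving the others under sustained contention.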

Zombie Step Detection

A "zombie step" occurs when execution completes but persistence fails (crash before commit).

Solution: Two-Phase Persistence

  1. Before execution: Write PENDING record
  2. After execution: Update to COMPLETED with result

On restart, PENDING steps are detected and re-executed:

return switch (existingStep.status()) {
    case COMPLETED -> deserialize(existingStep.output());
    case FAILED    -> throw new StepException(existingStep.error());
    case PENDING   -> reExecuteStep(stepKey, fn);  // Zombie detected
};
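
The two-phase persistence flow can be shown end to end with a small in-memory sketch. It assumes a hypothetical `TwoPhaseSketch` class with a map in place of the SQLite table, and it leaves out the FAILED state for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical sketch of PENDING -> COMPLETED persistence with zombie recovery. */
class TwoPhaseSketch {
    enum Status { PENDING, COMPLETED }

    record Entry(Status status, Object output) {}

    // Stand-in for the persisted steps table.
    final Map<String, Entry> store = new HashMap<>();
    int executions = 0;

    @SuppressWarnings("unchecked")
    <T> T step(String key, Callable<T> fn) {
        Entry existing = store.get(key);
        if (existing != null && existing.status() == Status.COMPLETED) {
            return (T) existing.output(); // replay the committed result
        }
        // Either a new step or a zombie left PENDING by an earlier crash.
        store.put(key, new Entry(Status.PENDING, null));     // phase 1
        T result;
        try {
            result = fn.call();
        } catch (Exception e) {
            throw new RuntimeException(e); // the record stays PENDING
        }
        executions++;
        store.put(key, new Entry(Status.COMPLETED, result)); // phase 2
        return result;
    }
}
```

If the store holds only a PENDING record for a key, the next call to `step` runs the callable again and then commits. This is why re-executed steps should tolerate running more than once.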

Configuration

Database Schema

CREATE TABLE steps (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    workflow_id TEXT NOT NULL,
    step_key TEXT NOT NULL,
    status TEXT NOT NULL,          -- PENDING, COMPLETED, FAILED
    output TEXT,                   -- JSON-serialized result
    error_message TEXT,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    UNIQUE(workflow_id, step_key)
);

CREATE INDEX idx_steps_workflow_id ON steps(workflow_id);

Connection Pool Settings

The default configuration uses HikariCP with SQLite-optimized settings:

// WAL mode for concurrent access
config.addDataSourceProperty("journal_mode", "WAL");

// 30-second busy timeout
config.addDataSourceProperty("busy_timeout", "30000");

// 2MB cache
config.addDataSourceProperty("cache_size", "-2000");

Testing

Run All Tests

mvn test

Test Coverage

| Test Suite | Coverage |
| --- | --- |
| DurableContextTest | Step memoization, replay, failure handling, loops, auto-ID |
| WorkflowRunnerTest | Lifecycle management, resume from checkpoint, reset |
| ConcurrencyTest | Parallel writes, thread safety, stress testing |

Manual Testing with CLI

# Build the JAR
mvn clean package

# Start a new workflow
java -jar target/durable-execution-engine-1.0.0.jar run onboard-001 "John Doe"

# Simulate crash after step 1
java -jar target/durable-execution-engine-1.0.0.jar run onboard-002 "Jane" --crash-after 1

# Resume (observe that step 1 is skipped)
java -jar target/durable-execution-engine-1.0.0.jar run onboard-002 "Jane"

# View workflow status
java -jar target/durable-execution-engine-1.0.0.jar status onboard-002

# Reset workflow
java -jar target/durable-execution-engine-1.0.0.jar reset onboard-002

Technologies

| Technology | Purpose |
| --- | --- |
| Java 17 | Records, pattern matching, text blocks |
| SQLite | Embedded persistence (zero configuration) |
| HikariCP | High-performance connection pooling |
| Jackson | JSON serialization/deserialization |
| JUnit 5 | Testing framework |
| SLF4J | Logging abstraction |
| Maven | Build and dependency management |

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License. See LICENSE for details.


Acknowledgments

This implementation is inspired by the durable execution patterns pioneered by systems such as Temporal, DBOS, and Azure Durable Functions.
