Skip to content

Latest commit

 

History

History
421 lines (352 loc) · 19.9 KB

File metadata and controls

421 lines (352 loc) · 19.9 KB

Architecture Overview

System Design

This library provides multiple JSON streaming parser implementations, each optimized for different use cases:

  1. JSONParser (TypeScript Transform Stream) - For Node.js streams
  2. Native Parser (C++ Background Thread) - For file descriptors
  3. Worker Parser (JavaScript Worker Thread) - Pure JS alternative

Data Flow Architectures

JSONParser (Stream-Based)

┌─────────────────────────────────────────────────────────┐
│ Node.js Stream (TCP, stdin, pipe, etc.)                │
│                                                         │
│  Data: "{\"foo\":1}\n{\"bar\":2}\n"                    │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ .pipe()
                     ▼
┌─────────────────────────────────────────────────────────┐
│ JSONParser (Transform Stream)                           │
│                                                         │
│  _transform(chunk, encoding, cb) {                      │
│    // Split by delimiter                                │
│    // Parse each JSON string                            │
│    // Push parsed objects                               │
│  }                                                      │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ .on('data')
                     ▼
┌─────────────────────────────────────────────────────────┐
│ JavaScript Event Loop                                   │
│                                                         │
│  parser.on('data', (obj) => {                           │
│    // Process parsed object                             │
│  });                                                    │
└─────────────────────────────────────────────────────────┘

Characteristics:
- ✅ Works with any Node.js stream
- ✅ Simple, pure JavaScript
- ✅ Fast when main thread is idle
- ❌ Blocks main thread during parsing
- ❌ Performance degrades under load

Native Parser (Direct FD Access)

┌─────────────────────────────────────────────────────────┐
│ Node.js Main Thread                                     │
│                                                         │
│  const fd = fs.openSync('/path/to/file', 'r');         │
│  const parser = createJsonParserNativeFromFd(fd);      │
│                                                         │
│  ┌─────────────────────────────────────────────────┐   │
│  │ Native Addon (N-API)                           │   │
│  │                                                 │   │
│  │  1. Receives fd from JS                        │   │
│  │  2. Duplicates: fd_dup = dup(fd)               │   │
│  │  3. Starts C++ background thread               │   │
│  └─────────────────────────────────────────────────┘   │
│                          │                              │
│                          │ fd_dup                       │
│                          ▼                              │
└─────────────────────────────────────────────────────────┘
                          │
                          │ Direct syscall
                          ▼
┌─────────────────────────────────────────────────────────┐
│ C++ Background Thread (std::thread)                    │
│                                                         │
│  while (!stop) {                                        │
│    // Direct read from kernel                           │
│    ssize_t n = read(fd_dup, buf, BUF_SZ);              │
│                                                         │
│    // Process data in C++                               │
│    // Split by delimiter                                │
│    // Prepare batches                                   │
│                                                         │
│    // Send to main thread via TSFN                      │
│    napi_call_threadsafe_function(tsfn, batch);         │
│  }                                                      │
│                                                         │
│  Data path: Kernel → C++ buffer → Zero-copy → JS       │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ Thread-Safe Function (TSFN)
                     │ Zero-copy external buffers
                     ▼
┌─────────────────────────────────────────────────────────┐
│ Node.js Main Thread (TSFN Callback)                    │
│                                                         │
│  call_js_from_tsfn_with_instance(env, cb, ctx, data) { │
│    // Receive batch from C++ thread                     │
│    // If passRawBuffers: true                          │
│    //   - Buffers are zero-copy (external buffers)     │
│    //   - Parse with V8's JSON.parse()                  │
│    // If passRawBuffers: false                          │
│    //   - Convert C++ JValue to JS objects             │
│    // Emit 'data' events                                │
│  }                                                      │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ .on('data')
                     ▼
┌─────────────────────────────────────────────────────────┐
│ Application Code                                        │
│                                                         │
│  parser.on('data', (obj) => {                           │
│    // Process parsed object                             │
│  });                                                    │
└─────────────────────────────────────────────────────────┘

Characteristics:
- ✅ Direct kernel access (no Node.js stream layer)
- ✅ Background I/O thread (doesn't block main thread)
- ✅ Zero-copy buffers (efficient data transfer)
- ✅ Resilient under load (only 1.5x slower at 90% CPU)
- ✅ Works with file descriptors (files, stdin, sockets)
- ❌ Requires native addon build
- ❌ Only works with file descriptors (not all streams)

Worker Parser (JavaScript Worker Thread)

┌─────────────────────────────────────────────────────────┐
│ Node.js Main Thread                                     │
│                                                         │
│  const fd = fs.openSync('/path/to/file', 'r');         │
│  const parser = createJsonParserWorkerFromFd(fd);      │
│                                                         │
│  ┌─────────────────────────────────────────────────┐   │
│  │ Worker Thread Manager                           │   │
│  │                                                 │   │
│  │  1. Spawns worker thread                        │   │
│  │  2. Passes fd to worker                         │   │
│  │  3. Sets up postMessage handlers                │   │
│  └─────────────────────────────────────────────────┘   │
│                          │                              │
│                          │ fd                           │
│                          ▼                              │
└─────────────────────────────────────────────────────────┘
                          │
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│ Worker Thread (Separate V8 Isolate)                    │
│                                                         │
│  // json-parser-worker-thread.ts                       │
│  while (true) {                                         │
│    // Read from fd using fs.readSync()                  │
│    const n = fs.readSync(fd, buf, {...});              │
│                                                         │
│    // Split by delimiter                                │
│    // Parse JSON in worker thread                       │
│    const parsed = JSON.parse(candidate);                │
│                                                         │
│    // Add to batch (complete POJSOs)                    │
│    batch.push(parsed);                                  │
│                                                         │
│    // Send to main thread via postMessage               │
│    parentPort!.postMessage({                            │
│      type: 'data',                                      │
│      batch: batch  // Structured cloning                │
│    });                                                  │
│  }                                                      │
│                                                         │
│  Data path: Kernel → Worker → Structured Clone → Main  │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ postMessage() (structured cloning)
                     │ Full object graph serialization
                     ▼
┌─────────────────────────────────────────────────────────┐
│ Node.js Main Thread (Message Handler)                  │
│                                                         │
│  worker.on('message', (msg) => {                        │
│    if (msg.type === 'data') {                           │
│      // Objects arrive fully parsed                     │
│      // Structured cloning reconstructs object graph   │
│      this.pending.push(...msg.batch);                   │
│      // Emit 'data' events                              │
│    }                                                    │
│  });                                                    │
└────────────────────┬────────────────────────────────────┘
                     │
                     │ .on('data')
                     ▼
┌─────────────────────────────────────────────────────────┐
│ Application Code                                        │
│                                                         │
│  parser.on('data', (obj) => {                           │
│    // Process parsed object                             │
│  });                                                    │
└─────────────────────────────────────────────────────────┘

Characteristics:
- ✅ Pure JavaScript (no native addon)
- ✅ Parsing offloaded to worker thread
- ✅ Complete objects arrive (no re-parsing needed)
- ❌ Structured cloning overhead (serialization/deserialization)
- ❌ Slower than native parser (especially for nested objects)
- ❌ More memory overhead (object graph copying)

Threading Models

Native Parser Threading

Main Thread (Node.js)          Background Thread (C++)
─────────────────────          ────────────────────────
                               
Create parser                  ──┐
  │                              │
  ├─> dup(fd)                    │
  ├─> Start std::thread          │
  │                              │
  │                              ├─> read(fd_dup, buf)
  │                              ├─> Process data
  │                              ├─> Prepare batch
  │                              │
  │<── TSFN callback ────────────┘
  │   (zero-copy buffers)
  │
  ├─> JSON.parse() (if passRawBuffers)
  ├─> Emit 'data' events
  │
  └─> Application receives objects

Worker Parser Threading

Main Thread (Node.js)          Worker Thread (V8 Isolate)
─────────────────────          ───────────────────────────
                               
Create parser                  ──┐
  │                              │
  ├─> Spawn worker_threads       │
  │                              │
  │                              ├─> fs.readSync(fd, buf)
  │                              ├─> JSON.parse()
  │                              ├─> Create POJSOs
  │                              │
  │<── postMessage() ────────────┘
  │   (structured cloning)
  │
  ├─> Receive objects (already parsed)
  ├─> Emit 'data' events
  │
  └─> Application receives objects

Performance Characteristics

Idle Main Thread

Parser Time (5K objects) Throughput Notes
JSONParser ~16ms 312,500 obj/sec Fastest - no thread overhead
Native-optimized ~21ms 238,000 obj/sec Thread overhead, but zero-copy
Worker ~31ms 161,000 obj/sec Structured cloning overhead

Under Load (50% CPU)

Parser Slowdown Notes
JSONParser ~2-3x Main thread blocked by parsing
Native-optimized ~1.37x Background I/O helps
Worker ~1.2-1.5x Parsing offloaded to worker

Under Load (90% CPU)

Parser Slowdown Notes
JSONParser ~4-5x Severe degradation
Native-optimized ~1.51x Resilient under load
Worker ~1.3-1.8x Good but structured cloning overhead

Memory Management

Native Parser (Zero-Copy)

// C++ side: Allocate buffer
item.external_data = std::make_unique<uint8_t[]>(size);
std::memcpy(item.external_data.get(), data, size);

// Create external buffer (zero-copy)
napi_create_external_buffer(env, size, item.external_data.get(),
                            nullptr, nullptr, &buffer);

// Buffer lifetime:
// - Owned by unique_ptr in ParsedItem
// - ParsedItem lives in BatchMsg
// - BatchMsg deleted after JS callback processes it
// - Safe: TSFN callbacks execute synchronously

Worker Parser (Structured Cloning)

// Worker thread: Create object
const parsed = JSON.parse(json);  // POJSO created

// Main thread: Receive via structured cloning
// - Entire object graph is serialized
// - Transferred across thread boundary
// - Deserialized on main thread
// - New objects created (memory copied)

Key Design Decisions

Why Zero-Copy Buffers?

  • Performance: Eliminates memory copying overhead
  • Efficiency: V8's JSON.parse() is highly optimized
  • Scalability: Better for high-throughput scenarios

Why Background Thread for I/O?

  • Non-blocking: Main thread stays responsive
  • Throughput: Can read large buffers efficiently
  • Resilience: Performance maintained under load

Why TSFN Instead of postMessage?

  • Efficiency: No serialization overhead
  • Zero-copy: Direct memory transfer
  • Lower latency: Synchronous callbacks

Why Support Both passRawBuffers Modes?

  • passRawBuffers: true (default): Best performance, uses V8's optimized JSON.parse()
  • passRawBuffers: false: Useful for debugging, C++ JSON parsing for comparison

File Descriptor Handling

Duplication Strategy

// Duplicate fd so we can:
// 1. Close it independently to break blocking read
// 2. Read from background thread safely
inst->fd_dup = dup(inst->fd);

// Background thread uses fd_dup
read(inst->fd_dup, buf, size);

// Main thread can close original fd
// Background thread continues with fd_dup

Error Handling

// Handle non-blocking FDs gracefully
if (errno == EAGAIN || errno == EWOULDBLOCK) {
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
  continue;
}

// Handle interrupts
if (errno == EINTR) {
  continue;
}

Batch Processing

Why Batching?

  • Reduced overhead: Fewer callbacks = better performance
  • Better throughput: Process multiple items at once
  • Lower latency: Amortize callback cost

Batch Size Tuning

  • Small batches (64-128): Lower latency, more callbacks
  • Large batches (2048+): Better throughput, higher latency
  • Default (2048): Good balance for most use cases

Where JSON Parsing Happens

Important: The location of JSON parsing depends on the mode:

passRawBuffers: true (Default - Optimized)

  • JSON Parsing: JS main thread using V8's JSON.parse()
  • I/O: C++ background thread
  • Why: V8's JSON.parse() is highly optimized, even on main thread

passRawBuffers: false (C++ Parsing)

  • JSON Parsing: C++ background thread using C++ JSON parser
  • I/O: C++ background thread
  • Why: Useful for debugging, but slower than V8's parser

See JSON Parsing Location for detailed explanation.

Summary

The architecture is designed to:

  1. Minimize data copying - Zero-copy buffers where possible
  2. Offload I/O - Background threads for file operations
  3. Maintain responsiveness - Non-blocking main thread
  4. Scale under load - Resilient performance characteristics
  5. Support multiple use cases - Streams, FDs, different performance needs

Each parser implementation is optimized for its specific use case, providing the best performance characteristics for different scenarios.