Feature Proposal: Support for Protobuf ingestion #1059

jwils · 2026-02-28T15:59:02Z

jwils
Feb 28, 2026
Maintainer

Summary

This proposal outlines adding Protocol Buffers (protobuf) schema generation to ElasticGraph via a new elasticgraph-protobuf gem that plugs into a new pluggable ingestion serializer architecture. Rather than treating JSON Schema as hardcoded core logic and proto as a pure extension, we first extract JSON Schema into its own elasticgraph-json_schema gem (PR #1079), then implement protobuf as a peer serializer using the same interface. Proto artifacts are generated alongside JSON Schema artifacts, offering protobuf as an alternative ingestion format — not a replacement.

Our company standardizes on protobuf over Kafka as its event transport mechanism. Other organizations using ElasticGraph similarly standardize on protobuf, with transports including Pulsar. Today, events flowing into ElasticGraph are described by an ElasticGraph-specific JSON Schema, which means only ElasticGraph can consume them. By generating .proto schemas from ElasticGraph's schema definition, we unlock these same event streams for every protobuf-aware consumer in the organization — data pipelines, analytics platforms, other microservices — all using the standard Kafka infrastructure that already exists. Additionally we enable a standard mechanism to publish to ElasticGraph across all languages using a standard protocol.

Context

The Problem: ElasticGraph Events Are Only Consumable by ElasticGraph

ElasticGraph currently generates JSON Schema artifacts from the schema definition DSL. These artifacts describe an ElasticGraphEventEnvelope wrapper and all indexed type definitions. They serve two roles:

Publisher contract (json_schemas.yaml): Given to event publishers for pre-validation and code generation.
Indexer validation + record preparation (json_schemas_by_version/vN.yaml): Used at indexing time to validate incoming events and transform field values before writing to the datastore.

This pipeline works well for ElasticGraph itself. However, the JSON Schema contract is ElasticGraph-specific — the internal versioned artifacts (json_schemas_by_version/vN.yaml) contain custom ElasticGraph metadata blocks used for field name translation and type dispatch. While the public json_schemas.yaml artifact does not contain these metadata blocks, it still uses an ElasticGraph-defined envelope format and is versioned through an ElasticGraph-specific mechanism. No other system in the organization can easily consume these events.

Meanwhile, the rest of Block communicates over Kafka using protobuf-encoded messages. Teams that want to consume the same domain events that flow into ElasticGraph must either:

Build a separate publisher that duplicates the events in protobuf format, or
Write custom translation layers to convert ElasticGraph's JSON format into something they can use

Both options create maintenance burden and drift risk.

The Performance Problem: JSON Schema Validation as a Bottleneck

Beyond interoperability, load testing has identified JSON Schema validation as a bottleneck in the indexing pipeline. Every incoming event is validated against the full JSON schema before indexing, which adds latency proportional to event complexity. Protobuf deserialization is schema-validated by construction — the generated code enforces types at compile time — so proto-ingested events can skip the JSON Schema validation step entirely, potentially improving indexer throughput.

The Opportunity: One Schema, Many Consumers

If ElasticGraph generates a .proto schema from its schema definition, publishers can produce protobuf-encoded events that are simultaneously:

Consumable by ElasticGraph for indexing into the datastore
Consumable by any other team using the company's standard Kafka + protobuf infrastructure

This turns ElasticGraph's schema definition into a single source of truth for the event contract, rather than an isolated artifact that only ElasticGraph understands.

                          ┌─────────────────────┐
                          │  ElasticGraph Schema │
                          │     Definition       │
                          └──────────┬───────────┘
                                     │
                          ┌──────────┴───────────┐
                          │   schema.proto        │
                          │   (generated)         │
                          └──────────┬───────────┘
                                     │
                    ┌────────────────┼────────────────┐
                    │                │                 │
              ┌─────▼─────┐   ┌─────▼─────┐   ┌──────▼──────┐
              │ ElasticGraph│   │  Analytics │   │   Other     │
              │  Indexer    │   │  Pipeline  │   │  Services   │
              └────────────┘   └───────────┘   └─────────────┘
                    ▲                ▲                 ▲
                    │                │                 │
                    └────────────────┴─────────────────┘
                              Kafka (protobuf)

Why Protobuf Specifically

Protobuf is the right choice because it is widely deployed across the industry and is the organizational standard at both Block and other companies using ElasticGraph. That said, it also brings inherent advantages worth noting:

Aspect	JSON Schema	Protobuf
Code generation	Fragmented, varies by language	First-class `protoc` for all major languages
Encoding	Text (JSON) — human-readable	Binary — compact, typically 3-10x smaller
Schema evolution	Manual version bumps; multi-version files	Field numbers are stable; add/remove freely
Field identity	String names on the wire	Integer field numbers on the wire
Backward compatibility	Nearest-version matching with fallback	Built into the wire format by design

JSON Schema remains valuable — its human readability makes it excellent for debugging, and the existing pipeline is mature and well-tested. The two formats serve complementary roles: JSON Schema for ElasticGraph's internal validation pipeline, and protobuf for cross-organization event interoperability.

Proposed Solution

Phase 1: Proto Schema Generation

A new elasticgraph-protobuf gem that generates .proto schema artifacts from the ElasticGraph schema definition. This builds on a prerequisite refactor (PR #1079) that extracts JSON Schema generation into elasticgraph-json_schema, establishing a pluggable ingestion serializer architecture. Both elasticgraph-json_schema and elasticgraph-protobuf implement the same extension pattern (APIExtension → FactoryExtension → ResultsExtension + SchemaArtifactManagerExtension), and EG projects can use either or both via ingestion_serializers configuration. JSON Schema generation continues to work exactly as before — proto artifacts are generated alongside, not instead of, the existing artifacts.

How It Works

The gem follows the established extension pattern (identical to elasticgraph-apollo and elasticgraph-warehouse):

APIExtension.extended(api) hooks into the factory and registers proto types for all built-in scalars:

ElasticGraph.define_schema do |schema|
  schema.extend ElasticGraph::Proto::SchemaDefinition::APIExtension

  schema.proto_schema_artifacts(
    package_name: "myapp.events.v1"
  )
end

** Walk all indexed types, registering messages and enums:
- Field numbers are assigned from a stable mapping file, or sequentially for new fields
- Enum values get an _UNSPECIFIED zero-value as required by proto3
- Reserved proto keywords are escaped with a _ suffix via Identifier
SchemaArtifactManagerExtension intercepts artifacts_from_schema_def to:
- Load the field-number mapping YAML before generation
- Emit schema.proto and proto_field_numbers.yaml alongside the existing JSON Schema artifacts

Example Output

Given this schema definition:

schema.object_type "Widget" do |t|
  t.field "id", "ID!"
  t.field "name", "String"
  t.field "cost", "Float"
  t.field "tags", "[String!]!"
  t.field "status", "WidgetStatus"
  t.index "widgets"
end

schema.enum_type "WidgetStatus" do |t|
  t.value "ACTIVE"
  t.value "RETIRED"
end

The gem generates:

syntax = "proto3";

package myapp.events.v1;

enum WidgetStatus {
  WIDGET_STATUS_UNSPECIFIED = 0;
  WIDGET_STATUS_ACTIVE = 1;
  WIDGET_STATUS_RETIRED = 2;
}

message Widget {
  string id = 1;
  string name = 2;
  double cost = 3;
  repeated string tags = 4;
  WidgetStatus status = 5;
}

And a stable mapping file (proto_field_numbers.yaml):

messages:
  Widget:
    id: 1
    name: 2
    cost: 3
    tags: 4
    status: 5

Field Number Stability

The field-number mapping file ensures proto field numbers remain stable across schema changes:

On first run: Numbers are assigned sequentially (1, 2, 3, ...) and written to the YAML file
On subsequent runs: Existing numbers are read from the file and preserved; only new fields get new numbers
Removed fields: Their numbers remain in the YAML file as reservations, preventing reuse

This provides a simple, append-only stability guarantee for the proto wire format without requiring changes to the existing JSON Schema versioning mechanism.

Index Field Name Mapping

ElasticGraph uses nameInIndex to allow the public-facing field name (as publishers and the GraphQL schema see it) to differ from the field name stored in the datastore index. This mapping is intentionally absent from public artifacts — publishers should not have access to internal field names. We've identified three approaches for handling this in proto:

Option A: YAML Sidecar

Extend the proto_field_numbers.yaml internal artifact (already needed for field number stability) to carry nameInIndex overrides:

messages:
  Widget:
    fields:
      id: 1
      public_name:
        field_number: 2
        name_in_index: "internalName"
      cost: 3

The generated schema.proto stays completely clean — no custom annotations, no EG-specific imports:

message Widget {
  string id = 1;
  string public_name = 2;
  double cost = 3;
}

At indexing time, the EG indexer reads the YAML to build a field-number-to-index-name lookup table, translating field names during deserialization.

Aspect	Assessment
Public artifact cleanliness	Best — standard proto3, no EG-specific annotations or imports
Consistency with EG patterns	Mirrors how `json_schemas_by_version/vN.yaml` carries `nameInIndex` as internal metadata separate from the public `json_schemas.yaml`
Runtime dependencies	None — just YAML parsing, which EG already does everywhere
Tradeoff	Name mapping lives outside the proto ecosystem. Proto-aware tooling (linters, registries) can't see or validate the relationship between public names and index names

Option B: Custom Proto Options

Use protobuf's custom options to annotate fields with their index name directly in schema.proto:

// elastic_graph/options.proto
syntax = "proto2";
import "google/protobuf/descriptor.proto";

message ElasticGraphFieldOptions {
  optional string name_in_index = 1;
}

extend google.protobuf.FieldOptions {
  optional ElasticGraphFieldOptions elastic_graph = 50000;
}

import "elastic_graph/options.proto";

message Widget {
  string id = 1;
  string public_name = 2 [(elastic_graph).name_in_index = "internalName"];
  double cost = 3;
}

Custom options have zero wire format impact — serialized bytes are identical with or without annotations. Consumers who don't import elastic_graph/options.proto can use the generated message code normally and never see the annotations. The EG indexer reads the option via Ruby's descriptor API at runtime.

Aspect	Assessment
Public artifact cleanliness	Weaker — `schema.proto` contains the `import` and visible `[(elastic_graph)...]` annotations, exposing internal field names in the source file even though consumers aren't functionally affected
Proto ecosystem integration	Best — metadata lives in the proto schema itself; proto-aware tooling can see and validate it
Runtime dependencies	Requires `google-protobuf` gem >= 23.x (older versions silently drop custom options from the runtime descriptor). Options `.proto` must use `proto2` syntax for the `extend` keyword; consuming files remain `proto3`
Tradeoff	Internal field names are visible in the public `.proto` file, even though they don't affect the wire format or consumer code generation

Option C: Separate Public and Internal Proto Files

Generate two .proto files with identical field numbers but different field names:

schema.proto (public): Uses GraphQL field names. Given to publishers and external consumers.
schema_internal.proto (private): Uses nameInIndex as field names. Used by the EG indexer for direct deserialization into index-ready records.

// schema.proto (public)
message Widget {
  string id = 1;
  string public_name = 2;
  double cost = 3;
}

// schema_internal.proto (private)
message Widget {
  string id = 1;
  string internalName = 2;
  double cost = 3;
}

Since proto uses field numbers on the wire (not names), both files produce identical serialized bytes. Publishers generate code from the public file; the indexer generates code from the internal file.

This can also be combined with Option B: the internal file carries custom option annotations (e.g., for additional EG metadata) while the public file remains completely clean. This gives the indexer both the direct index-ready field names and extensible metadata, without any of it leaking into the public contract.

// schema_internal.proto (private, with Option B annotations)
import "elastic_graph/options.proto";

message Widget {
  string id = 1;
  string internalName = 2 [(elastic_graph).public_name = "public_name"];
  double cost = 3;
}

Aspect	Assessment
Public artifact cleanliness	Best — public file is completely clean, internal file is never distributed
Simplicity	Straightforward — each side generates code from its own `.proto` and uses it directly
Extensibility	When combined with Option B annotations on the internal file, the indexer gets both direct field names and extensible metadata without affecting the public contract
Runtime dependencies	None beyond standard protobuf (Option B hybrid requires `google-protobuf` >= 23.x only for the indexer, not publishers)
Tradeoff	Two files to maintain and keep in sync (though both are generated from the same schema definition, so drift is prevented by construction). Proto-aware tooling (schema registries, `buf` breaking checks) needs to know which file to validate

Recommendation

We lean toward Option A (YAML sidecar) because it keeps the public proto artifact completely clean while reusing an internal artifact that already exists for field number stability. It also mirrors the existing JSON Schema pattern where nameInIndex lives in internal artifacts only. However, we'd welcome input — particularly on whether proto ecosystem integration (Option B), the simplicity of direct code generation from separate files (Option C), or the hybrid approach (Option C + B) is more valuable for your use case.

Breaking Change Detection

While protobuf's field-number system handles most schema evolution natively, some changes (e.g., changing a field's type incompatibly) can still break consumers. We integrate buf into the artifact dump flow:

On schema_artifacts:dump, generate the new .proto files
Run buf breaking against the previously dumped .proto
If breaking changes detected, raise an error with guidance to either revert the change or increment the schema version
If the version is bumped, archive the old .proto (analogous to json_schemas_by_version/)

buf is a prerequisite tool (like protoc), not a gem dependency, to keep the dependency footprint light.

Phase 2: Proto-Based Ingestion as an Alternative Path

This phase adds protobuf as an alternative ingestion path into ElasticGraph's indexer, running alongside the existing JSON Schema pipeline. The goal is to allow publishers that produce protobuf events over Kafka to send those events directly to ElasticGraph without converting to JSON first — the same events that other consumers in the organization are already reading.

A key design goal is maximizing consumer reach: other consumers should be able to read the domain events as standard proto messages without needing any knowledge of ElasticGraph's internal conventions.

Current Architecture

One common deployment pattern ingests events via SQS → Lambda:

Publishers → SQS → Lambda → ElasticGraph Indexer

Events arrive as JSON Lines in SQS message bodies. The SqsProcessor extracts SQS system attributes (SentTimestamp, messageId) and parses each JSON line as an ElasticGraphEventEnvelope containing op, type, id, version, and record.

Note: The core elasticgraph-indexer architecture is transport-agnostic — it exposes a Ruby object for ingestion. SQS/Lambda is one deployment pattern; others (HTTP, Kafka/Pulsar consumers, etc.) are equally supported.

The proto ingestion design must work with this architecture while also being transport-agnostic — supporting Kafka, Pulsar, SQS, and potentially other transports without requiring payload changes.

Event Metadata: Transport-Level Attributes vs Proto Envelope

ElasticGraph requires metadata alongside each domain event for indexing: the operation type (op), the record type (type), a unique identifier (id), and a monotonic version for conflict resolution (version). The central design question is where this metadata lives.

We evaluated three approaches:

Option A: Transport-level attributes for metadata, raw proto message as payload

Kafka Headers / Pulsar Properties / SQS Message Attributes:
  eg_op:      "upsert"
  eg_type:    "Widget"
  eg_id:      "widget-123"
  eg_version: "7"

Payload: <raw serialized Widget proto bytes>

The payload is the domain proto message itself — nothing wrapping it. ElasticGraph-specific metadata travels as transport-level key-value pairs: Kafka record headers, Pulsar message properties, or SQS message attributes. All transports support this natively.

On Kafka, headers are byte[] key-value pairs on each record:

ProducerRecord<String, byte[]> record = new ProducerRecord<>(topic, key, widget.toByteArray());
record.headers().add("eg_op", "upsert".getBytes(UTF_8));
record.headers().add("eg_type", "Widget".getBytes(UTF_8));
record.headers().add("eg_id", "widget-123".getBytes(UTF_8));
record.headers().add("eg_version", "7".getBytes(UTF_8));

On Pulsar, properties are String → String key-value pairs on each message:

producer.newMessage()
    .property("eg_op", "upsert")
    .property("eg_type", "Widget")
    .property("eg_id", "widget-123")
    .property("eg_version", "7")
    .value(widget.toByteArray())
    .send();

On SQS, message attributes are typed key-value pairs on each message:

{
  "messageAttributes": {
    "eg_op":      { "stringValue": "upsert",     "dataType": "String" },
    "eg_type":    { "stringValue": "Widget",      "dataType": "String" },
    "eg_id":      { "stringValue": "widget-123",  "dataType": "String" },
    "eg_version": { "stringValue": "7",           "dataType": "String" }
  },
  "body": "<base64-encoded Widget proto bytes>"
}

On any other transport (HTTP, gRPC, AMQP, etc.), the same pattern applies: metadata goes in the transport's native header/attribute mechanism, and the body is the raw proto bytes.

Aspect	Assessment
Consumer reach	Best. Other consumers read the Kafka/Pulsar value as a standard proto message. They don't need to know ElasticGraph exists.
Schema Registry	Works naturally — the Kafka value is a real proto message type, not an opaque `bytes` blob. The registry can track and validate it.
Transport agnostic	The payload is identical across transports. Only the metadata carrier changes (Kafka headers → Pulsar properties → SQS attributes → HTTP headers, etc.).
Routing performance	Metadata is accessible without deserializing the payload. Infrastructure can route, filter, or fan-out based on `eg_type` without touching the proto bytes.

Tradeoff: Metadata does not travel inside the serialized payload. If the message is extracted from its transport context (e.g., written to a file, stored in a database blob), the metadata is lost unless explicitly preserved. This approach also does not support batching multiple events into a single transport message.

Option B: `oneof` wrapper with all indexed types

message ElasticGraphEventEnvelope {
  string op = 1;
  string id = 2;
  int64 version = 3;

  oneof record {
    Widget widget = 11;
    Gadget gadget = 12;
    // ... one field per indexed type
  }
}

Aspect	Assessment
Consumer reach	Weaker. Every consumer must understand the `ElasticGraphEventEnvelope` wrapper and check the `oneof` case, even if they only care about `Widget`.
Schema Registry	Excellent with Confluent SR — each inner type registers as a schema reference.
Type safety	Best compile-time safety: consumers get exhaustive `oneof` handling (Scala pattern matching, Go switch).
Schema evolution	Adding a new `oneof` variant is backward-compatible — old consumers that don't recognize the new variant see the `oneof` as "not set" and skip it. No coordinated rollout needed.
Transport agnostic	Good — the envelope is self-contained.

Tradeoff: Forces every consumer to couple to an ElasticGraph-specific envelope schema, which undermines the goal of making events consumable via the company's standard format.

Option C: Proto envelope with `bytes record`

message ElasticGraphEventEnvelope {
  string op = 1;
  string type = 2;
  string id = 3;
  int64 version = 4;
  bytes record = 5;     // opaque serialized proto
}

Aspect	Assessment
Consumer reach	Weakest. Every consumer must double-deserialize: first the envelope, then the opaque `bytes` field using `type` as a hint.
Schema Registry	Cannot track or validate the inner schema — the `bytes` field is opaque.
Type safety	None at the proto level for the payload.
Transport agnostic	Good — self-contained.

Tradeoff: Loses the primary benefit of protobuf (type safety and schema validation) for the actual domain data. The research consensus is to avoid this pattern in favor of google.protobuf.Any at minimum, or transport-level metadata for best results.

Recommendation: Support Both Approaches

After discussion, we recommend supporting both transport-level attributes and envelope-wrapped records:

Raw proto + transport-level attributes (Option A) — for transports where one event = one message (typical Kafka/Pulsar usage). The payload is the raw domain proto message; metadata travels as Kafka headers / Pulsar properties / SQS message attributes.
Envelope-wrapped proto (Option B) — for transports where multiple events are batched into a single message (SQS JSON Lines, HTTP bulk). Uses a generated ElasticGraphEventEnvelope proto with a oneof record containing all indexed types.

The raw approach maximizes consumer reach (other consumers see standard proto messages with no EG-specific wrapper). The envelope approach supports efficient batching and preserves metadata alongside the payload.

The ingestion format is explicitly configured (e.g., ingestion_format: :envelope vs :raw) rather than auto-detected, to avoid ambiguity.

Pulsar compatibility confirmed: Pulsar has first-class support for message-level properties (String → String key-value pairs), functionally equivalent to Kafka headers:

producer.newMessage()
    .property("eg_op", "upsert")
    .property("eg_type", "Widget")
    .property("eg_id", "widget-123")
    .property("eg_version", "7")
    .value(widget.toByteArray())
    .send();

The metadata attributes follow a simple, documented convention (with the ability to overwrite field names).

Attribute	Type	Description
`eg_op`	string	Operation: `"upsert"` (or `"delete"` in the future)
`eg_type`	string	Indexed type name (e.g., `"Widget"`)
`eg_id`	string	Unique record identifier
`eg_version`	string	Monotonic version for conflict resolution

These attributes are carried as:

Kafka: Record headers (byte[] key-value pairs, UTF-8 encoded)
Pulsar: Message properties (String → String key-value pairs — no duplicate keys)
SQS: Message attributes (when using single-event-per-message mode)
HTTP/other: Request headers or equivalent transport metadata

Any transport that supports key-value metadata alongside a binary payload can carry these events. The eg_ prefix avoids collisions with other headers/attributes.

How Proto Deserialization Maps to `RecordPreparer`

The current RecordPreparer does three things that proto handles differently:

`RecordPreparer` Responsibility	Proto Equivalent
Field presence filtering (drop unknown fields)	Protobuf ignores unknown fields by default
Field name translation (`nameInIndex` mapping)	Field numbers are the wire identity — proto field names don't appear on the wire. The chosen index field name mapping approach (see Phase 1) provides the EG indexer with the `nameInIndex` for each field. The public `schema.proto` uses only the GraphQL field names that publishers see.
Type dispatch (scalar preparers for DateTime, etc.)	EG's custom scalar types are mapped to appropriate proto types during generation (e.g., `DateTime` → `google.protobuf.Timestamp`, `JsonSafeLong` → `int64`, `Untyped` → `string` containing canonical JSON). `google.protobuf.Struct` was considered for `Untyped` but rejected because it collapses all numbers to `double`, losing the integer/float distinction that `UntypedEncoder` preserves — since `Untyped` fields are stored as `keyword` in the datastore, `"3"` and `"3.0"` are distinct values and conflating them would break equality filters. A full scalar mapping table will be documented in the implementation PR.

Schema Evolution with Proto

With protobuf, schema evolution is governed by the proto3 compatibility rules:

Adding fields: Always safe — old consumers ignore unknown field numbers
Removing fields: Safe if the field number is reserved (the mapping file handles this)
Renaming fields: Always safe — only numbers are on the wire
Changing field types: Restricted by proto compatibility rules (e.g., int32 ↔ int64 is safe)

For proto-based publishers, this means no coordination with ElasticGraph's json_schema_version mechanism is needed. The proto_field_numbers.yaml file is the sole source of truth for wire format stability on the proto side.

Edge Cases

Several important edge cases:

Recursive types: Not applicable — ElasticGraph already prevents recursive type definitions because Elasticsearch/OpenSearch mappings require nested subfields to terminate at leaf nodes. The proto generator inherits this constraint.
Enum value scoping: Proto3 scopes enum values to the enclosing package or message, not to the enum itself. Two sibling enums with the same value name (e.g., TOAST) would collide. The generator prevents this by prefixing all enum values with the UPPER_SNAKE_CASE of the enum type name (e.g., BREAD_TYPE_TOAST), following the official proto style guide and buf lint default rules (ENUM_VALUE_PREFIX).
Deeply nested lists: Lists of depth > 1 (e.g., [[String]]) generate synthetic wrapper messages like MatrixValuesListLevel1
Proto keyword collisions: All identifiers are checked against 37 proto3 reserved words and escaped with _ suffix
Name collisions after escaping: Two types that map to the same proto name after keyword escaping raise a clear error

Optional: JSON Schema Replacement Mode

For deployments that have fully migrated all publishers to protobuf and no longer need JSON Schema artifacts, the gem also supports replace_json_schemas: true. This strips JSON Schema artifacts from the dump output and bypasses the JSON Schema version-bump checks. This is entirely opt-in and is not the expected usage.

Benefits

Unlocks cross-organization event consumption: The primary benefit — events published for ElasticGraph become consumable by any protobuf-aware system in the organization via the standard Kafka/Pulsar infrastructure
Single source of truth: ElasticGraph's schema definition generates both the JSON Schema (for ElasticGraph) and the proto schema (for everyone else), eliminating duplicate schema maintenance
Transport agnostic: The same metadata convention (transport-level attributes + raw proto payload) works identically across Kafka, Pulsar, SQS, and any other transport with key-value metadata support
Cross-language code generation: Publishers and consumers in any language get strongly-typed data classes from a single .proto file
Stable field numbers: The mapping file provides an auditable, append-only record of all field assignments
Simpler evolution for proto consumers: No version bumps needed — field numbers handle backward compatibility natively
Improved indexer throughput: Proto-ingested events skip JSON Schema validation (which load testing has identified as a bottleneck), since protobuf deserialization is schema-validated by construction

Alternatives Considered

Maintain Separate Proto Schemas Manually

We considered hand-writing .proto files independently from ElasticGraph's schema definition. This was rejected because:

Two schemas describing the same data inevitably drift
Every schema change requires updating both the ElasticGraph definition and the .proto file
Generating from a single source eliminates an entire class of inconsistency bugs

Apache Avro

Avro was considered as another binary serialization format with good Kafka ecosystem support. It was not chosen because:

Block's organization has standardized on protobuf, not Avro
Avro requires a schema registry for reader/writer schema negotiation
Protobuf's field-number system is simpler and doesn't require a registry

FlatBuffers / Cap'n Proto

These zero-copy serialization formats offer even better performance, but were not chosen because:

Not the organizational standard
Much smaller ecosystems and fewer language bindings
Protobuf is the industry standard with the broadest tooling support

Feedback Wanted

If you have interest in this feature, please review the proposal and let me know what you think! In particular:

Are there other organizations that would benefit from the generated .proto schema and any specific features needed for them?
Are there indexed types in your schema with custom scalar types that might need special proto type mapping?
Does the transport-level metadata attribute approach work with your infrastructure, or do you need the envelope approach for batching?
Which nameInIndex mapping option (A: YAML sidecar, B: custom proto options, C: separate files, or C+B hybrid) best fits your use case?

myronmarston · 2026-03-20T21:46:12Z

myronmarston
Mar 20, 2026
Maintainer

My overall thoughts: this is fantastic. I've got some initial/overall thoughts and then some replies to specific parts of the RFC.

I think it's likely that we'd also use this, and instead of making the JSON schema approach "primary" and proto a pure extension, we could introduce the concept of pluggable ingestion serializers:

We could design a proper interface for this with clear APIs for each JSON-schema-specific piece.
The current JSON schema logic could be extracted into a new elasticgraph-json_schema gem. (Actually, such a gem already exists, but its original logic got moved into elasticgraph-support and we no longer publish new versions of that gem--so we could repurpose it).
There could be a new elasticgraph-protobuf gem which implements the proto version of the same interface.
EG projects could choose which serializers they use. In some cases (such as when migrating from one to the other) both could be used. I'm imagining a configuration API like ingestion_serializers = [:json, :proto]. Or maybe event_serializers.

We probably also want to come up with some format-specific terms for shared things. For example, instead of taking about the "json schema artifacts" we can talk about the "ingestion schema artifacts" of which the json schema artifacts are merely one flavor.

Ideally, we'd refactor the codebase before introducing the proto support so that the core logic is format-agnostic and the code clearly expresses that.

With that said: that's a lot more work than just implementing this as a pure extension and leaving JSON schema as core...

Here are some specific responses to parts of your proposal:

The Problem: ElasticGraph Events Are Only Consumable by ElasticGraph

There's a second problem that our load testing with ElasticGraph has identified: the JSON schema validation can be a bottleneck. Protos can help solve this because they don't need heavy weight validation as the structure is more fixed than raw JSON.

However, the JSON Schema contract is ElasticGraph-specific — it contains custom ElasticGraph metadata blocks

This isn't quite accurate--the custom ElasticGraph metadata blocks only show up in the internal json_schemas_by_version/vN.yaml artifacts. The "public" JSON schema artifact (json_schemas.yaml) is intended to be the one that defines the contract with the publisher, and it does not have these custom blocks.

Protobuf is the right choice for Block because it is the organizational standard.

We use protobuf as well and I think it's widely deployed in the industry, so I'd argue it's a good choice beyond Block's usage of it :).

field_number_mapping_file: "proto_field_numbers.yaml"

All other schema artifacts have a hardcoded name and that's worked fine. I don't think we need to allow the name to be customized.

Enforcement mode: When enforce_field_number_mapping: true, the mapping file must exist

I'm not sure this is worth the complexity/added-code to enforce. Once a field mapping file is there, it's there, and it would only go away if it got intentionally deleted. And if that happened, the renumberings would be visible in the proto file during code review. A reason EG dumps schema artifacts (instead of generating them at boot time of indexer or graphql from the raw schema) is to make them visible during code review.

ElasticGraph currently ingests events via SQS → Lambda:

Slight correction: it does if you use elasticgraph-indexer_lambda with SQS. The core elasticgraph-indexer architecture is agnostic to the transport of events or the compute platform--it just exposes a ruby object you can use for ingestion.

Kafka record headers or SQS message attributes.

Worth noting that we use pulsar, not kafka--we should confirm that record headers work with pulsar, too. (I myself haven't really dealt with pulsar to be able to answer that. @markyang-toast: do you know?)

Regarding SQS message headers: typical SQS usage with ElasticGraph involves packing many events into a single SQS message. Using message attributes would require a more complex data structure and we might run into size limits.

On any other transport (HTTP, gRPC, AMQP, etc.), the same pattern applies: metadata goes in the transport's native header/attribute mechanism, and the body is the raw proto bytes.

I'm somewhat concerned with assuming that headers will work for all transports. With many transports, it'll help with efficiency to pack many events into a single message (as we do with SQS) so that the per-request overhead is amortized over many events. But once we do that, the headers need a complex data structure and we're likely to run into limitations there.

Take HTTP for example: just like the OpenSearch/Elasticsearch _bulk HTTP endpoint supports many updates in one call for efficiency, HTTP transport for EG ingestion would benefit from packing many events into a single HTTP body, and conveying the metadata via headers is unlikely to work well. It's worth noting that the _bulk endpoint solves this by having alternating header/payload JSON lines in the HTTP body.

In our case, we already have an envelope format that's designed to wrap an ElasticGraph event. I think we could generate an envelope proto (with a oneof for the record). EG could accomodate either form: raw records (in which case it uses headers) or envelope-wrapped records (in which case it uses the envelope). We could make it configurable, although if it's simple to make it do the right thing w/o configuration by intelligently detecting what it's working with, that would be nice.

(Side note: I wrote the above before getting to your section where you discuss the option of keeping the envelope format. I think my point stands though.)

the metadata is lost unless explicitly preserved.

This also concerns me. If we go the route I'm suggesting (support either the envelope or headers) than we can document this as a caveat of the header approach.

Adding a new indexed type is a forward-incompatible change — old consumers see UNKNOWN for the new oneof variant. Requires coordinated consumer rollouts.

I'm not following this. Why would it be a forward-incompatible change requiring coordinated rollouts? I've added new options to oneofs in the past and not had to coordinate the rollout...

Field numbers are the wire identity; the .proto file can use nameInIndex directly as the field name

nameInIndex is intended to be the internal/private name of the field, which does not show up in public schema artifacts--we don't want external stakeholders to even have access to that information. They should only have access to the public field names they deal with.

So, if we're going to use nameInIndex in proto files, I think we need to generate multiple proto files--a private, internal one and a public, shared one. They can share the same field numbers but use different field names.

or custom scalar handling at deserialization time

How does custom scalar handling work with protos? I've never heard or seen that as a proto feature.

A useful example of a custom indexing preparer is for the Untyped type. It uses UntypedEncoder to produce canonicalized JSON.

For proto-based publishers, this means no coordination with ElasticGraph's json_schema_version mechanism is needed.

That's fantastic but:

With JSON schema a much larger set of schema changes are possible--the versioning mechanism (and the fact that we keep old schemas around) allows it to support virtually any change to the schema. With protos its more limited.
Evolving schemas with protos require being aware of the rules. Some are obvious (e.g. don't renumber fields); others (like the type compatibilty) are less obvious.

We should consider integrating breaking change detection using buf. We could potentially combine this with the existing schema versioning mechanism:

Use buf to detect breaking changes.
When there are breaking changes between the currently dumped proto and the updated proto we're trying to dump, raise an error. (This is like what we already do for enforcing json_schema_version).
The error can give the user the option of reverting their breaking change or incrementing the schema version.

In most cases, bumping the schema version will never be necessary but when a situation arise where you really need to make a breaking change, there's a mechanism to do so.

Recursive types: A message referencing itself is handled via placeholder registration (prevents infinite recursion)

EG already prevents this because it's not possible to define an Elasticsearch/OpenSearch mapping for a recursive type that references itself--at some point, the nested subfields need to terminate at leaf nodes.

2 replies

jwils Mar 21, 2026
Maintainer Author

Thanks for the thorough feedback

Pluggable Ingestion Serializers
Makes a lot of sense - started to prototype it #1079

Terminology
Agreed — "ingestion schema artifacts" as the umbrella term with "JSON schema artifacts" and "proto schema artifacts" as flavors.

Simplifications

field_number_mapping_file: I'll hardcode the name.
enforce_field_number_mapping: Agreed will drop

Transport-Level Metadata & Batching

Supporting both makes sense. We could make detection automatic: if the payload deserializes as an ElasticGraphEventEnvelope, use the envelope path; otherwise, read metadata from transport attributes. But I think explicit configuration (ingestion_format: :envelope vs :raw) is safer and avoids ambiguity. Thoughts?

** Pulsar Compatibility**
confirmed for your Pulsar use case. Pulsar has first-class support for message-level properties, which are functionally equivalent to Kafka headers:

  producer.newMessage()
      .property("eg_op", "upsert")
      .property("eg_type", "Widget")
      .property("eg_id", "widget-123")
      .property("eg_version", "7")
      .value(widget.toByteArray())
      .send();

One difference worth noting: Pulsar properties are String -> String (not byte[] -> byte[] like Kafka headers), but since all our metadata values are strings, this is a non-issue.

nameInIndex Privacy
Few options:

Option A (YAML sidecar): Extend proto_field_numbers.yaml to carry name_in_index overrides. The public
schema.proto stays completely clean. Mirrors how json_schemas_by_version/vN.yaml carries nameInIndex
separately from the public json_schemas.yaml.
Option B (Custom proto options): Annotate fields with [(elastic_graph).name_in_index = "internalName"]
directly in the .proto file. Zero wire format impact, but the annotations are visible in the public
file.
Option C (Separate proto files): Generate schema.proto (public, GraphQL names) and
schema_internal.proto (private, index names) with identical field numbers. Could also combine with option B and have the annotations on the private proto only.

Custom Scalar Handling
What I meant is that we'd handle EG's custom scalar types during proto generation by mapping them to appropriate proto types (e.g., DateTime -> google.protobuf.Timestamp, JsonSafeLong -> int64). For Untyped, I think we'd need to map it to string containing canonical JSON. Let me know if you have a different question. I like the idea of google.protobuf.Struct, but it looks like it might not work.

jwils Mar 21, 2026
Maintainer Author

Updated the proposal

anthonycastiglia-toast · 2026-03-23T21:45:53Z

anthonycastiglia-toast
Mar 23, 2026

This looks great! In addition to the edge cases you have listed, protobuf imposes a uniqueness constraint on enum constants in sibling enum classes. From the proto style guide:

Enum values are semantically considered to not be scoped by their containing enum name, so the same name in two sibling enums is not allowed.

In our case, our schema has a number of enums containing a TOAST value, so the generated proto schema won't successfully compile with protoc.

Is this something that should be accounted for when generating the proto schema artifact, or something that should be handled by the schema author?

1 reply

jwils Mar 24, 2026
Maintainer Author

Great point. This should be handled by the framework and i think we should just use the recommended style guide where we prefix every enum with the type. I've updated the discussion.

I think another thing missing from my proposal is something that you all could use too. We may want to be able to reference existing proto objects. For example, I imagine your enums already exist and it'd probably be ideal to use the predefined enums?

We have shared protos for things like currency, money objects, etc and I think it'll cause problems to redefine them vs use the shared types. Just started talking about this internally, and solving it generically adds a lot more complexity, but would that be something that's valuable to you all?

I think I'd like it to work for the pathological case where we define the eg schema to match an existing proto schema. Maybe a feature to build on top of this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Proposal: Support for Protobuf ingestion #1059

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Feature Proposal: Support for Protobuf ingestion #1059

Uh oh!

Uh oh!

jwils Feb 28, 2026 Maintainer

Summary

Context

The Problem: ElasticGraph Events Are Only Consumable by ElasticGraph

The Performance Problem: JSON Schema Validation as a Bottleneck

The Opportunity: One Schema, Many Consumers

Why Protobuf Specifically

Proposed Solution

Phase 1: Proto Schema Generation

How It Works

Example Output

Field Number Stability

Index Field Name Mapping

Option A: YAML Sidecar

Option B: Custom Proto Options

Option C: Separate Public and Internal Proto Files

Recommendation

Breaking Change Detection

Phase 2: Proto-Based Ingestion as an Alternative Path

Current Architecture

Event Metadata: Transport-Level Attributes vs Proto Envelope

Option A: Transport-level attributes for metadata, raw proto message as payload

Option B: oneof wrapper with all indexed types

Option C: Proto envelope with bytes record

Recommendation: Support Both Approaches

How Proto Deserialization Maps to RecordPreparer

Schema Evolution with Proto

Edge Cases

Optional: JSON Schema Replacement Mode

Benefits

Alternatives Considered

Maintain Separate Proto Schemas Manually

Apache Avro

FlatBuffers / Cap'n Proto

Feedback Wanted

Replies: 2 comments · 3 replies

Uh oh!

myronmarston Mar 20, 2026 Maintainer

Uh oh!

jwils Mar 21, 2026 Maintainer Author

Uh oh!

jwils Mar 21, 2026 Maintainer Author

Uh oh!

anthonycastiglia-toast Mar 23, 2026

Uh oh!

jwils Mar 24, 2026 Maintainer Author

jwils
Feb 28, 2026
Maintainer

Option B: `oneof` wrapper with all indexed types

Option C: Proto envelope with `bytes record`

How Proto Deserialization Maps to `RecordPreparer`

Replies: 2 comments 3 replies

myronmarston
Mar 20, 2026
Maintainer

jwils Mar 21, 2026
Maintainer Author

jwils Mar 21, 2026
Maintainer Author

anthonycastiglia-toast
Mar 23, 2026

jwils Mar 24, 2026
Maintainer Author