Replies: 2 comments 3 replies
My overall thoughts: this is fantastic. I've got some high-level thoughts and then some replies to specific parts of the RFC. I think it's likely that we'd also use this, and instead of making the JSON Schema approach "primary" and proto a pure extension, we could introduce the concept of pluggable ingestion serializers:
We probably also want to come up with some format-specific terms for shared things. For example, instead of talking about the "json schema artifacts" we can talk about the "ingestion schema artifacts", of which the json schema artifacts are merely one flavor. Ideally, we'd refactor the codebase before introducing the proto support so that the core logic is format-agnostic and the code clearly expresses that. With that said: that's a lot more work than just implementing this as a pure extension and leaving JSON Schema as core... Here are some specific responses to parts of your proposal:
There's a second problem that our load testing with ElasticGraph has identified: the JSON schema validation can be a bottleneck. Protos can help solve this because they don't need heavyweight validation, as the structure is more fixed than raw JSON.
This isn't quite accurate--the custom ElasticGraph metadata blocks only show up in the internal versioned artifacts (`json_schemas_by_version/vN.yaml`), not the public artifact.
We use protobuf as well and I think it's widely deployed in the industry, so I'd argue it's a good choice beyond Block's usage of it :).
All other schema artifacts have a hardcoded name and that's worked fine. I don't think we need to allow the name to be customized.
I'm not sure this is worth the complexity/added-code to enforce. Once a field mapping file is there, it's there, and it would only go away if it got intentionally deleted. And if that happened, the renumberings would be visible in the proto file during code review. A reason EG dumps schema artifacts (instead of generating them at boot time of indexer or graphql from the raw schema) is to make them visible during code review.
Slight correction: it does if you use
Worth noting that we use Pulsar, not Kafka--we should confirm that record headers work with Pulsar, too. (I myself haven't really dealt with Pulsar to be able to answer that. @markyang-toast: do you know?) Regarding SQS message headers: typical SQS usage with ElasticGraph involves packing many events into a single SQS message. Using message attributes would require a more complex data structure and we might run into size limits.
I'm somewhat concerned with assuming that headers will work for all transports. With many transports, it'll help with efficiency to pack many events into a single message (as we do with SQS) so that the per-request overhead is amortized over many events. But once we do that, the headers need a complex data structure and we're likely to run into limitations there. Take HTTP for example: just like the OpenSearch/Elasticsearch

In our case, we already have an envelope format that's designed to wrap an ElasticGraph event. I think we could generate an envelope proto (with a

(Side note: I wrote the above before getting to your section where you discuss the option of keeping the envelope format. I think my point stands though.)
This also concerns me. If we go the route I'm suggesting (support either the envelope or headers) then we can document this as a caveat of the header approach.
I'm not following this. Why would it be a forward-incompatible change requiring coordinated rollouts? I've added new options to
So, if we're going to use
How does custom scalar handling work with protos? I've never heard of or seen that as a proto feature. A useful example of a custom indexing preparer is the one for the `Untyped` type: it uses `UntypedEncoder` to produce canonicalized JSON.
That's fantastic but:
We should consider integrating breaking change detection using buf. We could potentially combine this with the existing schema versioning mechanism:
In most cases, bumping the schema version won't be necessary, but when a situation arises where you really need to make a breaking change, there's a mechanism to do so.
EG already prevents this because it's not possible to define an Elasticsearch/OpenSearch mapping for a recursive type that references itself--at some point, the nested subfields need to terminate at leaf nodes.
This looks great! In addition to the edge cases you have listed, protobuf imposes a uniqueness constraint on enum constants in sibling enum classes. From the proto style guide:
In our case, our schema has a number of enums containing a

Is this something that should be accounted for when generating the proto schema artifact, or something that should be handled by the schema author?
Summary
This proposal outlines adding Protocol Buffers (protobuf) schema generation to ElasticGraph via a new `elasticgraph-protobuf` gem that plugs into a new pluggable ingestion serializer architecture. Rather than treating JSON Schema as hardcoded core logic and proto as a pure extension, we first extract JSON Schema into its own `elasticgraph-json_schema` gem (PR #1079), then implement protobuf as a peer serializer using the same interface. Proto artifacts are generated alongside JSON Schema artifacts, offering protobuf as an alternative ingestion format — not a replacement.

Our company standardizes on protobuf over Kafka as its event transport mechanism. Other organizations using ElasticGraph similarly standardize on protobuf, with transports including Pulsar. Today, events flowing into ElasticGraph are described by an ElasticGraph-specific JSON Schema, which means only ElasticGraph can consume them. By generating `.proto` schemas from ElasticGraph's schema definition, we unlock these same event streams for every protobuf-aware consumer in the organization — data pipelines, analytics platforms, other microservices — all using the standard Kafka infrastructure that already exists. Additionally, we enable a standard mechanism to publish to ElasticGraph across all languages using a standard protocol.

Context
The Problem: ElasticGraph Events Are Only Consumable by ElasticGraph
ElasticGraph currently generates JSON Schema artifacts from the schema definition DSL. These artifacts describe an `ElasticGraphEventEnvelope` wrapper and all indexed type definitions. They serve two roles:

- The public artifact (`json_schemas.yaml`): Given to event publishers for pre-validation and code generation.
- The internal versioned artifacts (`json_schemas_by_version/vN.yaml`): Used at indexing time to validate incoming events and transform field values before writing to the datastore.

This pipeline works well for ElasticGraph itself. However, the JSON Schema contract is ElasticGraph-specific — the internal versioned artifacts (`json_schemas_by_version/vN.yaml`) contain custom `ElasticGraph` metadata blocks used for field name translation and type dispatch. While the public `json_schemas.yaml` artifact does not contain these metadata blocks, it still uses an ElasticGraph-defined envelope format and is versioned through an ElasticGraph-specific mechanism. No other system in the organization can easily consume these events.

Meanwhile, the rest of Block communicates over Kafka using protobuf-encoded messages. Teams that want to consume the same domain events that flow into ElasticGraph must either:
Both options create maintenance burden and drift risk.
The Performance Problem: JSON Schema Validation as a Bottleneck
Beyond interoperability, load testing has identified JSON Schema validation as a bottleneck in the indexing pipeline. Every incoming event is validated against the full JSON schema before indexing, which adds latency proportional to event complexity. Protobuf deserialization is schema-validated by construction — the generated code enforces types at compile time — so proto-ingested events can skip the JSON Schema validation step entirely, potentially improving indexer throughput.
The Opportunity: One Schema, Many Consumers
If ElasticGraph generates a `.proto` schema from its schema definition, publishers can produce protobuf-encoded events that are simultaneously:

- ingestible by ElasticGraph's indexer, and
- consumable by any other protobuf-aware system in the organization.

This turns ElasticGraph's schema definition into a single source of truth for the event contract, rather than an isolated artifact that only ElasticGraph understands.
Why Protobuf Specifically
Protobuf is the right choice because it is widely deployed across the industry and is the organizational standard at both Block and other companies using ElasticGraph. That said, it also brings inherent advantages worth noting:

- Code generation via `protoc` for all major languages

JSON Schema remains valuable — its human readability makes it excellent for debugging, and the existing pipeline is mature and well-tested. The two formats serve complementary roles: JSON Schema for ElasticGraph's internal validation pipeline, and protobuf for cross-organization event interoperability.
Proposed Solution
Phase 1: Proto Schema Generation
A new `elasticgraph-protobuf` gem that generates `.proto` schema artifacts from the ElasticGraph schema definition. This builds on a prerequisite refactor (PR #1079) that extracts JSON Schema generation into `elasticgraph-json_schema`, establishing a pluggable ingestion serializer architecture. Both `elasticgraph-json_schema` and `elasticgraph-protobuf` implement the same extension pattern (`APIExtension` → `FactoryExtension` → `ResultsExtension` + `SchemaArtifactManagerExtension`), and EG projects can use either or both via `ingestion_serializers` configuration. JSON Schema generation continues to work exactly as before — proto artifacts are generated alongside, not instead of, the existing artifacts.
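To make the peer-serializer idea concrete, here's a rough sketch of what the opt-in configuration could look like. Only the `ingestion_serializers` key comes from this proposal; the surrounding structure and values are illustrative, not the actual EG settings schema:

```yaml
# Hypothetical project settings sketch.
schema_artifacts:
  ingestion_serializers:
    - elasticgraph-json_schema   # existing behavior, extracted into its own gem
    - elasticgraph-protobuf      # adds schema.proto + proto_field_numbers.yaml
```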
How It Works

The gem follows the established extension pattern (identical to `elasticgraph-apollo` and `elasticgraph-warehouse`):

- `APIExtension.extended(api)` hooks into the factory and registers proto types for all built-in scalars.
- Walk all indexed types, registering messages and enums:
  - enums get a `_UNSPECIFIED` zero-value as required by proto3
  - names that need it get a `_` suffix via `Identifier`
- `SchemaArtifactManagerExtension` intercepts `artifacts_from_schema_def` to write `schema.proto` and `proto_field_numbers.yaml` alongside the existing JSON Schema artifacts.

Example Output
Given this schema definition:
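The original example isn't preserved here, so the following is an illustrative stand-in using ElasticGraph's Ruby schema definition DSL (the `Widget` type and its fields are invented for this sketch):

```ruby
# Illustrative schema definition — types and fields invented for this example.
ElasticGraph.define_schema do |schema|
  schema.object_type "Widget" do |t|
    t.field "id", "ID!"
    t.field "name", "String"
    t.field "size", "Int"
    t.field "createdAt", "DateTime"
    t.index "widgets"
  end
end
```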
The gem generates:
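The generated artifact from the original post isn't preserved either; this is a hedged sketch of what output for the schema above could look like (package name, field naming, and timestamp mapping are assumptions, not actual gem output):

```proto
// Illustrative sketch of a generated schema.proto.
syntax = "proto3";

package elastic_graph.generated;

import "google/protobuf/timestamp.proto";

message Widget {
  string id = 1;
  string name = 2;
  int32 size = 3;
  google.protobuf.Timestamp created_at = 4;
}
```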
And a stable mapping file (`proto_field_numbers.yaml`):
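The original file contents aren't preserved; a plausible shape, given the append-only numbering described below, might be (structure assumed):

```yaml
# Illustrative proto_field_numbers.yaml sketch — the real layout may differ.
Widget:
  id: 1
  name: 2
  size: 3
  createdAt: 4
```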
Field Number Stability
The field-number mapping file ensures proto field numbers remain stable across schema changes:

- Existing fields keep their previously assigned numbers.
- New fields are assigned the next unused number.
- Numbers belonging to removed fields are never reused.
This provides a simple, append-only stability guarantee for the proto wire format without requiring changes to the existing JSON Schema versioning mechanism.
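As a minimal sketch of what that append-only guarantee implies (method and variable names are assumed, not actual gem code):

```ruby
# Reuse recorded numbers, append fresh ones for new fields, and never reclaim
# numbers from removed fields (removed fields stay in the recorded mapping).
def assign_field_numbers(fields, recorded) # recorded: {field_name => number}
  next_number = (recorded.values.max || 0) + 1
  fields.each_with_object({}) do |field, numbers|
    numbers[field] = recorded[field] || begin
      n = next_number
      next_number += 1
      n
    end
  end
end

assign_field_numbers(
  %w[id name size createdAt weight],
  { "id" => 1, "name" => 2, "size" => 3, "createdAt" => 4 }
)
# => {"id"=>1, "name"=>2, "size"=>3, "createdAt"=>4, "weight"=>5}
```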
Index Field Name Mapping
ElasticGraph uses `nameInIndex` to allow the public-facing field name (as publishers and the GraphQL schema see it) to differ from the field name stored in the datastore index. This mapping is intentionally absent from public artifacts — publishers should not have access to internal field names. We've identified three approaches for handling this in proto:

Option A: YAML Sidecar
Extend the `proto_field_numbers.yaml` internal artifact (already needed for field number stability) to carry `nameInIndex` overrides:
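The example from the original post isn't preserved; one possible shape would be the following (key names and nesting are assumptions — extending the file this way would also make it richer than the flat sketch shown earlier):

```yaml
# Illustrative sketch: field numbers plus optional nameInIndex overrides.
Widget:
  name:
    number: 2
  size:
    number: 3
    name_in_index: size_in_mm   # override example; key name assumed
```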
The generated `schema.proto` stays completely clean — no custom annotations, no EG-specific imports. At indexing time, the EG indexer reads the YAML to build a field-number-to-index-name lookup table, translating field names during deserialization.
- Mirrors the existing JSON Schema pattern, where `json_schemas_by_version/vN.yaml` carries `nameInIndex` as internal metadata separate from the public `json_schemas.yaml`

Option B: Custom Proto Options
Use protobuf's custom options to annotate fields with their index name directly in `schema.proto`:
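The original annotated example isn't preserved; here is a minimal sketch of the mechanism. The extension name, field number, and package are assumptions — the post only indicates an `(elastic_graph)`-style option. First, the options file, which must be proto2 for `extend`:

```proto
// elastic_graph/options.proto — illustrative only.
syntax = "proto2";

package elastic_graph;

import "google/protobuf/descriptor.proto";

extend google.protobuf.FieldOptions {
  optional string name_in_index = 50000;
}
```

A proto3 `schema.proto` could then annotate fields like so:

```proto
// schema.proto — illustrative annotated field.
syntax = "proto3";

import "elastic_graph/options.proto";

message Widget {
  string size = 3 [(elastic_graph.name_in_index) = "size_in_mm"];
}
```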
Custom options have zero wire format impact — serialized bytes are identical with or without annotations. Consumers who don't import `elastic_graph/options.proto` can use the generated message code normally and never see the annotations. The EG indexer reads the option via Ruby's descriptor API at runtime.

Tradeoffs:

- `schema.proto` contains the `import` and visible `[(elastic_graph)...]` annotations, exposing internal field names in the source file even though consumers aren't functionally affected
- Requires the `google-protobuf` gem >= 23.x (older versions silently drop custom options from the runtime descriptor)
- `Options.proto` must use `proto2` syntax for the `extend` keyword; consuming files remain `proto3`
- EG-specific annotations appear in the `.proto` file, even though they don't affect the wire format or consumer code generation

Option C: Separate Public and Internal Proto Files
Generate two `.proto` files with identical field numbers but different field names:

- `schema.proto` (public): Uses GraphQL field names. Given to publishers and external consumers.
- `schema_internal.proto` (private): Uses `nameInIndex` as field names. Used by the EG indexer for direct deserialization into index-ready records.

Since proto uses field numbers on the wire (not names), both files produce identical serialized bytes. Publishers generate code from the public file; the indexer generates code from the internal file.
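A minimal sketch of the two-file idea, reusing the earlier illustrative `Widget` fields:

```proto
// schema.proto (public): GraphQL-facing names — illustrative.
syntax = "proto3";

message Widget {
  string id = 1;
  string size = 3;
}
```

and its internal counterpart, identical on the wire:

```proto
// schema_internal.proto (private): nameInIndex names, same field numbers.
syntax = "proto3";

message Widget {
  string id = 1;
  string size_in_mm = 3;
}
```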
This can also be combined with Option B: the internal file carries custom option annotations (e.g., for additional EG metadata) while the public file remains completely clean. This gives the indexer both the direct index-ready field names and extensible metadata, without any of it leaking into the public contract.
- The indexer generates code from the internal `.proto` and uses it directly, with no runtime name translation
- If combined with Option B, the custom-options requirement is narrower (`google-protobuf` >= 23.x only for the indexer, not publishers)
- Tooling (e.g., `buf` breaking checks) needs to know which file to validate

Recommendation
We lean toward Option A (YAML sidecar) because it keeps the public proto artifact completely clean while reusing an internal artifact that already exists for field number stability. It also mirrors the existing JSON Schema pattern where `nameInIndex` lives in internal artifacts only. However, we'd welcome input — particularly on whether proto ecosystem integration (Option B), the simplicity of direct code generation from separate files (Option C), or the hybrid approach (Option C + B) is more valuable for your use case.

Breaking Change Detection
While protobuf's field-number system handles most schema evolution natively, some changes (e.g., changing a field's type incompatibly) can still break consumers. We integrate buf into the artifact dump flow:
- During `schema_artifacts:dump`, generate the new `.proto` files
- Run `buf breaking` against the previously dumped `.proto`
- Keep versioned copies of the previously dumped `.proto` (analogous to `json_schemas_by_version/`)

`buf` is a prerequisite tool (like `protoc`), not a gem dependency, to keep the dependency footprint light.
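As a rough sketch of how the check could be wired into the dump task (paths and task wiring are illustrative; `buf breaking --against` is the relevant buf command):

```ruby
# Illustrative only: shell out to buf after dumping the new artifacts.
new_protos      = "config/schema/artifacts"                       # assumed path
previous_protos = "config/schema/artifacts/proto_by_version/v1"   # assumed path

unless system("buf", "breaking", new_protos, "--against", previous_protos)
  abort "buf detected a breaking proto change; bump the schema version or revert."
end
```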
Phase 2: Proto-Based Ingestion as an Alternative Path

This phase adds protobuf as an alternative ingestion path into ElasticGraph's indexer, running alongside the existing JSON Schema pipeline. The goal is to allow publishers that produce protobuf events over Kafka to send those events directly to ElasticGraph without converting to JSON first — the same events that other consumers in the organization are already reading.
A key design goal is maximizing consumer reach: other consumers should be able to read the domain events as standard proto messages without needing any knowledge of ElasticGraph's internal conventions.
Current Architecture
One common deployment pattern ingests events via SQS → Lambda:
Events arrive as JSON Lines in SQS message bodies. The `SqsProcessor` extracts SQS system attributes (`SentTimestamp`, `messageId`) and parses each JSON line as an `ElasticGraphEventEnvelope` containing `op`, `type`, `id`, `version`, and `record`.
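For reference, a single envelope line in today's JSON format looks roughly like this (values invented; the real envelope may carry additional fields):

```json
{ "op": "upsert", "type": "Widget", "id": "widget-123", "version": 7, "record": { "id": "widget-123", "name": "Thingamajig", "size": 3 } }
```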
Note: The core `elasticgraph-indexer` architecture is transport-agnostic — it exposes a Ruby object for ingestion. SQS/Lambda is one deployment pattern; others (HTTP, Kafka/Pulsar consumers, etc.) are equally supported.

The proto ingestion design must work with this architecture while also being transport-agnostic — supporting Kafka, Pulsar, SQS, and potentially other transports without requiring payload changes.
Event Metadata: Transport-Level Attributes vs Proto Envelope
ElasticGraph requires metadata alongside each domain event for indexing: the operation type (`op`), the record type (`type`), a unique identifier (`id`), and a monotonic version for conflict resolution (`version`). The central design question is where this metadata lives.

We evaluated three approaches:
Option A: Transport-level attributes for metadata, raw proto message as payload
The payload is the domain proto message itself — nothing wrapping it. ElasticGraph-specific metadata travels as transport-level key-value pairs: Kafka record headers, Pulsar message properties, or SQS message attributes. All transports support this natively.
On Kafka, headers are `byte[]` key-value pairs on each record. On Pulsar, properties are `String → String` key-value pairs on each message.
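Conceptually, a single Kafka record (and, equivalently, a Pulsar message) would look like this — the attribute names follow the `eg_*` convention documented further down, and the payload is the raw proto bytes:

```text
headers/properties:
  eg_op:      "upsert"
  eg_type:    "Widget"
  eg_id:      "widget-123"
  eg_version: "7"
key:   "widget-123"            # partition key (optional, illustrative)
value: <serialized Widget proto bytes>
```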
On SQS, message attributes are typed key-value pairs on each message:
{ "messageAttributes": { "eg_op": { "stringValue": "upsert", "dataType": "String" }, "eg_type": { "stringValue": "Widget", "dataType": "String" }, "eg_id": { "stringValue": "widget-123", "dataType": "String" }, "eg_version": { "stringValue": "7", "dataType": "String" } }, "body": "<base64-encoded Widget proto bytes>" }On any other transport (HTTP, gRPC, AMQP, etc.), the same pattern applies: metadata goes in the transport's native header/attribute mechanism, and the body is the raw proto bytes.
- The payload is a real domain proto message rather than an opaque `bytes` blob. The registry can track and validate it.
- Consumers can route and filter on `eg_type` without touching the proto bytes.

Tradeoff: Metadata does not travel inside the serialized payload. If the message is extracted from its transport context (e.g., written to a file, stored in a database blob), the metadata is lost unless explicitly preserved. This approach also does not support batching multiple events into a single transport message.
Option B: `oneof` wrapper with all indexed types

- Every consumer must deserialize the `ElasticGraphEventEnvelope` wrapper and check the `oneof` case, even if they only care about `Widget`.
- Consumers need language-specific `oneof` handling (Scala pattern matching, Go switch).
- Adding a new `oneof` variant is backward-compatible — old consumers that don't recognize the new variant see the `oneof` as "not set" and skip it. No coordinated rollout needed.

Tradeoff: Forces every consumer to couple to an ElasticGraph-specific envelope schema, which undermines the goal of making events consumable via the company's standard format.
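For reference, a minimal sketch of the envelope shape this option implies (message layout and field numbers are assumptions; `Component` is a second hypothetical indexed type):

```proto
// Illustrative envelope with a oneof over all indexed types.
// Widget and Component are the generated domain messages, defined elsewhere.
syntax = "proto3";

message ElasticGraphEventEnvelope {
  string op = 1;       // "upsert" (or "delete" in the future)
  string id = 2;
  uint64 version = 3;
  oneof record {
    Widget widget = 4;
    Component component = 5;
  }
}
```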
Option C: Proto envelope with `bytes record`

- Consumers must deserialize the `bytes` field using `type` as a hint.
- To schema registries and generic tooling, the `bytes` field is opaque.

Tradeoff: Loses the primary benefit of protobuf (type safety and schema validation) for the actual domain data. The research consensus is to avoid this pattern in favor of `google.protobuf.Any` at minimum, or transport-level metadata for best results.

Recommendation: Support Both Approaches
After discussion, we recommend supporting both transport-level attributes and envelope-wrapped records:
- Raw: the domain proto message as the payload, with metadata in transport-level attributes (as in Option A).
- Envelope: an `ElasticGraphEventEnvelope` proto with a `oneof record` containing all indexed types.

The raw approach maximizes consumer reach (other consumers see standard proto messages with no EG-specific wrapper). The envelope approach supports efficient batching and preserves metadata alongside the payload.
The ingestion format is explicitly configured (e.g., `ingestion_format: :envelope` vs `:raw`) rather than auto-detected, to avoid ambiguity.

Pulsar compatibility confirmed: Pulsar has first-class support for message-level properties (`String → String` key-value pairs), functionally equivalent to Kafka headers.

The metadata attributes follow a simple, documented convention (with the ability to override the attribute names):
eg_op"upsert"(or"delete"in the future)eg_type"Widget")eg_ideg_versionThese attributes are carried as:
- Kafka record headers (`byte[]` key-value pairs, UTF-8 encoded)
- Pulsar message properties (`String → String` key-value pairs — no duplicate keys)
- SQS message attributes (typed key-value pairs)
Any transport that supports key-value metadata alongside a binary payload can carry these events. The `eg_` prefix avoids collisions with other headers/attributes.

How Proto Deserialization Maps to `RecordPreparer`

The current
`RecordPreparer` does three things that proto handles differently:

- Field name translation (`nameInIndex` mapping): the chosen option above supplies `nameInIndex` for each field. The public `schema.proto` uses only the GraphQL field names that publishers see.
- Scalar conversion: expressed in the proto type mapping (`DateTime` → `google.protobuf.Timestamp`, `JsonSafeLong` → `int64`, `Untyped` → `string` containing canonical JSON).

`google.protobuf.Struct` was considered for `Untyped` but rejected because it collapses all numbers to `double`, losing the integer/float distinction that `UntypedEncoder` preserves — since `Untyped` fields are stored as `keyword` in the datastore, `"3"` and `"3.0"` are distinct values and conflating them would break equality filters. A full scalar mapping table will be documented in the implementation PR.

Schema Evolution with Proto
With protobuf, schema evolution is governed by the proto3 compatibility rules:
- Some type changes are wire-compatible (e.g., `int32` ↔ `int64` is safe)

For proto-based publishers, this means no coordination with ElasticGraph's `json_schema_version` mechanism is needed. The `proto_field_numbers.yaml` file is the sole source of truth for wire format stability on the proto side.
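As an illustration of these rules, here is a hypothetical evolution of the earlier sketched `Widget` message (the change shown is invented):

```proto
syntax = "proto3";

import "google/protobuf/timestamp.proto";

// `size` (number 3) was removed, so its number is reserved and never reused;
// `color` is added with the next unused number. Both changes are wire-compatible.
message Widget {
  reserved 3;
  string id = 1;
  string name = 2;
  google.protobuf.Timestamp created_at = 4;
  string color = 5;
}
```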
json_schema_versionmechanism is needed. Theproto_field_numbers.yamlfile is the sole source of truth for wire format stability on the proto side.Edge Cases
Several important edge cases:
- Proto enum values share a namespace with sibling enums, so two enums defining the same value (e.g., `TOAST`) would collide. The generator prevents this by prefixing all enum values with the `UPPER_SNAKE_CASE` of the enum type name (e.g., `BREAD_TYPE_TOAST`), following the official proto style guide and `buf lint` default rules (`ENUM_VALUE_PREFIX`).
- Nested lists (e.g., `[[String]]`) generate synthetic wrapper messages like `MatrixValuesListLevel1`.
- Names that collide with proto keywords get a `_` suffix.
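For instance, a GraphQL enum `BreadType` with a `TOAST` value would come out roughly as follows (values beyond `TOAST` are invented for this sketch):

```proto
// Illustrative enum output: zero value first, values prefixed with the
// UPPER_SNAKE_CASE type name to avoid sibling-enum collisions.
enum BreadType {
  BREAD_TYPE_UNSPECIFIED = 0;
  BREAD_TYPE_TOAST = 1;
  BREAD_TYPE_RYE = 2;
}
```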
Optional: JSON Schema Replacement Mode

For deployments that have fully migrated all publishers to protobuf and no longer need JSON Schema artifacts, the gem also supports
`replace_json_schemas: true`. This strips JSON Schema artifacts from the dump output and bypasses the JSON Schema version-bump checks. This is entirely opt-in and is not the expected usage.

Benefits
- Code generation for publishers in any language from the generated `.proto` file

Alternatives Considered
Maintain Separate Proto Schemas Manually
We considered hand-writing `.proto` files independently from ElasticGraph's schema definition. This was rejected because:

- Every schema change would have to be manually mirrored in the `.proto` file, creating maintenance burden and drift risk

Apache Avro
Avro was considered as another binary serialization format with good Kafka ecosystem support. It was not chosen because:
FlatBuffers / Cap'n Proto
These zero-copy serialization formats offer even better performance, but were not chosen because:
Feedback Wanted
If you have interest in this feature, please review the proposal and let me know what you think! In particular:
- Do you have other consumers of the `.proto` schema, and any specific features needed for them?
- Which `nameInIndex` mapping option (A: YAML sidecar, B: custom proto options, C: separate files, or C+B hybrid) best fits your use case?