
Split batch on "request_size is too large" errors #3673

Open

tonyhb wants to merge 1 commit into google:master from tonyhb:feat/split-batch-if-too-large

Conversation


@tonyhb tonyhb commented Mar 18, 2026

We currently estimate the size of the batch and limit batch sizes to <= 10MB for GCP PubSub (#3005). This estimate can be wrong (by e.g. ~30KB).

In these cases, we don't want data loss. Instead, if the batch is too large, we split it in half and send both halves. This obviously assumes the estimate isn't off by a factor of 2, which we've never seen in prod.

Also, ideally we wouldn't make the original request at all, and would instead inspect the size of the protobuf before sending it over the wire. That requires changes to Google Cloud's apiv1 driver, so it's out of scope for gocloud.dev. Instead, we deal with the failing outbound request and then attempt to recover.

edit:

This is specifically for GCP Pub/Sub. If sending the second half fails, we'll attempt to retransmit the entire batch, leading to duplicate messages for the first half. Given GCP Pub/Sub is at-least-once, I actually don't care about that: I'd much rather have duplicate messages than data loss, since we already know we may get dupes.

tonyhb (Author) commented Mar 18, 2026

For clarity, this is an example of a production error encountered using v0.40.0 of this library:

pubsub (code=InvalidArgument): rpc error: code = InvalidArgument desc = The value for request_size is too large. You passed 10036929 in the request, but the maximum value is 10000000..

This PR should fix that issue.

// estimates can be off, and some production use cases can have ~30KB data over the 10MB limit
// when sending batches.
//
// in this case, if we ever get "message too large" errors, split the batch in half and send


Do we have any way to test the splitting logic?


@BrunoScheufler BrunoScheufler left a comment


Looks good. Wondering if we can test the splitting logic itself to make sure there's no weird index magic going on and we accidentally drop events because of it.
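The review question above, whether the index math can drop events, lends itself to a property-style check: for every batch length, the two halves must re-concatenate to the original batch. splitBatch here is a hypothetical helper mirroring the halving step, not the PR's code.

```go
package main

import "fmt"

// splitBatch halves a batch at the midpoint, the same index math the
// reviewers want covered by tests.
func splitBatch(msgs []int) ([]int, []int) {
	mid := len(msgs) / 2
	return msgs[:mid], msgs[mid:]
}

func main() {
	// Property: for every length (including 0 and 1), concatenating the
	// halves reproduces the original batch, so nothing is dropped,
	// duplicated, or reordered.
	for n := 0; n <= 5; n++ {
		msgs := make([]int, n)
		for i := range msgs {
			msgs[i] = i
		}
		a, b := splitBatch(msgs)
		if len(a)+len(b) != n {
			panic("split dropped or duplicated messages")
		}
		joined := append(append([]int{}, a...), b...)
		for i, v := range joined {
			if v != i {
				panic("split reordered messages")
			}
		}
	}
	fmt.Println("ok")
}
```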

vangent (Contributor) commented Mar 18, 2026

  1. Have you considered moving to gcppubsubv2?
  2. Maybe instead we can just change the MaxBatchByteSize to 8MB instead of 9MB?

tonyhb (Author) commented Mar 18, 2026

We can definitely lower the max batch size, and yes, I don't know how I missed v2, thanks. Want me to change this PR to decrease the max size instead?

vangent (Contributor) commented Mar 18, 2026

Want me to change this PR to decrease the max size instead?

Sure. The intent was that this cap would mean we ~never hit the server's limit, so if you're hitting it, lowering the cap a bit is probably the right thing to do, and it's simpler than retrying after a split.

I don't recall for certain, but v2 may not have this cap, so it may not be an issue at all if you switch.
