Split batch on "request_size is too large" errors #3673
tonyhb wants to merge 1 commit into google:master
Conversation
We currently estimate the size of the batch and limit batch sizes to <= 10MB for GCP PubSub (google#3005). This estimate can be wrong (by e.g. ~30KB). In these cases, we don't want data loss. Instead, if the batch is too large, we split the batch in half and send both halves. This obviously assumes the estimate isn't off by a factor of 2, which we've never seen in prod. Ideally, we wouldn't make the original request at all and would instead inspect the size of the protobuf before sending it over the wire, but that requires changes to Google Cloud's apiv1 driver, so it's out of scope for gocloud.dev. Instead, we deal with the failing outbound request and then attempt to recover.
For clarity, this is an example of a production error encountered using v0.40.0 of this library:
This PR should fix that issue.
// estimates can be off, and some production use cases can have ~30KB of data over the 10MB limit
// when sending batches.
//
// in this case, if we ever get "message too large" errors, split the batch in half and send
Do we have any way to test the splitting logic?
BrunoScheufler left a comment
Looks good. Wondering if we can test the splitting logic itself to make sure there's no weird index magic going on and that we don't accidentally drop events because of it.
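One way to address this review comment would be a small check that the split indices never drop or reorder anything, for both odd and even batch sizes. This is a hedged sketch: `splitBatch` is a hypothetical helper mirroring the split-in-half index arithmetic, not a function from the PR.

```go
package main

import "fmt"

// splitBatch mirrors the split-in-half index arithmetic: the first
// half gets msgs[:mid], the second gets msgs[mid:].
func splitBatch(msgs []int) ([]int, []int) {
	mid := len(msgs) / 2
	return msgs[:mid], msgs[mid:]
}

func main() {
	// Exercise odd and even batch sizes, including the degenerate n=1.
	for n := 1; n <= 5; n++ {
		msgs := make([]int, n)
		for i := range msgs {
			msgs[i] = i
		}
		a, b := splitBatch(msgs)
		if len(a)+len(b) != n {
			panic("split dropped messages")
		}
		// Recombine and verify order and contents survive the split.
		combined := append(append([]int{}, a...), b...)
		for i, v := range combined {
			if v != i {
				panic("split reordered or lost a message")
			}
		}
	}
	fmt.Println("ok")
}
```

Because `mid = len/2` and the two slices are `[:mid]` and `[mid:]`, coverage of the index space is exact by construction; the loop above just makes that explicit for small n.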
We can definitely lower the max batch size, and yes, I don't know how I missed v2. Thanks. Want me to decrease the max size in this PR?
Sure. The intent was that this cap would mean we ~never hit the batch size limit, so if you're hitting it, lowering it a bit is probably the right thing to do, and is simpler than retrying after a split. I don't recall for sure, but I believe v2 doesn't have this cap, so it may not be an issue at all if you switch.
edit:
This is specifically for GCP Pub/Sub. If sending the second half fails, we'll attempt to retransmit the entire batch, leading to duplicate messages for the first half. Given GCP Pub/Sub is at-least-once, I actually don't care about that: I'd much rather have duplicate messages than data loss, since we already know we may get dupes.