pkg/chipingress: batch client with metrics and request splitting#2058
pkg/chipingress: batch client with metrics and request splitting#2058pkcll wants to merge 1 commit into
Conversation
✅ API Diff Results -
|
9e14f91 to
73fc379
Compare
73fc379 to
b54eb82
Compare
hendoxc
left a comment
There was a problem hiding this comment.
can we collapse send_requests_total (does it need to be prefixed like chip_ingress_batch_client_send_requests_total) and add a status label, so we have a single metric for failed and success sends
There was a problem hiding this comment.
Pull request overview
This PR enhances the pkg/chipingress/batch client to be more production-ready by adding OpenTelemetry metrics, enforcing/splitting by gRPC request size, and improving shutdown behavior by closing the underlying chipingress.Client and using a standalone shutdown timeout context.
Changes:
- Added OTel metrics for batch send counts, failures, sizes, latency, and a config info gauge.
- Implemented request splitting/rejection based on serialized
PublishBatchrequest size, plus regression tests. - Updated shutdown to close the underlying chip ingress client and to use a standalone timeout context during drain.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/chipingress/go.mod | Adds direct OTel deps needed for new metrics + metric test utilities. |
| pkg/chipingress/client_test.go | Adds a Close() test for the chip ingress client. |
| pkg/chipingress/batch/client.go | Adds metrics, request-size splitting/rejection, and updates shutdown/close behavior. |
| pkg/chipingress/batch/client_test.go | Adds metric tests, oversize/splitting regression tests, and Stop() close expectations. |
Comments suppressed due to low confidence (1)
pkg/chipingress/batch/client.go:346
- The docstring for
WithMaxGRPCRequestSizesays it’s only used for metric comparison attributes, but it now also drives request splitting and oversize rejection insendBatch. Please update the comment to reflect the behavioral impact so callers don’t treat it as metrics-only.
// WithMaxGRPCRequestSize sets the max gRPC request size in bytes used for metric comparison attributes.
func WithMaxGRPCRequestSize(maxReqSize int) Opt {
return func(c *Client) {
c.maxGRPCRequestSize = maxReqSize
}
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| client: client, | ||
| log: zap.NewNop().Sugar(), | ||
| batchSize: 10, | ||
| maxGRPCRequestSize: 10 * 1024 * 1024, // Match chipingress maxMessageSize default. |
There was a problem hiding this comment.
Updated misleading comment
| if err := b.client.Close(); err != nil { | ||
| b.log.Warnw("failed to close chip ingress client", "error", err) |
There was a problem hiding this comment.
it doesn't need a nil guard
| var batches [][]*messageWithCallback | ||
| current := make([]*messageWithCallback, 0, len(messages)) | ||
| for _, msg := range messages { | ||
| candidate := append(current, msg) | ||
| _, candidateBytes := newBatchRequest(candidate) | ||
| if len(current) > 0 && candidateBytes > maxRequestSize { | ||
| batches = append(batches, current) | ||
| current = []*messageWithCallback{msg} | ||
| continue | ||
| } | ||
| current = candidate | ||
| } |
| b.shutdownOnce.Do(func() { | ||
| ctx, cancel := b.stopCh.CtxWithTimeout(b.shutdownTimeout) | ||
| // Use a standalone timeout context so the shutdown wait isn't cancelled | ||
| // by close(b.stopCh) below. | ||
| ctx, cancel := context.WithTimeout(context.Background(), b.shutdownTimeout) | ||
| defer cancel() | ||
|
|
||
| if b.cancelBatcher != nil { |
Introduce ChipIngressBatchEmitterService backed by chipingress batch client, managed as a sub-service of the beholder Client via services.Engine. Refactor DualSourceEmitter to delegate fire-and-forget to ChipIngressEmitter. Pass explicit loggers throughout instead of creating new ones internally. Add batch emitter config fields, feature flag, tests, and benchmarks. Update pkg/loop server and config to propagate batch emitter settings. Depends on #2058 (pkg/chipingress batch client metrics).
b54eb82 to
044d3dd
Compare
712898c to
09be58d
Compare
Add observability metrics to the batch client using OpenTelemetry: - send_requests_total (counter with status=success|failure attribute) - request_size_messages (histogram with batch_size attribute) - request_size_bytes (histogram with max_grpc_request_size_bytes attribute) - request_latency_ms (histogram with status attribute) - config.info (gauge recording batch configuration at startup) Shutdown improvements: - Close the underlying chipingress.Client in Stop() - Use a standalone timeout context for the shutdown drain so it is not cancelled prematurely by close(stopCh) - Remove closeOnce guard from client.Close (shutdownOnce already serialises) Bug fixes: - Pass caller-provided ctx to recordConfig instead of context.Background() - Remove redundant send_failures_total counter; send_requests_total with its status label already captures failure counts - Remove misleading comment on maxGRPCRequestSize default (10MB matches the chip-ingress server MaxRecvMsgSize, not the client-side 16MB maxMessageSize used for MaxCallRecvMsgSize) Test improvements: - Replace nil client in 17 test call sites with mocks.NewClient(t); nil is not a valid chipingress.Client and would panic on Stop() - Add WithMaxGRPCRequestSize option and oversize-event metrics test - Add config.info gauge assertion
09be58d to
fddcb86
Compare
Summary
Add OTel metrics to the batch client, fix shutdown behavior, and harden tests.
Metrics
chip_ingress.batch.send_requests_total— counter withstatus=success|failureattributechip_ingress.batch.request_size_messages— histogram withbatch_sizeattributechip_ingress.batch.request_size_bytes— histogram withmax_grpc_request_size_bytesattributechip_ingress.batch.request_latency_ms— histogram withstatusattributechip_ingress.batch.config.info— gauge recording batch configuration at startupShutdown improvements
chipingress.ClientinStop()close(stopCh)closeOnceguard fromclient.Close(shutdownOncealready serialises)Bug fixes
ctxtorecordConfiginstead ofcontext.Background()send_failures_totalcounter;send_requests_totalwith its status label already captures failure countsmaxGRPCRequestSizedefault (10MB matches the chip-ingress serverMaxRecvMsgSize, not the client-side 16MBmaxMessageSizeused forMaxCallRecvMsgSize)Test improvements
nilclient in 17 test call sites withmocks.NewClient(t)—nilis not a validchipingress.Clientand would panic onStop()WithMaxGRPCRequestSizeoption and oversize-event metrics testconfig.infogauge assertion