Skip to content

pkg/chipingress: batch client with metrics and request splitting#2058

Open
pkcll wants to merge 1 commit into
mainfrom
infoplat-3436-chipingress-batching-part-1
Open

pkg/chipingress: batch client with metrics and request splitting#2058
pkcll wants to merge 1 commit into
mainfrom
infoplat-3436-chipingress-batching-part-1

Conversation

@pkcll
Copy link
Copy Markdown
Contributor

@pkcll pkcll commented May 14, 2026

Summary

Add OTel metrics to the batch client, fix shutdown behavior, and harden tests.

Metrics

  • chip_ingress.batch.send_requests_total — counter with status=success|failure attribute
  • chip_ingress.batch.request_size_messages — histogram with batch_size attribute
  • chip_ingress.batch.request_size_bytes — histogram with max_grpc_request_size_bytes attribute
  • chip_ingress.batch.request_latency_ms — histogram with status attribute
  • chip_ingress.batch.config.info — gauge recording batch configuration at startup

Shutdown improvements

  • Close the underlying chipingress.Client in Stop()
  • Use a standalone timeout context for the shutdown drain so it is not cancelled prematurely by close(stopCh)
  • Remove closeOnce guard from client.Close (shutdownOnce already serialises)

Bug fixes

  • Pass caller-provided ctx to recordConfig instead of context.Background()
  • Remove redundant send_failures_total counter; send_requests_total with its status label already captures failure counts
  • Remove misleading comment on maxGRPCRequestSize default (10MB matches the chip-ingress server MaxRecvMsgSize, not the client-side 16MB maxMessageSize used for MaxCallRecvMsgSize)

Test improvements

  • Replace nil client in 17 test call sites with mocks.NewClient(t)nil is not a valid chipingress.Client and would panic on Stop()
  • Add WithMaxGRPCRequestSize option and oversize-event metrics test
  • Add config.info gauge assertion

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

✅ API Diff Results - github.com/smartcontractkit/chainlink-common/pkg/chipingress

✅ Compatible Changes (1)

batch (1)
  • WithMaxGRPCRequestSize — ➕ Added

📄 View full apidiff report

@pkcll pkcll force-pushed the infoplat-3436-chipingress-batching-part-1 branch 3 times, most recently from 9e14f91 to 73fc379 Compare May 14, 2026 17:32
@pkcll pkcll changed the title pkg/chipingress: add batch client metrics, close underlying client on Stop pkg/chipingress: batch client with metrics and request splitting May 14, 2026
@pkcll pkcll force-pushed the infoplat-3436-chipingress-batching-part-1 branch from 73fc379 to b54eb82 Compare May 14, 2026 17:34
@pkcll pkcll marked this pull request as ready for review May 14, 2026 17:35
@pkcll pkcll requested a review from a team as a code owner May 14, 2026 17:35
Copilot AI review requested due to automatic review settings May 14, 2026 17:35
@pkcll pkcll requested review from hendoxc and jmank88 May 14, 2026 17:38
Copy link
Copy Markdown
Contributor

@hendoxc hendoxc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we collapse send_requests_total (does it need to be prefixed like chip_ingress_batch_client_send_requests_total) and add a status label, so we have a single metric for failed and success sends

Comment thread pkg/chipingress/batch/client.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the pkg/chipingress/batch client to be more production-ready by adding OpenTelemetry metrics, enforcing/splitting by gRPC request size, and improving shutdown behavior by closing the underlying chipingress.Client and using a standalone shutdown timeout context.

Changes:

  • Added OTel metrics for batch send counts, failures, sizes, latency, and a config info gauge.
  • Implemented request splitting/rejection based on serialized PublishBatch request size, plus regression tests.
  • Updated shutdown to close the underlying chip ingress client and to use a standalone timeout context during drain.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
pkg/chipingress/go.mod Adds direct OTel deps needed for new metrics + metric test utilities.
pkg/chipingress/client_test.go Adds a Close() test for the chip ingress client.
pkg/chipingress/batch/client.go Adds metrics, request-size splitting/rejection, and updates shutdown/close behavior.
pkg/chipingress/batch/client_test.go Adds metric tests, oversize/splitting regression tests, and Stop() close expectations.
Comments suppressed due to low confidence (1)

pkg/chipingress/batch/client.go:346

  • The docstring for WithMaxGRPCRequestSize says it’s only used for metric comparison attributes, but it now also drives request splitting and oversize rejection in sendBatch. Please update the comment to reflect the behavioral impact so callers don’t treat it as metrics-only.
// WithMaxGRPCRequestSize sets the max gRPC request size in bytes used for metric comparison attributes.
func WithMaxGRPCRequestSize(maxReqSize int) Opt {
	return func(c *Client) {
		c.maxGRPCRequestSize = maxReqSize
	}
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/chipingress/batch/client.go Outdated
client: client,
log: zap.NewNop().Sugar(),
batchSize: 10,
maxGRPCRequestSize: 10 * 1024 * 1024, // Match chipingress maxMessageSize default.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated misleading comment

Comment on lines +173 to +174
if err := b.client.Close(); err != nil {
b.log.Warnw("failed to close chip ingress client", "error", err)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't need a nil guard

Comment on lines +307 to +318
var batches [][]*messageWithCallback
current := make([]*messageWithCallback, 0, len(messages))
for _, msg := range messages {
candidate := append(current, msg)
_, candidateBytes := newBatchRequest(candidate)
if len(current) > 0 && candidateBytes > maxRequestSize {
batches = append(batches, current)
current = []*messageWithCallback{msg}
continue
}
current = candidate
}
Comment thread pkg/chipingress/batch/client.go Outdated
Comment on lines 141 to 147
b.shutdownOnce.Do(func() {
ctx, cancel := b.stopCh.CtxWithTimeout(b.shutdownTimeout)
// Use a standalone timeout context so the shutdown wait isn't cancelled
// by close(b.stopCh) below.
ctx, cancel := context.WithTimeout(context.Background(), b.shutdownTimeout)
defer cancel()

if b.cancelBatcher != nil {
Copy link
Copy Markdown
Contributor Author

@pkcll pkcll May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

pkcll added a commit that referenced this pull request May 14, 2026
Introduce ChipIngressBatchEmitterService backed by chipingress batch client,
managed as a sub-service of the beholder Client via services.Engine. Refactor
DualSourceEmitter to delegate fire-and-forget to ChipIngressEmitter. Pass
explicit loggers throughout instead of creating new ones internally. Add
batch emitter config fields, feature flag, tests, and benchmarks.

Update pkg/loop server and config to propagate batch emitter settings.

Depends on #2058 (pkg/chipingress batch client metrics).
@pkcll pkcll force-pushed the infoplat-3436-chipingress-batching-part-1 branch from b54eb82 to 044d3dd Compare May 14, 2026 17:54
jmank88
jmank88 previously approved these changes May 14, 2026
@pkcll pkcll force-pushed the infoplat-3436-chipingress-batching-part-1 branch from 712898c to 09be58d Compare May 14, 2026 21:35
Add observability metrics to the batch client using OpenTelemetry:
- send_requests_total (counter with status=success|failure attribute)
- request_size_messages (histogram with batch_size attribute)
- request_size_bytes (histogram with max_grpc_request_size_bytes attribute)
- request_latency_ms (histogram with status attribute)
- config.info (gauge recording batch configuration at startup)

Shutdown improvements:
- Close the underlying chipingress.Client in Stop()
- Use a standalone timeout context for the shutdown drain so it is not
  cancelled prematurely by close(stopCh)
- Remove closeOnce guard from client.Close (shutdownOnce already
  serialises)

Bug fixes:
- Pass caller-provided ctx to recordConfig instead of
  context.Background()
- Remove redundant send_failures_total counter; send_requests_total
  with its status label already captures failure counts
- Remove misleading comment on maxGRPCRequestSize default (10MB
  matches the chip-ingress server MaxRecvMsgSize, not the client-side
  16MB maxMessageSize used for MaxCallRecvMsgSize)

Test improvements:
- Replace nil client in 17 test call sites with mocks.NewClient(t);
  nil is not a valid chipingress.Client and would panic on Stop()
- Add WithMaxGRPCRequestSize option and oversize-event metrics test
- Add config.info gauge assertion
@pkcll pkcll force-pushed the infoplat-3436-chipingress-batching-part-1 branch from 09be58d to fddcb86 Compare May 14, 2026 21:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants