Skip to content

Dicedb Opentelemetry integration with prometheus scraping#1462

Closed
gsarmaonline wants to merge 9 commits intodicedb:masterfrom
gsarmaonline:gsarma/opentel-prometheus
Closed

Dicedb Opentelemetry integration with prometheus scraping#1462
gsarmaonline wants to merge 9 commits intodicedb:masterfrom
gsarmaonline:gsarma/opentel-prometheus

Conversation

@gsarmaonline
Copy link
Copy Markdown
Contributor

@gsarmaonline gsarmaonline commented Feb 7, 2025

PR - #1135

First PR for integrating prometheus scraping using Opentelemetry exporter.

DiceDB Observability with Opentelemetry (Otel)

This module implements the observability layer required for DiceDB to emit metrics.
The current implementation integrates Opentelemetry with Prometheus in scraping mode.
The current Prometheus exporter also exposes a HTTP client which Prometheus can leverage
to scrape data from.

Refer to the README.md file in the PR for more information.

@gsarmaonline
Copy link
Copy Markdown
Contributor Author

cc @JyotinderSingh @arpitbbhayani

@gsarmaonline gsarmaonline force-pushed the gsarma/opentel-prometheus branch from ebc770b to c45d01b Compare February 8, 2025 06:27
Comment on lines +24 to +32
The observability module should be available from all other modules. There should always
be one instance of the observability module running.
To use any metric in the repo, first register it with the `diceotel` module [here](metrics.go).
```go
func (dotel *DiceOtel) register() (err error) {
DiceStartCounter, _ = dotel.Meter.Int64Counter("dicedb_start", api.WithDescription("A counter for the start of the DiceDB server"))
// Add other metrics here
return
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THis could just be a global variable that gets initialized during the bootup.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics variable should be available globally but they should also be defined only after the DiceotelSrv object is defined.
I am thinking we can follow this approach:

  • Expose the DiceotelSrv globally to the dicedb package
  • Initialise the metrics inside the diceotel object and register them when the diceotel object is being initialised

This allows us to ensure that the list of global variables don't start piling up for metrics and can be accessed
through the DiceotelSrv global reference.

Comment thread internal/diceotel/README.md Outdated

Access it anywhere in the repo and call the required functions:
```go
diceotel.DiceStartCounter.Add(ctx, 1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not keen on prefixing Dice for all metrics.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, removing it.

Comment thread internal/diceotel/README.md Outdated
be one instance of the observability module running.
To use any metric in the repo, first register it with the `diceotel` module [here](metrics.go).
```go
func (dotel *DiceOtel) register() (err error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a huge fan of classes and objects here given this can just be a global variable.

Comment on lines +20 to +30
type (
DiceOtel struct {
ctx context.Context

Meter api.Meter
}
)

var (
DiceotelSrv *DiceOtel
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is typical java. I do not see a value add. If we need it, we can always add it later.

Comment thread internal/diceotel/diceotel.go Outdated
Comment on lines +36 to +41
if err = dotel.setup(); err != nil {
return
}
if err = dotel.register(); err != nil {
return
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't both of these be in one function, why have two unnecessarity.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

// The exporter embeds a default OpenTelemetry Reader and
// implements prometheus.Collector, allowing it to be used as
// both a Reader and Collector.
exporter, err := prometheus.New()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we taking a strong dependency on prometheus here? given this is otel, this should be pluggable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case there is a requirement to move the metrics from one provider to another, changing the Exporters should be sufficient without having to changing the existing application code where the metrics are emitted.

So the prometheus exporter is pluggable and can be changed by adding another exporter type

Comment thread internal/server/ironhawk/iothread.go Outdated
_c.ClientID = t.ClientID
_c.Mode = t.Mode

execStartTime := time.Now()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we call this variable __telXXXTime, this way we can follow the same variable name through the codebase. Calling this execStartTime here and something else at another place will create confusion. Open for other shorter variable names.

We can also go for the convention that __telXXXTime is all things telemetry. So, any and every variable that starts with this can be easily identified as a telemetry variable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Comment thread internal/server/ironhawk/iothread.go Outdated
Comment on lines +54 to +62
diceotel.DiceCmdLatencyInMsHistogram.Record(
ctx,
time.Since(execStartTime).Milliseconds(),
metric.WithAttributeSet(
attribute.NewSet(
attribute.String("cmd", c.Cmd),
),
),
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a great UX as a database engineer. Liked it.

Concerns are only around the naming convention. Two things

  1. We should also add a non-docker tutorial on seeing these metrics in action, checked in the repository.
  2. why do we have histogram in the name? this seems pretty weird. The metric registering should be about what we are registering and not how it would be displayed or processed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For point 1, yes, let me add the non-docker tutorial as well for this.

For point 2, defining a metric as a histogram also adds more percentile wise information as well.
For example, the above command is represented in the following buckets:

➜  dice git:(gsarma/opentel-prometheus) curl http://localhost:8050/metrics | grep latency
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9966    0  9966    0     0  14.0M      0 --:--:-- --:--:-- --:--:-- 9732k
# HELP dicedb_command_exec_latency_milliseconds 
# TYPE dicedb_command_exec_latency_milliseconds histogram
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="0"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="5"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="10"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="25"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="50"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="75"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="100"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="250"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="500"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="750"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="1000"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="2500"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="5000"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="7500"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="10000"} 1
dicedb_command_exec_latency_milliseconds_bucket{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version="",le="+Inf"} 1
dicedb_command_exec_latency_milliseconds_sum{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version=""} 0
dicedb_command_exec_latency_milliseconds_count{cmd="GET",otel_scope_name="dicedb-otel",otel_scope_version=""} 1

So we primarily have counter, gauge and histograms as different metric types.

@gsarmaonline
Copy link
Copy Markdown
Contributor Author

@arpitbbhayani Thanks for the thorough review here.
I have tried to answer most of your comments. Please take a look whenever convenient.

@gsarmaonline
Copy link
Copy Markdown
Contributor Author

@arpitbbhayani I have updated the PR with the discussed changes. PTAL. Thanks!

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 20, 2025

CLA assistant check
All committers have signed the CLA.

@arpitbbhayani arpitbbhayani deleted the branch dicedb:master February 9, 2026 19:01
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants