[n8n] Fix metric mappings and add full v2 metric coverage #23635
base: master
@@ -2,15 +2,15 @@
## Overview

This check monitors [n8n][1] through the Datadog Agent.
Collect n8n metrics including:

- Cache metrics: hit, miss, and update counts.
- Workflow metrics: started, success, and failed counters; audit workflow lifecycle counters; and, in n8n 2.x, an execution-duration histogram.
- Node metrics: per-node started and finished counters emitted by worker processes in queue mode.
- Queue metrics: queue depth; enqueued, dequeued, completed, failed, and stalled counters; and scaling-mode worker gauges.
- HTTP metrics: request duration histograms tagged with status code.
- Process and Node.js runtime metrics.
## Setup
@@ -40,13 +40,79 @@

```bash
N8N_METRICS_INCLUDE_CACHE_METRICS=true
N8N_METRICS_INCLUDE_MESSAGE_EVENT_BUS_METRICS=true
N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true
N8N_METRICS_INCLUDE_API_ENDPOINTS=true
N8N_METRICS_INCLUDE_QUEUE_METRICS=true

# Optional: n8n 2.x adds workflow_statistics gauges (workflows, users, executions, ...) - opt in
N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS=true

# Optional: Customize the metric prefix (default is 'n8n_')
N8N_METRICS_PREFIX=n8n_
```
For more details, see the n8n documentation on [enabling Prometheus metrics][10].
If you change `N8N_METRICS_PREFIX` from its default of `n8n_`, you **must** also set `raw_metric_prefix` in the integration's `conf.yaml` to the same value. Otherwise the check will not recognize the exposed metric names and will silently submit nothing:
```yaml
instances:
  - openmetrics_endpoint: http://localhost:5678/metrics
    raw_metric_prefix: my_custom_prefix_
```
#### Event-driven counters
Most n8n counters are registered dynamically the first time their underlying event fires. The integration ships mappings for around 70 of these event-bus counters, including:
- Workflow lifecycle: `n8n.workflow.started.count`, `n8n.workflow.success.count`, `n8n.workflow.failed.count`, `n8n.workflow.cancelled.count`
- Audit (workflow, user, credentials, package, variable, execution data): `n8n.audit.workflow.executed.count`, `n8n.audit.user.login.success.count`, `n8n.audit.user.credentials.created.count`, and similar
- AI nodes: `n8n.ai.tool.called.count`, `n8n.ai.llm.generated.count`, `n8n.ai.vector.store.searched.count`, and similar
- Runner, queue, and node lifecycle: `n8n.runner.task.requested.count`, `n8n.queue.job.completed.count`, `n8n.node.started.count`, `n8n.node.finished.count`
These counters do not appear on the `/metrics` endpoint until the corresponding event has occurred. A healthy idle deployment will not produce data points for them until that activity fires. The complete list is in [`metadata.csv`][7].
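To see which of these counters a given deployment has emitted so far, you can scrape the endpoint and compare names. A minimal sketch, assuming the raw Prometheus names follow n8n's `n8n_*_total` convention (the sample payload below is fabricated for illustration, not real n8n output):

```python
# Sketch: report which event-driven counters a /metrics payload exposes so far.
# The expected names assume n8n's raw `n8n_*_total` naming convention.
EXPECTED = [
    "n8n_workflow_started_total",
    "n8n_workflow_failed_total",
    "n8n_queue_job_completed_total",
]


def present_counters(exposition: str, names: list[str]) -> dict[str, bool]:
    # A metric family counts as present when any sample line starts with its name.
    exposed = set()
    for line in exposition.splitlines():
        if line and not line.startswith("#"):
            exposed.add(line.split("{", 1)[0].split(" ", 1)[0])
    return {name: name in exposed for name in names}


# Hypothetical scrape output from an instance that has run two workflows:
sample = """\
# TYPE n8n_workflow_started_total counter
n8n_workflow_started_total 2
n8n_workflow_failed_total{workflow_id="abc"} 1
"""
print(present_counters(sample, EXPECTED))
```

In a real deployment you would feed the function the body of an HTTP GET against the `/metrics` endpoint instead of the fabricated sample.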
If a future n8n release exposes a new event-driven counter that is not yet covered by this integration, add it to the `extra_metrics` option in your instance configuration:
```yaml
instances:
  - openmetrics_endpoint: http://n8n:5678/metrics
    extra_metrics:
      - some_new_n8n_event_total: some.new.n8n.event
```
The left-hand side is the Prometheus counter name as n8n exposes it (keep the `_total` suffix); the right-hand side is the dotted Datadog metric name to submit it as.
#### Queue mode and workers
In queue mode, n8n runs separate worker processes that execute jobs picked up from a Redis-backed queue. Each worker exposes its own `/metrics` endpoint and emits a different subset of metrics than the main process. Worker-observed metrics include `n8n.queue.job.dequeued.count`, `n8n.queue.job.stalled.count`, `n8n.node.started.count`, `n8n.node.finished.count`, and `n8n.runner.task.requested.count`. Main-only metrics include `n8n.instance.role.leader` and the `n8n.scaling.mode.queue.jobs.*` family.
To expose worker metrics, set `QUEUE_HEALTH_CHECK_ACTIVE=true` and `QUEUE_HEALTH_CHECK_PORT=<port>` on each worker. **In n8n 2.x, port `5679` is reserved for the task runner broker, so pick a different port (for example `5680`).**
For full coverage in queue deployments, configure one Datadog instance per n8n process exposing `/metrics`, including main and worker processes:
```yaml
instances:
  - openmetrics_endpoint: http://n8n-main:5678/metrics
  - openmetrics_endpoint: http://n8n-worker:5680/metrics
```
#### Version-specific metrics
Several metric families were introduced in n8n 2.x and are not emitted on n8n 1.x:
- `n8n.workflow.execution.duration.seconds.*` (histogram). Gated by `N8N_METRICS_INCLUDE_WORKFLOW_EXECUTION_DURATION`, which defaults to `true` in n8n 2.x.
- `n8n.audit.workflow.activated.count`, `n8n.audit.workflow.deactivated.count`, `n8n.audit.workflow.executed.count`, `n8n.audit.workflow.resumed.count`, `n8n.audit.workflow.version.updated.count`, and `n8n.audit.workflow.waiting.count`
- `n8n.embed.login.requests.count` (tagged with `result:success` or `result:failure`), `n8n.embed.login.failures.count` (tagged with `reason`)
- `n8n.token.exchange.requests.count` (tagged with `result:success` or `result:failure`), `n8n.token.exchange.failures.count` (tagged with `reason`), `n8n.token.exchange.identity.linked.count`, `n8n.token.exchange.jit.provisioning.count`
- `n8n.process.pss.bytes` (Linux only)
- The `n8n.{production,manual,production.root}.executions`, `n8n.users.total`, `n8n.enabled.users`, `n8n.workflows.total`, and `n8n.credentials.total` family. Only emitted when `N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS=true` is set.
- The `n8n.expression.*` family (`evaluation.duration.seconds`, `code.cache.{hit,miss,eviction,size}`, `pool.{acquired,replenish.failed,scaled.up,scaled.to.zero}`). Only emitted when n8n is running the new VM-isolated expression engine *and* observability for it is on. Set `N8N_EXPRESSION_ENGINE=vm` and `N8N_EXPRESSION_ENGINE_OBSERVABILITY_ENABLED=true` on the n8n process; both default to off (the engine defaults to `legacy`). These metrics surface the per-expression evaluation latency, the compiled-expression LRU cache hit and miss rates, and the V8-isolate pool's idle scaling behavior. They are most useful for troubleshooting workflow latency that traces back to slow `{{ ... }}` evaluation.
Some metrics only emit samples after the corresponding runtime event occurs. For example, failures-only counters (`*.failures.count`) need an authentication failure, audit workflow counters need the matching workflow state transition, and the libuv `n8n.nodejs.active.requests` gauge needs an in-flight libuv request. A healthy idle deployment may not produce data points for these metrics until that activity occurs.
#### Tag cardinality
When `N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true`, HTTP and workflow execution histograms are tagged with `workflow_id` (and similar labels for nodes). On deployments with many distinct workflows or nodes, this can produce high-cardinality metrics. Drop the label with `exclude_labels` or omit `N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL` to keep tag cardinality bounded.
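For example, to keep the histograms but drop the per-workflow tag, a sketch using the OpenMetrics `exclude_labels` option (the endpoint URL is an assumption):

```yaml
instances:
  - openmetrics_endpoint: http://localhost:5678/metrics
    # Drop the high-cardinality label before submission
    exclude_labels:
      - workflow_id
```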
#### Configure the Datadog Agent
1. Edit the `n8n.d/conf.yaml` file in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your n8n performance data. See the [sample n8n.d/conf.yaml][4] for all available configuration options.
@@ -59,27 +125,32 @@ _Available for Agent versions >6.0_
#### Enable n8n logging
Configure n8n application logs by setting the following environment variables:
```bash
# Set the log level (error, warn, info, debug)
N8N_LOG_LEVEL=info

# Output application logs to console or file
N8N_LOG_OUTPUT=console

# Use JSON formatting so Datadog can parse n8n application log attributes
N8N_LOG_FORMAT=json

# If using file output, specify the application log file location
N8N_LOG_FILE_LOCATION=/var/log/n8n/n8n.log
```
#### Structured event logs
n8n also writes structured event bus logs to `n8nEventLog*.log`. These logs contain workflow, node, queue, runner, and audit events and are separate from the application logs controlled by `N8N_LOG_OUTPUT` and `N8N_LOG_FILE_LOCATION`.
By default, event bus log files are written under the n8n user folder, for example:

- Host installations: `~/.n8n/n8nEventLog*.log`
- Official Docker image: `/home/node/.n8n/n8nEventLog*.log`

If you use a custom n8n user folder, collect the event bus logs from that folder instead. If you customize the event bus log file base name with `N8N_EVENTBUS_LOGWRITER_LOGBASENAME`, update the Datadog log path to match.
The event log includes the following event types:
@@ -102,32 +173,46 @@ Each event contains rich metadata including `executionId`, `workflowId`, `workfl

```yaml
logs_enabled: true
```
2. Add log collection entries to your `n8n.d/conf.yaml` file.
   For a host-based n8n installation where the Agent can read local files, collect the application log file and the event bus log files:
```yaml
logs:
  - type: file
    path: /var/log/n8n/*.log
    source: n8n
    service: <SERVICE>
  - type: file
    path: /home/n8n/.n8n/n8nEventLog*.log
    source: n8n
    service: <SERVICE>
```
Adjust `/home/n8n/.n8n/n8nEventLog*.log` to the n8n user folder on your host.
For a containerized n8n deployment, collect stdout and stderr from the n8n container for application logs, and make the n8n user folder available to the Agent for event bus file logs. For example, if the n8n data directory is mounted on the host at `/var/lib/n8n`, configure:
```yaml
logs:
  - type: docker
    source: n8n
    service: <SERVICE>
  - type: file
    path: /var/lib/n8n/n8nEventLog*.log
    source: n8n
    service: <SERVICE>
```
If the Agent runs in a container, mount the n8n data volume or host directory into the Agent container and use the path as seen from inside the Agent container.
3. [Restart the Agent][5].
### Validation
[Run the Agent's status subcommand][6] and look for `n8n` under the Checks section.
## Data collected
### Metrics
@@ -137,7 +222,7 @@ See [metadata.csv][7] for a list of metrics provided by this integration.
The n8n integration does not include any events.
### Service checks
See [service_checks.json][8] for a list of service checks provided by this integration.
@@ -0,0 +1,6 @@
Improve the n8n metric coverage:

- Correct missing or incorrect metrics.
- Add metrics introduced in n8n 2.x (workflow execution duration, audit events, authentication, workflow and user statistics, expression engine, and process memory).
- Track n8n's dynamic events (workflow cancellations, audit activity, AI nodes, user and credential changes, package and variable changes).
- Add support for monitoring n8n worker processes alongside the main process.
@@ -2,58 +2,55 @@
```python
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)

from functools import cached_property
from typing import Any
from urllib.parse import urljoin, urlparse

from requests.exceptions import RequestException

from datadog_checks.base import OpenMetricsBaseCheckV2
from datadog_checks.n8n.metrics import METRIC_MAP, RENAME_LABELS_MAP

from .config_models import ConfigMixin

DEFAULT_READY_PATH = '/healthz/readiness'


class N8nCheck(OpenMetricsBaseCheckV2, ConfigMixin):
    __NAMESPACE__ = 'n8n'
    DEFAULT_METRIC_LIMIT = 0

    def __init__(self, name, init_config, instances=None):
        super(N8nCheck, self).__init__(
            name,
            init_config,
            instances,
        )

    def get_default_config(self) -> dict[str, Any]:
        return {
            'metrics': [METRIC_MAP],
            'rename_labels': RENAME_LABELS_MAP,
            'raw_metric_prefix': 'n8n_',
        }

    @cached_property
    def _readiness_endpoint(self) -> str:
        parsed = urlparse(self.config.openmetrics_endpoint)
        base = f'{parsed.scheme}://{parsed.netloc}'
        return urljoin(base, DEFAULT_READY_PATH)

    def _check_n8n_readiness(self) -> None:
        endpoint = self._readiness_endpoint
        tags = list(self.config.tags or ())

        try:
            response = self.http.get(endpoint)
        except RequestException as e:
            self.log.warning("Could not reach n8n readiness endpoint %s: %s", endpoint, e)
            self.gauge('readiness.check', 0, tags=tags + ['status_code:none'])
            return

        is_ready = 200 <= response.status_code < 300
        self.gauge(
            'readiness.check',
            1 if is_ready else 0,
            tags=tags + [f'status_code:{response.status_code}'],
        )

    def check(self, instance: dict[str, Any]) -> None:
        self._check_n8n_readiness()
        super().check(instance)
```

Review thread on the `except RequestException` branch:

**Member:** nit: could be good to add the status_code when it's available (HTTP error).

**Author:** it is, isn't it? Or am I misunderstanding your suggestion?

```python
is_ready = response.status_code == 200
self.gauge(
    'readiness.check',
    1 if is_ready else 0,
    tags=tags + [f'status_code:{response.status_code}'],
)
```

**Member:** I'm suggesting we set it on the failure path, inside the:

```python
except RequestException as e:
    self.log.warning("Could not reach n8n readiness endpoint %s: %s", endpoint, e)
    self.gauge('readiness.check', 0, tags=tags + ['status_code:none'])
```

**Author:** Aaah, ok. When there is a [...] Any other error (non 2xx) goes through the other branch where we add the code in the tag. I checked, just in case the wrapper was doing the [...]. The failure path here carries no [...]

**Member:** Ah I see, I missed that!

**Author:** Sounds good, updated now.
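The readiness URL derivation keeps only the scheme and host of the configured OpenMetrics endpoint before joining the readiness path, so any path on the metrics URL is discarded. A standalone sketch of that logic (the example URLs are assumptions):

```python
from urllib.parse import urljoin, urlparse

DEFAULT_READY_PATH = '/healthz/readiness'


def readiness_endpoint(openmetrics_endpoint: str) -> str:
    # Rebuild scheme://host so any path on the metrics endpoint is discarded.
    parsed = urlparse(openmetrics_endpoint)
    base = f'{parsed.scheme}://{parsed.netloc}'
    return urljoin(base, DEFAULT_READY_PATH)


print(readiness_endpoint('http://localhost:5678/metrics'))
# http://localhost:5678/healthz/readiness
```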
Review thread on the `## Data collected` and `### Service checks` heading changes:

**Author:** This is actually a fix; the docs guidelines say that the capitalization here should be as the one I modified it to. I got a similar comment in my READMEs for KrakenD. Unless it has been modified and the docs are not up to date. The content support skill still mentions this guideline.

**Reviewer:** In that case, we'd need to fix it in the template and for all integrations. This is what all the integration READMEs are following, and that's the way it's displayed in the public integration docs. Not sure if we do some preprocessing on the docs side (can take a look later).

**Reply:** We do, until this I didn't realize the template was doing this.

**Reply:** Alright, so I guess we could handle it separately.