Description
Code: HAMi-WebUI/server/internal/exporter/exporter.go, line 85 at 358165f:

```go
for _, device := range deviceInfos {
```
Currently, the GenerateDeviceMetrics and GenerateContainerMetrics functions in the exporter iterate through devices and containers serially. When querying device status (e.g., via DCGM or other provider APIs), each operation involves I/O latency.
In clusters with a large number of devices (e.g., 100+ GPUs), or when individual device queries are slow, the total scrape duration accumulates linearly (O(N)), potentially leading to timeouts (e.g., exceeding the default Prometheus scrape_timeout of 10s). User reports indicate scrape times reaching 4-5 seconds in a 20-node environment.
Additionally, the GenerateMetrics function lacks synchronization, which can lead to race conditions if multiple Prometheus scrapes occur simultaneously or if a scrape occurs while metrics are being reset/populated.
Proposed Changes
- Parallelization: Refactor the device iteration loops in GenerateDeviceMetrics and GenerateContainerMetrics to use goroutines and sync.WaitGroup. This allows device metrics to be collected concurrently, reducing the total scrape time from the sum of all device latencies to roughly the latency of the slowest single device (effectively O(1) in the number of devices).
- Concurrency Control: Introduce a sync.Mutex in MetricsGenerator to lock the critical section of the scrape cycle (Reset -> Collect -> Cache). This prevents data races and ensures that concurrent scrape requests wait for the ongoing collection to complete (or hit the cache) rather than corrupting the data.
Benefits
- Significantly reduced /metrics response time, especially for nodes with many devices.
- Improved stability and prevention of data race issues during concurrent scrapes.
- Better scalability for large-scale AI clusters.