Description
Code: HAMi-WebUI/server/internal/exporter/exporter.go, line 85 at 358165f:

```go
for _, device := range deviceInfos {
```
Currently, the GenerateDeviceMetrics and GenerateContainerMetrics functions in the exporter iterate through devices and containers serially. When querying device status (e.g., via DCGM or other provider APIs), each operation involves I/O latency.
In clusters with a large number of devices (e.g., 100+ GPUs), or when individual device queries are slow, the total scrape duration accumulates linearly (O(N)), potentially leading to timeouts (e.g., exceeding the default Prometheus scrape_timeout of 10s). User reports indicate scrape times reaching 4-5 seconds in a 20-node environment.
Additionally, the GenerateMetrics function lacks synchronization, which can lead to race conditions if multiple Prometheus scrapes occur simultaneously or if a scrape occurs while metrics are being reset/populated.
Proposed Changes
- Parallelization: Refactor the device iteration loops in GenerateDeviceMetrics and GenerateContainerMetrics to use goroutines and sync.WaitGroup. This allows device metrics to be collected concurrently, reducing the total scrape time from the sum of all device latencies to roughly the latency of the slowest single device (effectively O(1) in the number of devices).
- Concurrency Control: Introduce a sync.Mutex in MetricsGenerator to lock the critical section of the scrape cycle (Reset -> Collect -> Cache). This prevents data races and ensures that concurrent scrape requests wait for the ongoing collection to complete (or hit the cache) rather than corrupting the data.
Benefits
- Significantly reduced /metrics response time, especially for nodes with many devices.
- Improved stability and prevention of data race issues during concurrent scrapes.
- Better scalability for large-scale AI clusters.