NVIDIA Data Center GPU Manager (DCGM)

Vendor

NVIDIA

Product details

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies, including power and clock management. DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency.

Features

  • GPU Diagnostics and System Validation: Effectively identify failures, performance degradations, power inefficiencies, and their root causes.
  • GPU Telemetry: Gather a rich set of GPU telemetry to explain job behavior, identify opportunities to drive utilization and efficiencies, and determine root causes of potential application performance issues.
  • Active GPU Health Monitoring: Use low-overhead, non-invasive health monitoring while jobs run without impacting application behavior and performance.
  • Integration with Management Ecosystem: Easily deploy a DCGM-based monitoring solution in a Kubernetes cluster environment. DCGM also integrates out of the box with ISV solutions such as Bright Cluster Manager and IBM Spectrum LSF, and with open-source tools such as Prometheus.

Benefits

  • Enhanced Performance and Reliability: Proactively identify potential problems and optimize GPU performance to maintain the efficiency and reliability of data center operations.
  • Simplified Administration: Automate administrative tasks and improve resource reliability and uptime.
  • Comprehensive Monitoring: Provide real-time monitoring and alerting for GPU metrics and health data, ensuring a comprehensive overview of the GPU cluster's status.
  • Scalability: Supports Linux operating systems on x86_64 and aarch64 (SBSA) platforms, and integrates with the Kubernetes ecosystem through DCGM-Exporter.
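
To make the Prometheus and DCGM-Exporter integration mentioned above more concrete, here is a minimal Python sketch of how a script might read GPU telemetry in the Prometheus text format that DCGM-Exporter serves (typically at a /metrics HTTP endpoint). The metric names DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_GPU_TEMP are DCGM field names; everything else is an assumption for illustration: the sample values, UUIDs, hostname, and exact label set are made up, and the helper names parse_metrics and gpus_over are hypothetical, not part of DCGM.

```python
import re

# Illustrative sample only: values, UUIDs, and hostname are invented,
# and a real exporter may attach a different set of labels.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa",Hostname="node-a"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb",Hostname="node-a"} 12
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaaa",Hostname="node-a"} 71
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-bbbb",Hostname="node-a"} 44
"""

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
# One label pair inside the braces: key="value".
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def gpus_over(samples, metric, threshold):
    """Return sorted GPU indices whose value for `metric` exceeds `threshold`."""
    return sorted(
        labels.get("gpu", "?")
        for name, labels, value in samples
        if name == metric and value > threshold
    )


if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE_METRICS)
    print("GPUs above 80% utilization:", gpus_over(parsed, "DCGM_FI_DEV_GPU_UTIL", 80))
    print("GPUs above 70 C:", gpus_over(parsed, "DCGM_FI_DEV_GPU_TEMP", 70))
```

In a real deployment, Prometheus itself would normally scrape and store these metrics; a hand-rolled parser like this is mainly useful for quick checks or for understanding what the exporter emits.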