NVIDIA Collective Communications Library (NCCL)

Vendor: NVIDIA

Product details

The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive, that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networking across nodes. Leading deep learning frameworks such as Caffe2, Chainer, MXNet, PyTorch, and TensorFlow have integrated NCCL to accelerate deep learning training on multi-GPU, multi-node systems.
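To illustrate the C API, the minimal sketch below all-reduces a buffer across every GPU visible to a single process. The element count and float data type are illustrative choices, and error checking is omitted for brevity.

```c
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);          /* use every GPU visible to this process */
  const size_t count = 1 << 20;       /* illustrative element count */

  ncclComm_t   *comms   = malloc(ndev * sizeof(ncclComm_t));
  cudaStream_t *streams = malloc(ndev * sizeof(cudaStream_t));
  float       **sendbuf = malloc(ndev * sizeof(float *));
  float       **recvbuf = malloc(ndev * sizeof(float *));

  /* Allocate one buffer pair and one stream per device. */
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  /* Create one communicator per device in a single call (devices 0..ndev-1). */
  ncclCommInitAll(comms, ndev, NULL);

  /* Sum sendbuf across all GPUs; every GPU receives the result in recvbuf.
     Group calls are required when one thread drives multiple communicators. */
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  /* Wait for completion, then clean up. */
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  free(comms); free(streams); free(sendbuf); free(recvbuf);
  return 0;
}
```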

Features

  • Automatic Topology Detection: Detects high bandwidth paths on AMD, ARM, PCI Gen4, and IB HDR.
  • High Performance: Achieves up to 2x peak bandwidth with in-network all-reduce operations using SHARPv2.
  • Graph Search: Optimizes the set of rings and trees for highest bandwidth and lowest latency.
  • Multi-Threaded and Multi-Process Support: Compatible with single-threaded, multi-threaded, and multi-process applications (a multi-process initialization sketch follows this list).
  • InfiniBand and Networking Support: Supports InfiniBand verbs, libfabric, RoCE, and IP Socket internode communication.
  • Adaptive Routing: Reroutes traffic to alleviate congested ports with InfiniBand Adaptive routing.
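In the multi-process mode referenced above, rank 0 creates an ncclUniqueId, distributes it out-of-band, and every rank then joins the same communicator. The sketch below assumes one GPU per rank and uses MPI for the id exchange; both are illustrative choices rather than NCCL requirements, and error checking is omitted.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(int argc, char *argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  /* One GPU per process; mapping rank to local device index is an assumption. */
  cudaSetDevice(rank);

  /* Rank 0 creates the unique id; all ranks must receive the same value,
     so it is broadcast out-of-band (here via MPI). */
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  /* Every rank joins the same communicator. */
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  /* ... issue NCCL collectives and point-to-point operations on comm ... */

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```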

Benefits

  • Ease of Programming: Uses a simple C API, accessible from various programming languages, and closely follows the popular collectives API defined by MPI.
  • Compatibility: Compatible with virtually any multi-GPU parallelization model.
  • Performance Optimization: Removes the need for developers to optimize their applications for specific machines, providing fast collectives over multiple GPUs both within and across nodes.
  • Scalability: Massively scales deep learning training with optimized communication primitives.