
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive, that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networking across nodes. Leading deep learning frameworks such as Caffe2, Chainer, MXNet, PyTorch, and TensorFlow have integrated NCCL to accelerate deep learning training on multi-GPU, multi-node systems.
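As a sketch of what the API looks like in practice, the following single-process program performs a sum all-reduce across every local GPU, following the pattern shown in NCCL's documentation: one communicator per device created with `ncclCommInitAll`, and the per-GPU calls wrapped in `ncclGroupStart`/`ncclGroupEnd` so one thread can drive all devices. The buffer size (1M floats) is an arbitrary choice for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); \
  if (r != ncclSuccess) { \
    fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r)); exit(1); \
  } } while (0)

int main(void) {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);

  // One communicator per local GPU, all created by this single process.
  // Passing NULL as the device list uses devices 0..nDev-1.
  ncclComm_t *comms = (ncclComm_t *)malloc(nDev * sizeof(ncclComm_t));
  CHECK_NCCL(ncclCommInitAll(comms, nDev, NULL));

  size_t count = 1 << 20;  // 1M floats per GPU (arbitrary example size)
  float **sendbuf = (float **)malloc(nDev * sizeof(float *));
  float **recvbuf = (float **)malloc(nDev * sizeof(float *));
  cudaStream_t *streams = (cudaStream_t *)malloc(nDev * sizeof(cudaStream_t));
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Group the per-device collectives so a single thread can launch them
  // all without deadlocking; NCCL starts them at ncclGroupEnd.
  CHECK_NCCL(ncclGroupStart());
  for (int i = 0; i < nDev; ++i)
    CHECK_NCCL(ncclAllReduce(sendbuf[i], recvbuf[i], count,
                             ncclFloat, ncclSum, comms[i], streams[i]));
  CHECK_NCCL(ncclGroupEnd());

  // Collectives are asynchronous: wait for completion on each stream.
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }

  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  return 0;
}
```

After `ncclGroupEnd` returns and the streams synchronize, each GPU's `recvbuf` holds the element-wise sum of every GPU's `sendbuf`.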
Features
- Automatic Topology Detection: Detects high-bandwidth paths on AMD, ARM, PCIe Gen4, and IB HDR platforms.
- High Performance: Achieves up to 2x peak bandwidth with in-network all-reduce operations utilizing SHARPv2.
- Graph Search: Optimizes the set of rings and trees for highest bandwidth and lowest latency.
- Multi-Threaded and Multi-Process Support: Compatible with single-threaded, multi-threaded, and multi-process applications.
- InfiniBand and Networking Support: Supports InfiniBand verbs, libfabric, RoCE, and IP Socket internode communication.
- Adaptive Routing: Reroutes traffic to alleviate congested ports with InfiniBand adaptive routing.
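For the multi-process case (one rank per GPU, possibly spanning nodes), the usual pattern from NCCL's documentation is that rank 0 creates an `ncclUniqueId` and distributes it out-of-band before every rank calls `ncclCommInitRank`. The sketch below assumes MPI as the out-of-band transport and an illustrative 8 GPUs per node; any mechanism that delivers the id to all ranks works.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

// One MPI rank per GPU; rank 0 creates the NCCL id and MPI broadcasts it.
int main(int argc, char *argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // The unique id must be identical on every rank; it is opaque bytes,
  // so broadcasting it as MPI_BYTE is sufficient.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  cudaSetDevice(rank % 8);  // assumption: up to 8 GPUs per node
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  // ... issue collectives (ncclAllReduce, ncclBroadcast, ...) on comm ...

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```

The same `ncclCommInitRank` call works whether the ranks are threads in one process or processes on different nodes, which is what the multi-threaded and multi-process support above refers to.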
Benefits
- Ease of Programming: Uses a simple C API, accessible from various programming languages, and closely follows the popular collectives API defined by MPI.
- Compatibility: Compatible with virtually any multi-GPU parallelization model.
- Performance Optimization: Removes the need for developers to optimize their applications for specific machines, providing fast collectives over multiple GPUs both within and across nodes.
- Scalability: Massively scales deep learning training with optimized communication primitives.